Key Takeaways
- Inferact aims to build a universal, open-source AI inference layer for faster, cheaper, and more reliable models.
- vLLM originated from UC Berkeley PhD research in 2022 to address unique large language model challenges.
- LLM inference is complicated by dynamic input lengths, stochastic outputs, and increasing model/hardware diversity.
- The vLLM project rapidly grew to over 2,000 GitHub contributors, including major tech and silicon providers.
- Open-source infrastructure is critical for AI, enabling deep stack tuning and fostering innovation.
Deep Dive
- Inferact founders and vLLM creators aim to establish vLLM as a universal, open-source inference engine.
- vLLM originated from co-founder Woosuk Kwon's 2022 PhD research prototype at UC Berkeley, evolving into a widely used open-source runtime.
- The project initially focused on optimizing a demo service for Meta's 175-billion-parameter OPT model.
- This research explored the unique challenges of autoregressive language models, diverging from traditional AI workloads.
- LLM inference workloads are characterized by dynamic input lengths, ranging from single words to hundreds of pages.
- This dynamism contrasts with traditional deep-learning serving, where fixed-size inputs make static batching straightforward.
- Core technical challenges in LLM inference include scheduling and memory management, moving beyond traditional micro-batching.
- The stochastic and continuous nature of LLM output, where the model determines its stopping point, further complicates inference.
- The vLLM open-source project grew from a few graduate students to over 50 regular and 2,000 total contributors on GitHub.
- It is one of the fastest-growing open-source projects, attracting contributors from Meta, Red Hat, NVIDIA, AMD, Google, AWS, and Intel.
- The community addresses the 'M×N' problem — supporting M models across N hardware platforms — by creating a single universal inference layer.
- vLLM's management philosophy includes clear team scopes, objectives, and milestones, drawing lessons from projects like Kubernetes and Linux.
- An inference engine runs pre-trained models on accelerated hardware to efficiently generate outputs like text and images.
- The vLLM architecture includes an API server, a tokenizer, a scheduler for batching requests, and a memory manager for KV cache.
- A dedicated worker component initializes and runs the AI model within the inference engine.
- Operational costs for vLLM are significant, with a CI bill exceeding $100,000 and an annual burn rate of approximately $1 million.
- The difficulty of running inference has significantly increased over the past 1.5 years due to rising model scale, diversity, and agents.
- Models like Mixtral 8x22B have over 100 billion parameters, with multi-trillion parameter open-source models predicted for this year.
- Running larger models necessitates distributing them across multiple GPUs and nodes, posing challenges in sharding and load balancing.
- vLLM is used on an estimated 400,000 to 500,000 GPUs globally, running 24/7 across diverse GPU architectures.
- AI infrastructure is increasingly complex due to diverse models, hardware, deployment scenarios, and emerging agent challenges.
- Supporting agents requires new infrastructure for tool calling and multi-agent interactions, disrupting traditional text-in/text-out paradigms.
- Agent applications involve multi-turn conversations, external tool use, and variable interaction times, from seconds to hours.
- Effective cache management is critical in dynamic agent environments, as unpredictable interactions complicate cache invalidation.
- Inferact's mission is to build a universal, open-source inference layer to make AI models faster and more reliable across any hardware or deployment environment.
- Open-source AI is viewed as critical for the AI infrastructure ecosystem and a 'secret weapon' for community-driven execution.
- The belief in open-source AI emphasizes diversity in models and hardware architectures as key to addressing real-world applications.
- Open source enables enterprises to perform deep stack tuning for specific use cases, unlike limitations with closed-source models.
- Ion Stoica, a co-founder of Databricks and long-time vLLM advisor, is also an Inferact co-founder, advising on open-source adoption and talent.
- Inferact is actively hiring experienced ML infrastructure engineers to optimize hardware like the GB200 NVL72 for giant open-source models.
- The company focuses on building a horizontal abstraction layer for accelerated computing, akin to operating systems and databases.
- This universal layer aims to abstract GPUs and other accelerated computing devices for AI models, improving speed, cost, and reliability.
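The scheduling challenge described above — dynamic input lengths and stochastic, model-determined stopping points — is what continuous batching addresses. The following is a minimal, self-contained sketch of the idea (all names are illustrative; this is not vLLM's actual scheduler): finished requests leave the batch immediately and waiting ones join every step, rather than the whole batch draining together as in static micro-batching.

```python
import random
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt_len: int        # dynamic: from a single word to hundreds of pages
    generated: int = 0
    done: bool = False

def decode_step(batch, eos_prob=0.1):
    """Run one decode step for every request in the batch.

    Each request emits one token; whether it stops is stochastic —
    the model, not the server, decides when to produce EOS.
    """
    for req in batch:
        req.generated += 1
        if random.random() < eos_prob:
            req.done = True

def continuous_batching(waiting, max_batch=4, max_steps=1000):
    """Toy continuous-batching loop: admit and retire requests on
    every step instead of waiting for a fixed batch to drain."""
    batch, finished, steps = [], [], 0
    while (waiting or batch) and steps < max_steps:
        while waiting and len(batch) < max_batch:
            batch.append(waiting.pop(0))      # admit new work each step
        decode_step(batch)
        finished += [r for r in batch if r.done]
        batch = [r for r in batch if not r.done]
        steps += 1
    return finished

random.seed(0)
reqs = [Request(i, prompt_len=random.randint(1, 500)) for i in range(10)]
done = continuous_batching(list(reqs))
print(len(done))  # all 10 requests eventually finish
```

The key property is that a short request is never held hostage by a long one sharing its batch — exactly the failure mode of static batching under dynamic lengths.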
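The memory manager and KV cache mentioned in the architecture bullets can be illustrated with a toy paged allocator in the spirit of vLLM's PagedAttention (class and method names here are invented for illustration): each sequence's cache lives in fixed-size blocks drawn from a shared pool, so memory is not reserved for worst-case sequence lengths.

```python
class PagedKVCache:
    """Toy block-based KV-cache allocator: sequences get cache space
    in fixed-size blocks from a shared free pool, returned on finish."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> number of tokens cached

    def append(self, seq_id):
        """Reserve cache space for one new token; allocate a fresh
        block only when the sequence's last block is full."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:   # last block full (or none yet)
            if not self.free:
                raise MemoryError("KV cache exhausted; preempt a sequence")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):
    cache.append("req-0")              # 6 tokens -> 2 blocks of 4
print(len(cache.tables["req-0"]))      # 2
cache.release("req-0")
print(len(cache.free))                 # 8 (all blocks reclaimed)
```

Block-granular allocation is what lets a scheduler pack many variable-length sequences into fixed GPU memory without fragmentation.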
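The sharding challenge of distributing large models across GPUs can be sketched with column-wise tensor parallelism, shown here with NumPy standing in for real devices (a conceptual sketch, not a distributed implementation): each "GPU" holds a slice of a weight matrix and computes a slice of the output, which is then concatenated — in practice an all-gather over NVLink or the network.

```python
import numpy as np

def shard_columns(weight, num_gpus):
    """Split a weight matrix column-wise across devices."""
    return np.split(weight, num_gpus, axis=1)

def parallel_matmul(x, shards):
    """Each shard's matmul would run on its own device; the partial
    outputs are concatenated to recover the full result."""
    partials = [x @ w for w in shards]
    return np.concatenate(partials, axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16))
w = rng.standard_normal((16, 8))
shards = shard_columns(w, num_gpus=4)

# Sharded result matches the single-device matmul.
assert np.allclose(parallel_matmul(x, shards), x @ w)
```

Real deployments combine this with row-parallel layers and pipeline parallelism across nodes, which is where the load-balancing difficulty mentioned above comes in.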
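The agent cache-management point — multi-turn conversations where reuse is valuable but invalidation is hard — can be sketched as prefix caching keyed by a hash of the token prefix (a hypothetical illustration; real systems cache KV blocks, not lengths, and need eviction policies this sketch omits).

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: an agent turn that extends a previously seen
    conversation prefix reuses that prefix's cached state instead of
    recomputing it. Unpredictable multi-agent interleavings make the
    eviction/invalidation policy the genuinely hard part."""

    def __init__(self):
        self.store = {}  # prefix hash -> cached state (here: just length)

    @staticmethod
    def key(tokens):
        return hashlib.sha256(" ".join(map(str, tokens)).encode()).hexdigest()

    def insert(self, tokens):
        self.store[self.key(tokens)] = len(tokens)

    def lookup(self, tokens):
        """Return the length of the longest cached prefix of `tokens`."""
        for n in range(len(tokens), 0, -1):
            if self.key(tokens[:n]) in self.store:
                return n
        return 0

cache = PrefixCache()
cache.insert([1, 2, 3])               # first turn of a conversation
hit = cache.lookup([1, 2, 3, 4, 5])   # next turn extends the prefix
print(hit)  # 3 tokens of cached state reused
```

In a tool-calling loop, every turn shares the conversation so far as its prefix, so hit rates are high — provided the cache survives the seconds-to-hours gaps between agent interactions.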