Key Takeaways
- Inferact, founded by vLLM creators, aims to build a universal, open-source AI inference layer.
- AI inference has become a complex problem due to dynamic workloads, diverse models, and agentic systems.
- vLLM, an open-source project, provides a universal inference solution, supported by a large community and major tech companies.
- The open-source approach is deemed critical for innovation, enabling developers to build upon common ground and ensure hardware compatibility.
- Optimizing AI performance for enterprise use cases requires control over the entire stack, including specialized hardware and applications.
Deep Dive
- Inferact was founded by vLLM creators Simon Mo and Woosuk Kwon with the mission to build a universal, open-source inference layer.
- This layer aims to make large AI models faster, cheaper, and more reliable across any hardware, model architecture, or deployment environment.
- The founders emphasize the critical role of open source in AI infrastructure, considering vLLM a "secret weapon" for community-driven execution beyond a single entity's capacity.
- Running AI models, particularly Large Language Models (LLMs), presents complex inference challenges due to unpredictable request lengths and concurrent user demands.
- Traditional deep learning workloads, such as image classification, operate on static, uniformly shaped inputs that batch efficiently; LLM inference, by contrast, must handle dynamic input and output lengths.
- Serving systems have accordingly evolved from simple micro-batching for traditional models to advanced scheduling and memory management that handle a continuous flow of variable-length LLM requests with non-deterministic output lengths.
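The scheduling shift described above can be sketched as a continuous batcher: instead of waiting for a whole static batch to finish, the scheduler admits new requests and retires completed ones on every decode step. This is an illustrative toy only; the names `Request` and `ContinuousBatcher` are invented for this example and are not vLLM's actual API.

```python
from collections import deque

class Request:
    """Toy request: tracks how many tokens it has generated so far."""
    def __init__(self, rid, max_new_tokens):
        self.rid = rid
        self.max_new_tokens = max_new_tokens
        self.generated = 0
        self.output = []

class ContinuousBatcher:
    """Continuous batching sketch: free batch slots are refilled from the
    waiting queue at every step, so short requests never block long ones."""
    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.waiting = deque()
        self.running = []

    def submit(self, req):
        self.waiting.append(req)

    def step(self):
        # Admit new requests into any free batch slots.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        # One decode step: every running request emits one token.
        for req in self.running:
            req.output.append(f"tok{req.generated}")
            req.generated += 1
        # Retire finished requests; their slots are reused next step.
        finished = [r for r in self.running if r.generated >= r.max_new_tokens]
        self.running = [r for r in self.running if r.generated < r.max_new_tokens]
        return finished
```

With a batch size of 2 and three requests of different lengths, the slot freed by the shortest request is handed to the waiting request on the very next step, which is the behavior a static batcher cannot provide.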
- The vLLM project grew from a few graduate students to over 50 regular contributors and more than 2,000 total GitHub contributors, making it one of the fastest-growing open-source projects.
- Its community includes contributors from academia, companies like Meta and Red Hat, and support from major hardware providers such as NVIDIA, AMD, Google, AWS, and Intel.
- vLLM's success stems from its role as a universal inference layer, incentivizing contributions from model providers, silicon manufacturers, and infrastructure companies.
- a16z provided early grant funding, which helped cover significant CI/CD costs exceeding $100,000 monthly, contributing to an annual burn rate of approximately $1 million.
- An AI inference engine is a system designed to run a trained model efficiently on accelerated hardware, generating outputs like text and images.
- Key components of a typical inference engine include an API server, a tokenizer, a scheduler for batching requests, a memory manager for KV cache, and a worker that executes the model.
- These components work in concert to optimize for speed and resource utilization during the model execution and input/output processing phases.
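One of those components, the KV-cache memory manager, can be sketched as a block allocator: the cache is carved into fixed-size blocks, and each request holds a block table mapping its tokens to physical blocks. This is a hypothetical illustration of the idea (names like `BlockManager` are invented here, not vLLM's real classes).

```python
class BlockManager:
    """Paged KV-cache sketch: allocate fixed-size blocks per request so
    memory is reused at block granularity rather than reserved for the
    worst-case sequence length."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}  # request id -> list of physical block ids

    def blocks_needed(self, num_tokens):
        # Ceiling division: partial blocks still occupy a whole block.
        return -(-num_tokens // self.block_size)

    def can_allocate(self, num_tokens):
        return len(self.free) >= self.blocks_needed(num_tokens)

    def allocate(self, rid, num_tokens):
        blocks = [self.free.pop() for _ in range(self.blocks_needed(num_tokens))]
        self.tables[rid] = blocks
        return blocks

    def release(self, rid):
        # Finished requests return their blocks to the free pool.
        self.free.extend(self.tables.pop(rid))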
- AI inference complexity is escalating due to model scale, diversity, and the emergence of agents, with some models exceeding a trillion parameters and trending towards multi-trillion sizes.
- The scale of GPU usage for inference is substantial, with vLLM observing 400,000 to 500,000 GPUs running 24/7 across diverse architectures, precluding a one-size-fits-all solution.
- Model diversity further complicates inference, as new open-source models are released frequently, featuring variations in attention mechanisms and memory management strategies.
- vLLM leverages its open-source community and model vendors to implement new operations, like sparse attention, across various environments.
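As a concrete example of such an attention variation, a sliding-window ("sparse") attention restricts each token to the most recent positions instead of the full causal history. The NumPy sketch below is illustrative only and is not drawn from any particular model's implementation.

```python
import numpy as np

def sliding_window_attention(q, k, v, window):
    """Causal attention where position i attends only to positions
    (i - window, i], i.e. itself and the previous window - 1 tokens.
    q, k, v: arrays of shape (seq_len, d)."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    # Sparse causal mask: keep only the last `window` positions.
    mask = (j <= i) & (j > i - window)
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over the unmasked positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the mask zeroes out everything outside the window, the KV cache for positions older than the window can be dropped, which is exactly the kind of memory-management consequence an inference engine must handle per attention variant.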
- AI inference challenges now extend to agentic systems, requiring new infrastructure for tool calling and multi-agent interactions, leading to co-optimization of agent and inference architectures.
- The shift from simple text-in, text-out to complex agentic systems with multi-turn conversations and external tool use necessitates smarter cache management.
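The cache-management idea above can be sketched as block-granular prefix caching: each turn appends to a conversation, and the engine reuses the KV cache for the longest previously seen prefix, recomputing only the new suffix. Names and block size here are invented for illustration; real systems hash token blocks rather than storing tuples.

```python
BLOCK = 4  # illustrative block size in tokens

class PrefixCache:
    """Toy prefix cache: remembers conversation prefixes at block
    boundaries so a follow-up turn only recomputes its new suffix."""
    def __init__(self):
        self.cached = set()

    def insert(self, tokens):
        # Record every block-aligned prefix of this sequence.
        for n in range(BLOCK, len(tokens) + 1, BLOCK):
            self.cached.add(tuple(tokens[:n]))

    def hit_length(self, tokens):
        """Length of the longest cached block-aligned prefix of `tokens`."""
        hit = 0
        for n in range(BLOCK, len(tokens) + 1, BLOCK):
            if tuple(tokens[:n]) in self.cached:
                hit = n
            else:
                break
        return hit
```

In a multi-turn conversation, each turn extends the token sequence, so the hit length grows with the shared history and only the newest turn's tokens pay the prefill cost.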
- The speakers advocate for open-source AI, believing that diversity in models and chip architectures is essential to address the complexity of real-world applications and foster innovation.
- vLLM is widely adopted, powering Amazon's Rufus shopping assistant and cutting-edge features at Character AI, indicating rapid industry adoption of advanced AI.
- Inferact was founded by the vLLM creators to develop a universal, open-source inference layer for efficient AI model deployment across diverse hardware and architectures.
- Ion Stoica, their PhD co-advisor at Berkeley and a co-founder of Databricks, is involved with Inferact, contributing insights into open-source adoption and research trends.
- Inferact focuses on solving challenges in large-scale AI inference, particularly optimizing the use of new hardware like GB200 NVL72 racks for giant open-source models.
- The company is actively hiring experienced ML infrastructure engineers to address these complex problems.
- Inferact aims to build a universal, open-source software layer that abstracts hardware for AI models, likened to the importance of operating systems and database abstractions, to support future AI software on accelerated computing devices.