Key Takeaways
- Inferact, founded by vLLM creators, aims to build a universal, open-source AI inference layer.
- AI inference has become a complex problem due to dynamic workloads, diverse models, and agentic systems.
- vLLM, an open-source project, provides a universal inference solution, supported by a large community and major tech companies.
- The open-source approach is deemed critical for innovation, enabling developers to build upon common ground and ensure hardware compatibility.
- Optimizing AI performance for enterprise use cases requires control over the entire stack, including specialized hardware and applications.
Deep Dive
- Inferact was founded by vLLM creators Simon Mo and Woosuk Kwon with the mission to build a universal, open-source inference layer.
- This layer aims to make large AI models faster, cheaper, and more reliable across any hardware, model architecture, or deployment environment.
- The founders emphasize the critical role of open source in AI infrastructure, considering vLLM a "secret weapon" for community-driven execution beyond a single entity's capacity.
- Running AI models, particularly Large Language Models (LLMs), presents complex inference challenges due to unpredictable request lengths and concurrent user demands.
- Traditional deep learning workloads, such as image classification, operate on static, uniformly shaped inputs that batch efficiently; LLM inference, by contrast, must handle dynamic input and output lengths.
- Serving systems have accordingly evolved from simple micro-batching for traditional models to advanced scheduling and memory management that handle a continuous flow of variable-length LLM requests with non-deterministic output lengths.
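The scheduling shift described above can be sketched as a continuous batcher: instead of waiting for a whole static batch to finish, the scheduler admits new requests and retires completed ones on every decode step. This is an illustrative toy only; the names `Request` and `ContinuousBatcher` are invented for this example and are not vLLM's actual API.

```python
from collections import deque

class Request:
    """Toy request: tracks how many tokens it has generated so far."""
    def __init__(self, rid, max_new_tokens):
        self.rid = rid
        self.max_new_tokens = max_new_tokens
        self.generated = 0
        self.output = []

class ContinuousBatcher:
    """Continuous batching sketch: free batch slots are refilled from the
    waiting queue at every step, so short requests never block long ones."""
    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.waiting = deque()
        self.running = []

    def submit(self, req):
        self.waiting.append(req)

    def step(self):
        # Admit new requests into any free batch slots.
        while self.waiting and len(self.running) < self.max_batch_size:
            self.running.append(self.waiting.popleft())
        # One decode step: every running request emits one token.
        for req in self.running:
            req.output.append(f"tok{req.generated}")
            req.generated += 1
        # Retire finished requests; their slots are reused next step.
        finished = [r for r in self.running if r.generated >= r.max_new_tokens]
        self.running = [r for r in self.running if r.generated < r.max_new_tokens]
        return finished
```

With a batch size of 2 and three requests of different lengths, the slot freed by the shortest request is handed to the waiting request on the very next step, which is the behavior a static batcher cannot provide.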
- The vLLM project grew from a few graduate students to over 50 regular contributors and more than 2,000 total GitHub contributors, making it one of the fastest-growing open-source projects.
- Its community includes contributors from academia, companies like Meta and Red Hat, and support from major hardware providers such as NVIDIA, AMD, Google, AWS, and Intel.
- vLLM's success stems from its role as a universal inference layer, incentivizing contributions from model providers, silicon manufacturers, and infrastructure companies.
- a16z provided early grant funding, which helped cover significant CI/CD costs exceeding $100,000 monthly, contributing to an annual burn rate of approximately $1 million.
- An AI inference engine is a system designed to run a trained model efficiently on accelerated hardware, generating outputs like text and images.
- Key components of a typical inference engine include an API server, a tokenizer, a scheduler for batching requests, a memory manager for KV cache, and a worker that executes the model.
- These components work in concert to optimize for speed and resource utilization during the model execution and input/output processing phases.
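One of those components, the KV-cache memory manager, can be sketched as a block allocator: the cache is carved into fixed-size blocks, and each request holds a block table mapping its tokens to physical blocks. This is a hypothetical illustration of the idea (names like `BlockManager` are invented here, not vLLM's real classes).

```python
class BlockManager:
    """Paged KV-cache sketch: allocate fixed-size blocks per request so
    memory is reused at block granularity rather than reserved for the
    worst-case sequence length."""
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}  # request id -> list of physical block ids

    def blocks_needed(self, num_tokens):
        # Ceiling division: partial blocks still occupy a whole block.
        return -(-num_tokens // self.block_size)

    def can_allocate(self, num_tokens):
        return len(self.free) >= self.blocks_needed(num_tokens)

    def allocate(self, rid, num_tokens):
        blocks = [self.free.pop() for _ in range(self.blocks_needed(num_tokens))]
        self.tables[rid] = blocks
        return blocks

    def release(self, rid):
        # Finished requests return their blocks to the free pool.
        self.free.extend(self.tables.pop(rid))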
- AI inference complexity is escalating due to model scale, diversity, and the emergence of agents, with some models exceeding a trillion parameters and trending towards multi-trillion sizes.
- The scale of GPU usage for inference is substantial, with vLLM observing 400,000 to 500,000 GPUs running 24/7 across diverse architectures, precluding a one-size-fits-all solution.
- Model diversity further complicates inference, as new open-source models are released frequently, featuring variations in attention mechanisms and memory management strategies.
- vLLM leverages its open-source community and model vendors to implement new operations, like sparse attention, across various environments.
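As a concrete example of such an attention variation, a sliding-window ("sparse") attention restricts each token to the most recent positions instead of the full causal history. The NumPy sketch below is illustrative only and is not drawn from any particular model's implementation.

```python
import numpy as np

def sliding_window_attention(q, k, v, window):
    """Causal attention where position i attends only to positions
    (i - window, i], i.e. itself and the previous window - 1 tokens.
    q, k, v: arrays of shape (seq_len, d)."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    # Sparse causal mask: keep only the last `window` positions.
    mask = (j <= i) & (j > i - window)
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over the unmasked positions.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the mask zeroes out everything outside the window, the KV cache for positions older than the window can be dropped, which is exactly the kind of memory-management consequence an inference engine must handle per attention variant.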
- AI inference challenges now extend to agentic systems, requiring new infrastructure for tool calling and multi-agent interactions, leading to co-optimization of agent and inference architectures.
- The shift from simple text-in, text-out to complex agentic systems with multi-turn conversations and external tool use necessitates smarter cache management.
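The cache-management idea above can be sketched as block-granular prefix caching: each turn appends to a conversation, and the engine reuses the KV cache for the longest previously seen prefix, recomputing only the new suffix. Names and block size here are invented for illustration; real systems hash token blocks rather than storing tuples.

```python
BLOCK = 4  # illustrative block size in tokens

class PrefixCache:
    """Toy prefix cache: remembers conversation prefixes at block
    boundaries so a follow-up turn only recomputes its new suffix."""
    def __init__(self):
        self.cached = set()

    def insert(self, tokens):
        # Record every block-aligned prefix of this sequence.
        for n in range(BLOCK, len(tokens) + 1, BLOCK):
            self.cached.add(tuple(tokens[:n]))

    def hit_length(self, tokens):
        """Length of the longest cached block-aligned prefix of `tokens`."""
        hit = 0
        for n in range(BLOCK, len(tokens) + 1, BLOCK):
            if tuple(tokens[:n]) in self.cached:
                hit = n
            else:
                break
        return hit
```

In a multi-turn conversation, each turn extends the token sequence, so the hit length grows with the shared history and only the newest turn's tokens pay the prefill cost.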
- The speakers advocate for open-source AI, believing that diversity in models and chip architectures is essential to address the complexity of real-world applications and foster innovation.
- vLLM is widely adopted, powering Amazon's Rufus shopping assistant and cutting-edge features at Character AI, indicating rapid industry adoption of advanced AI.
- Inferact was founded by the vLLM creators to develop a universal, open-source inference layer for efficient AI model deployment across diverse hardware and architectures.
- Ion Stoica, their PhD co-advisor at Berkeley and a co-founder of Databricks, is involved with Inferact, contributing insights into open-source adoption and research trends.
- Inferact focuses on solving challenges in large-scale AI inference, particularly optimizing the use of new hardware like GB200 NVL72 racks for giant open-source models.
- The company is actively hiring experienced ML infrastructure engineers to address these complex problems.
- Inferact aims to build a universal, open-source software layer that abstracts hardware for AI models, likened to the importance of operating systems and database abstractions, to support future AI software on accelerated computing devices.