Key Takeaways
- Inferact aims to build a universal, open-source AI inference layer for faster, cheaper, and more reliable models.
- vLLM originated from UC Berkeley PhD research in 2022 to address unique large language model challenges.
- LLM inference is complicated by dynamic input lengths, stochastic outputs, and increasing model/hardware diversity.
- The vLLM project rapidly grew to over 2,000 GitHub contributors, including major tech and silicon providers.
- Open-source infrastructure is critical for AI, enabling deep stack tuning and fostering innovation.
Deep Dive
- Inferact founders and vLLM creators aim to establish vLLM as a universal, open-source inference engine.
- vLLM originated from co-founder Woosuk Kwon's 2022 PhD research prototype at UC Berkeley, evolving into a widely used open-source runtime.
- The project initially focused on optimizing a demo service for Meta's 175-billion-parameter OPT model.
- This research explored the unique challenges of autoregressive language models, diverging from traditional AI workloads.
- LLM inference workloads are characterized by dynamic input lengths, ranging from single words to hundreds of pages.
- This dynamism contrasts with traditional deep-learning serving, where fixed-size inputs make static batching straightforward.
- Core technical challenges in LLM inference include scheduling and memory management, moving beyond traditional micro-batching.
- The stochastic and continuous nature of LLM output, where the model determines its stopping point, further complicates inference.
- The vLLM open-source project grew from a few graduate students to over 50 regular and 2,000 total contributors on GitHub.
- It is one of the fastest-growing open-source projects, attracting contributors from Meta, Red Hat, NVIDIA, AMD, Google, AWS, and Intel.
- The community addresses the 'M×N' problem — supporting M models across N hardware platforms — by creating a single universal inference layer.
- vLLM's management philosophy includes clear team scopes, objectives, and milestones, drawing lessons from projects like Kubernetes and Linux.
- An inference engine runs pre-trained models on accelerated hardware to efficiently generate outputs like text and images.
- The vLLM architecture includes an API server, a tokenizer, a scheduler for batching requests, and a memory manager for KV cache.
- A dedicated worker component initializes and runs the AI model within the inference engine.
- Operational costs for vLLM are significant, with a CI bill exceeding $100,000 and an annual burn rate of approximately $1 million.
- The difficulty of running inference has significantly increased over the past 1.5 years due to rising model scale, diversity, and agents.
- Models like Mixtral 8x22B have over 100 billion parameters, with multi-trillion parameter open-source models predicted for this year.
- Running larger models necessitates distributing them across multiple GPUs and nodes, posing challenges in sharding and load balancing.
- vLLM is used on an estimated 400,000 to 500,000 GPUs globally, running 24/7 across diverse GPU architectures.
- AI infrastructure is increasingly complex due to diverse models, hardware, deployment scenarios, and emerging agent challenges.
- Supporting agents requires new infrastructure for tool calling and multi-agent interactions, disrupting traditional text-in/text-out paradigms.
- Agent applications involve multi-turn conversations, external tool use, and variable interaction times, from seconds to hours.
- Effective cache management is critical in dynamic agent environments, as unpredictable interactions complicate cache invalidation.
- Inferact's mission is to build a universal, open-source inference layer to make AI models faster and more reliable across any hardware or deployment environment.
- Open-source AI is viewed as critical for the AI infrastructure ecosystem and a 'secret weapon' for community-driven execution.
- The belief in open-source AI emphasizes diversity in models and hardware architectures as key to addressing real-world applications.
- Open source enables enterprises to perform deep stack tuning for specific use cases, unlike limitations with closed-source models.
- Ion Stoica, a co-founder of Databricks and long-time vLLM advisor, is also an Inferact co-founder, advising on open-source adoption and talent.
- Inferact is actively hiring experienced ML infrastructure engineers to optimize hardware like the GB200 NVL72 for giant open-source models.
- The company focuses on building a horizontal abstraction layer for accelerated computing, akin to operating systems and databases.
- This universal layer aims to abstract GPUs and other accelerated computing devices for AI models, improving speed, cost, and reliability.
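The scheduling challenge described above — dynamic input lengths and stochastic, model-determined stopping points — is what continuous batching addresses. The following is a minimal, self-contained sketch of the idea (all names are illustrative; this is not vLLM's actual scheduler): finished requests leave the batch immediately and waiting ones join every step, rather than the whole batch draining together as in static micro-batching.

```python
import random
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    prompt_len: int        # dynamic: from a single word to hundreds of pages
    generated: int = 0
    done: bool = False

def decode_step(batch, eos_prob=0.1):
    """Run one decode step for every request in the batch.

    Each request emits one token; whether it stops is stochastic —
    the model, not the server, decides when to produce EOS.
    """
    for req in batch:
        req.generated += 1
        if random.random() < eos_prob:
            req.done = True

def continuous_batching(waiting, max_batch=4, max_steps=1000):
    """Toy continuous-batching loop: admit and retire requests on
    every step instead of waiting for a fixed batch to drain."""
    batch, finished, steps = [], [], 0
    while (waiting or batch) and steps < max_steps:
        while waiting and len(batch) < max_batch:
            batch.append(waiting.pop(0))      # admit new work each step
        decode_step(batch)
        finished += [r for r in batch if r.done]
        batch = [r for r in batch if not r.done]
        steps += 1
    return finished

random.seed(0)
reqs = [Request(i, prompt_len=random.randint(1, 500)) for i in range(10)]
done = continuous_batching(list(reqs))
print(len(done))  # all 10 requests eventually finish
```

The key property is that a short request is never held hostage by a long one sharing its batch — exactly the failure mode of static batching under dynamic lengths.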
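The memory manager and KV cache mentioned in the architecture bullets can be illustrated with a toy paged allocator in the spirit of vLLM's PagedAttention (class and method names here are invented for illustration): each sequence's cache lives in fixed-size blocks drawn from a shared pool, so memory is not reserved for worst-case sequence lengths.

```python
class PagedKVCache:
    """Toy block-based KV-cache allocator: sequences get cache space
    in fixed-size blocks from a shared free pool, returned on finish."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> number of tokens cached

    def append(self, seq_id):
        """Reserve cache space for one new token; allocate a fresh
        block only when the sequence's last block is full."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:   # last block full (or none yet)
            if not self.free:
                raise MemoryError("KV cache exhausted; preempt a sequence")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(6):
    cache.append("req-0")              # 6 tokens -> 2 blocks of 4
print(len(cache.tables["req-0"]))      # 2
cache.release("req-0")
print(len(cache.free))                 # 8 (all blocks reclaimed)
```

Block-granular allocation is what lets a scheduler pack many variable-length sequences into fixed GPU memory without fragmentation.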
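The sharding challenge of distributing large models across GPUs can be sketched with column-wise tensor parallelism, shown here with NumPy standing in for real devices (a conceptual sketch, not a distributed implementation): each "GPU" holds a slice of a weight matrix and computes a slice of the output, which is then concatenated — in practice an all-gather over NVLink or the network.

```python
import numpy as np

def shard_columns(weight, num_gpus):
    """Split a weight matrix column-wise across devices."""
    return np.split(weight, num_gpus, axis=1)

def parallel_matmul(x, shards):
    """Each shard's matmul would run on its own device; the partial
    outputs are concatenated to recover the full result."""
    partials = [x @ w for w in shards]
    return np.concatenate(partials, axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 16))
w = rng.standard_normal((16, 8))
shards = shard_columns(w, num_gpus=4)

# Sharded result matches the single-device matmul.
assert np.allclose(parallel_matmul(x, shards), x @ w)
```

Real deployments combine this with row-parallel layers and pipeline parallelism across nodes, which is where the load-balancing difficulty mentioned above comes in.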
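The agent cache-management point — multi-turn conversations where reuse is valuable but invalidation is hard — can be sketched as prefix caching keyed by a hash of the token prefix (a hypothetical illustration; real systems cache KV blocks, not lengths, and need eviction policies this sketch omits).

```python
import hashlib

class PrefixCache:
    """Toy prefix cache: an agent turn that extends a previously seen
    conversation prefix reuses that prefix's cached state instead of
    recomputing it. Unpredictable multi-agent interleavings make the
    eviction/invalidation policy the genuinely hard part."""

    def __init__(self):
        self.store = {}  # prefix hash -> cached state (here: just length)

    @staticmethod
    def key(tokens):
        return hashlib.sha256(" ".join(map(str, tokens)).encode()).hexdigest()

    def insert(self, tokens):
        self.store[self.key(tokens)] = len(tokens)

    def lookup(self, tokens):
        """Return the length of the longest cached prefix of `tokens`."""
        for n in range(len(tokens), 0, -1):
            if self.key(tokens[:n]) in self.store:
                return n
        return 0

cache = PrefixCache()
cache.insert([1, 2, 3])               # first turn of a conversation
hit = cache.lookup([1, 2, 3, 4, 5])   # next turn extends the prefix
print(hit)  # 3 tokens of cached state reused
```

In a tool-calling loop, every turn shares the conversation so far as its prefix, so hit rates are high — provided the cache survives the seconds-to-hours gaps between agent interactions.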