Latent Space: The AI Engineer Podcast

State of the Art: Training >70B LLMs on 10,000 H100 clusters

Overview

* Databricks' acquisition of Mosaic has led to significant developments including DBRX, an open large language model, as well as a text-to-image model trained exclusively on Shutterstock's dataset with full provenance transparency, addressing a critical industry concern about training data origins.

* Imbue is releasing three key resources: infrastructure training scripts for managing hardware failures, evaluation resources including cleaned benchmarks with 450,000 human judgments, and CARBS, a cost-aware hyperparameter optimizer for efficient experimentation across model scales.

* Imbue operates a massive 4,092-GPU H100 cluster with custom infrastructure, taking a "full stack approach" by co-designing hardware with providers and developing lightweight, debuggable tools rather than relying on complex cloud services.

* Both companies emphasize practical AI evaluation over abstract reasoning benchmarks, focusing on code understanding, retrieval-augmented generation, and capabilities that translate to real-world utility for enterprise customers.

* Future work will focus on code capabilities as a versatile "god tool" for problem-solving, with upcoming releases including a small model called "Abra" and continued development of internal product prototypes.

Content

Databricks/Mosaic Acquisition and Recent Developments

- Developed in collaboration with Shutterstock - Trained exclusively on Shutterstock's dataset with known provenance of every image - Designed with enterprise customer transparency in mind - Features a new dinosaur mascot and plush toy to make the model name more memorable

Imbue's Three Main Categories of Releases

1. Infrastructure and Training Scripts - Scripts for managing hardware and hardware failures - Enables more efficient and stable model training - Addresses challenges of training foundation models, especially for smaller companies

2. Evaluation Resources - New benchmark for code understanding and reasoning - Cleaned versions of 11 open-source benchmarks - 450,000 human judgments about ambiguity and question quality - Aims to prevent data contamination and improve evaluation accuracy

3. CARBS (cost-aware hyperparameter optimizer) - Helps experiment at smaller scales and scale up precisely - Enables tuning of hyperparameters across different model sizes - Allows more efficient learning of scaling laws

Technical Infrastructure and Challenges

- InfiniBand cables stolen from data center - GPU memory correction issues causing job delays - Potential GPU computational errors (incorrect math calculations)

- Expected failures when bringing machines online, especially with early hardware builds - Worked closely with manufacturers to identify and fix firmware-level issues - At large scale, even rare problems (1 in 1000) become likely to occur

- Cloud providers offer limited visibility and debugging capabilities - Direct relationship with hardware manufacturers allows for collaborative troubleshooting, custom firmware updates, and greater control

Infrastructure Management Insights

- Approximately 3% of machines are expected to break every week - This implies a potential full machine turnover within a year
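A quick back-of-the-envelope check of that turnover claim (assuming, for illustration only, that weekly failures are independent):

```python
# Back-of-the-envelope: annualized effect of a ~3% weekly failure rate.
# Assumes failures are independent week to week (an illustration only).

weekly_failure_rate = 0.03
weeks_per_year = 52

# Expected failure events per machine per year.
expected_failures = weekly_failure_rate * weeks_per_year

# Probability a given machine survives the whole year without failing.
survival = (1 - weekly_failure_rate) ** weeks_per_year

print(f"expected failure events per machine-year: {expected_failures:.2f}")
print(f"chance a machine survives the year untouched: {survival:.1%}")
```

With roughly 1.56 expected failure events per machine-year, cumulative failures exceed the fleet size, which is the "full turnover" intuition above.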

- Examine every boot log line - Check if log lines are expected and in the correct order - Create a triage process for potential issues
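A minimal sketch of such a boot-log triage check; the expected lines here are hypothetical stand-ins, not Imbue's actual boot sequence:

```python
# Sketch: verify that expected boot-log lines appear in the correct order.
# EXPECTED is hypothetical; a real list would come from a known-good boot.

EXPECTED = [
    "BIOS POST complete",
    "NIC firmware loaded",
    "InfiniBand link up",
    "GPU driver initialized",
]

def triage_boot_log(log_lines):
    """Return a list of problems: missing or out-of-order expected lines."""
    problems = []
    pos = 0  # index of the next EXPECTED line we are waiting for
    for line in log_lines:
        if pos < len(EXPECTED) and EXPECTED[pos] in line:
            pos += 1
    for missing in EXPECTED[pos:]:
        problems.append(f"missing or out of order: {missing}")
    return problems

healthy = ["BIOS POST complete", "NIC firmware loaded",
           "InfiniBand link up", "GPU driver initialized"]
broken = ["BIOS POST complete", "GPU driver initialized"]

print(triage_boot_log(healthy))  # []
print(triage_boot_log(broken))   # flags everything after the break
```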

Hardware Setup and Debugging Challenges

- Unusual performance patterns observed during training, such as gradual decline in Model FLOPs Utilization (MFU) - Unexpected performance degradation - Root causes often traced to technical issues like memory fragmentation, garbage collection timing, and CPU throttling

- Emphasized importance of having good diagnostic tools - Ability to trace and identify performance bottlenecks - Potential solutions include turning off garbage collection, manually scheduling it, and monitoring CPU and heat metrics
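The garbage-collection mitigation can be sketched as follows: disable Python's automatic collector and trigger it manually at fixed step boundaries, so pauses land at the same point on every rank (a common pattern, not necessarily Imbue's exact code):

```python
import gc
import time

# Sketch: turn off Python's automatic garbage collector during the
# training loop and collect manually between steps, so GC pauses happen
# at a predictable, synchronized point instead of stalling collectives.

gc.disable()

def train_step(step):
    # Placeholder for the real forward/backward/optimizer work.
    time.sleep(0.001)

for step in range(100):
    train_step(step)
    if step % 50 == 0:
        # All ranks collect at the same step, avoiding a straggler
        # whose GC pause would stall all-reduce for everyone else.
        gc.collect()

gc.enable()
```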

- Leveraged libraries and implementations from NVIDIA's Megatron and DeepSpeed - Appreciated existing open-source tuning examples - Noted lack of standardized file system solutions in the ecosystem

Infrastructure and Tooling Philosophy

- Created a local, simple file storage solution instead of using complex cloud services - Developed a custom data loader with standardization across different infrastructure - Use Kubernetes as a hardware abstraction layer - Built a custom, simpler alternative to Kubernetes tailored for running experiments

- Kraken (from Uber): Distributed Docker registry using BitTorrent for efficient image transfer - Advantages: Fast, robust, and efficient image distribution across multiple machines

Advanced Infrastructure Challenges

- Network bandwidth limitations - Data transmission complexities - Parallelism strategies (FSDP, HSDP)

- Google's TPUs have different compute-to-bandwidth ratios compared to commodity hardware - Google uses unique network architectures such as three-dimensional torus topologies with ring reductions - NVIDIA's upcoming hardware (GH200, GB200) introduces hybrid networking with NVLink-connected GPU blocks
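The ring reductions used on torus networks can be illustrated with a tiny simulation of ring all-reduce: each of N workers owns one chunk of the gradient, and after a reduce-scatter phase plus an all-gather phase (2(N-1) steps total) every worker holds the full sum. This is a schematic, not any vendor's implementation:

```python
# Schematic ring all-reduce over N simulated workers. Worker w starts
# with gradient vector [w+1]*N; afterwards every worker holds the sum.

N = 4
grads = [[float(w + 1)] * N for w in range(N)]
total = float(sum(range(1, N + 1)))  # 10.0, the target value in every slot

# Phase 1, reduce-scatter: at step s, worker w sends chunk (w - s) % N
# to neighbor w+1, which accumulates it. After N-1 steps, worker w holds
# the fully reduced value for chunk (w + 1) % N.
for s in range(N - 1):
    for w in range(N):
        c = (w - s) % N
        grads[(w + 1) % N][c] += grads[w][c]

# Phase 2, all-gather: at step s, worker w forwards its completed chunk
# (w + 1 - s) % N to neighbor w+1, which overwrites its stale copy.
for s in range(N - 1):
    for w in range(N):
        c = (w + 1 - s) % N
        grads[(w + 1) % N][c] = grads[w][c]

print(f"all {N} workers hold the full sum {total} in every chunk")
```

Each worker only ever talks to one neighbor, which is why this pattern suits ring- and torus-shaped networks with limited per-link bandwidth.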

- Larger systems experience more frequent hardware failures - Need for built-in redundancy becomes critical - Restarting becomes less feasible with massive infrastructure

Current Model Focus

- Developing a text-only model - Prioritizing core reasoning and coding capabilities - Not currently pursuing vision/multimodal models - Open to partnering with other teams developing image and multimodal models - Working across the AI stack (infrastructure, pre-training, RL, fine-tuning, products)

CARBS and Model Scaling Approach

- A hyperparameter tuning approach that considers computational cost - Unlike traditional methods, CARBS models the expense of different configuration samples - Allows exploration of performance across varying data and compute scales - Can test model performance with partial data (e.g., 1/10th or 1/100th of full dataset) - Aims to understand performance scaling with increased data and computational resources
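One piece of this, fitting a scaling curve from cheap partial-data runs, can be sketched with a simple log-log power-law fit (the loss numbers are synthetic, and CARBS itself uses cost-aware Bayesian search rather than this linear fit):

```python
import math

# Sketch: fit a power law  loss ≈ a * fraction^(-b)  from runs at small
# data fractions (1/100th, 1/10th, full), then extrapolate. The data
# points are synthetic; this least-squares fit only approximates what a
# cost-aware Bayesian optimizer like CARBS does.

runs = [(0.01, 4.20), (0.1, 3.10), (1.0, 2.29)]  # (data fraction, loss)

xs = [math.log(f) for f, _ in runs]
ys = [math.log(l) for _, l in runs]
n = len(runs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
a = math.exp(my - slope * mx)  # coefficient
b = -slope                     # exponent

print(f"fitted: loss ≈ {a:.2f} * fraction^(-{b:.3f})")
print(f"predicted loss at 10x the full dataset: {a * 10 ** (-b):.2f}")
```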

- Discussing how network performance changes as models scale up - Exploring scaling laws for various parameters like number of layers, learning rate, and regularization techniques - Used a custom tokenizer and investigated how to scale it effectively

- Most researchers typically use loss as an evaluation metric - Loss provides precise, fine-grained differences between hyperparameters - This team developed a more nuanced approach focused on perplexity for multiple-choice questions - Used a held-out evaluation dataset independent of training data
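The perplexity-based scoring can be sketched as follows; the per-token log-probabilities are hard-coded stand-ins for real model outputs:

```python
import math

# Sketch: score a multiple-choice question by the perplexity the model
# assigns to each candidate answer, picking the lowest. The per-token
# log-probs below are hard-coded stand-ins for real model outputs.

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-prob per token)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs for each answer option.
options = {
    "A": [-0.2, -0.4, -0.3],   # model finds this continuation likely
    "B": [-1.5, -2.0, -1.1],
    "C": [-2.2, -1.8, -2.5],
}

scores = {k: perplexity(v) for k, v in options.items()}
choice = min(scores, key=scores.get)
print(choice)  # "A": the lowest-perplexity option wins
```

Unlike a raw accuracy bit per question, these continuous scores expose fine-grained differences between hyperparameter settings.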

- Investigated changing data mix during training - Hoped to see models progressively learn more complex skills - Experimental results showed only tiny performance improvements from changing data mix - Not considered a promising research direction

Research Insights and Evaluation Challenges

- Key discussion around the concept of "emergence" in language models - Referenced a paper suggesting emergent behavior might be an artifact of evaluation metrics - Accuracy improvements often appear more dramatic when viewed on a logarithmic scale - Models can appear to improve by becoming less confident in incorrect predictions
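The metric-artifact point can be illustrated with synthetic numbers: if the per-token probability of the correct answer improves smoothly with scale, exact-match accuracy on a 10-token answer still looks like a sudden jump:

```python
import math

# Synthetic illustration: exact-match on a k-token answer appears
# "emergent" even when per-token probability improves smoothly.

k = 10  # tokens in the target answer
per_token_p = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]  # smooth improvement with scale

exact_match = [p ** k for p in per_token_p]       # looks like a sudden jump
avg_logprob = [math.log(p) for p in per_token_p]  # improves steadily

for p, em, lp in zip(per_token_p, exact_match, avg_logprob):
    print(f"p={p:.2f}  exact-match={em:.4f}  logprob/token={lp:.3f}")
```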

- Importance of careful metric selection - Understanding log-scale representations - Checking for potential test set memorization - Suggested randomizing multiple-choice question order as a way to test model genuine understanding
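The suggested shuffling check can be sketched with a stand-in "model" that has memorized answer positions rather than content:

```python
import random

# Sketch of the memorization check: shuffle the order of multiple-choice
# options and re-evaluate. A model that memorized answer positions (here,
# a stand-in that always picks the second option) collapses to chance.

questions = [
    {"options": ["red", "green", "blue", "yellow"], "answer": "green"}
    for _ in range(1000)
]  # the correct answer always sits at position B before shuffling

def position_biased_model(options):
    return options[1]  # always "picks B", regardless of content

def evaluate(shuffle, seed=0):
    random.seed(seed)
    correct = 0
    for q in questions:
        opts = list(q["options"])
        if shuffle:
            random.shuffle(opts)
        correct += position_biased_model(opts) == q["answer"]
    return correct / len(questions)

print(f"unshuffled accuracy: {evaluate(False):.2f}")  # 1.00
print(f"shuffled accuracy:   {evaluate(True):.2f}")   # ~0.25, i.e. chance
```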

Evaluation Methodology and Dataset Quality

- Carefully examined existing datasets for quality and coherence - Identified many datasets have messy, potentially incoherent examples - Recognized that dataset creation often involves human errors and ambiguities - Developed a detailed process to assess what constitutes a "good" question and answer

- Reproduced 500-1000 examples for each dataset - Ensured training data was completely separate from testing data - Aimed to increase confidence in model performance measurements

- Many benchmark performance metrics are based on ambiguous or nonsensical questions - When using clear, unambiguous examples, model performance tends to converge - Noted interesting trend in ethics datasets where models are overly cautious to avoid potential controversies

Code Understanding Evaluation

- Higher-level evaluations (like assessing code architecture) become increasingly difficult - Realistic code tasks reveal nuanced challenges in benchmark testing - Example: Some benchmark solutions pass tests but might not represent optimal real-world implementations

- Mentioned SWE-bench as a new, more challenging code-related dataset - Discussed bug-fixing tasks as an example of complex evaluation scenarios - Referenced the AgentBench paper, highlighting limitations in current benchmark methodologies

Philosophical Approach to Evaluation

- Current evaluation benchmarks are consistently disappointing - Despite imperfections, benchmarks provide useful progress indicators - Models seem to be incrementally improving year-to-year

- Successful researchers must be comfortable operating in an inherently broken system - Ability to make progress despite incomplete information is crucial - Finding a balanced approach between perfectionism and chaos is key

Perspectives on Abstract Reasoning Benchmarks

- ARC-AGI is described as attempting to measure reasoning through an abstract IQ test - The speakers are cautious about over-emphasizing abstract reasoning benchmarks - They argue that such tests may not directly translate to real-world utility

- Prioritize evaluations that test basic model functionality - Prefer direct, quick-answer assessments without complex reasoning chains - Skeptical of benchmarks that can be optimized without improving general capabilities

- Databricks' customers are more interested in practical AI applications - Real-world use cases are more valuable than abstract reasoning tests - Example of a useful AI: a model that can converse about data and run SQL queries

Retrieval-Augmented Generation and Long Context

- Retrieval-augmented generation is considered "the world's simplest agent" - Allows models to decide when to retrieve data from context/database - Represents an early form of agent-like capability
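Such a retrieval loop can be sketched in a few lines; the model policy and document store here are hard-coded stand-ins:

```python
# Sketch of "the world's simplest agent": a loop where the model either
# answers directly or asks to retrieve a document first. The model
# policy and document store are hard-coded stand-ins.

DOCS = {"q3_revenue": "Q3 revenue was $12M."}

def model(question, context):
    # Stand-in policy: retrieve once if we have no context, then answer.
    if context is None and "revenue" in question:
        return ("RETRIEVE", "q3_revenue")
    return ("ANSWER", f"Based on: {context or 'prior knowledge'}")

def simplest_agent(question, max_steps=3):
    context = None
    for _ in range(max_steps):
        action, payload = model(question, context)
        if action == "RETRIEVE":
            context = DOCS.get(payload, "")
        else:
            return payload
    return "gave up"

print(simplest_agent("What was Q3 revenue?"))
```

The agent-like part is only the decision of *when* to retrieve; everything else is ordinary generation.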

- Long contexts are not inherently problematic - Potential benefits include thousand-shot tasks as alternative to fine-tuning, pulling large amounts of data into context - Inevitable in multimodal scenarios

- Annotating long context is extremely difficult and expensive - Requires human reading of massive token sets - "Needle in a haystack" tests are problematic because they don't measure holistic context usage and can be gamed
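Constructing a needle-in-a-haystack probe makes the gameability concern concrete: the needle is lexically unlike its surroundings, so a trivial pattern match finds it without any holistic use of the context (filler and needle text are invented for illustration):

```python
import random

# Sketch: build a needle-in-a-haystack probe. Filler sentences pad the
# context; one distinctive "needle" fact is buried at a chosen depth.
# Because the needle is lexically unlike its surroundings, even a dumb
# pattern match passes, without using the rest of the context at all.

FILLER = "The sky was clear and the market was quiet that day. "
NEEDLE = "The secret passcode is 7319. "

def build_haystack(n_sentences=200, depth=0.5, seed=0):
    random.seed(seed)
    sentences = [FILLER] * n_sentences
    sentences.insert(int(n_sentences * depth), NEEDLE)
    return "".join(sentences)

haystack = build_haystack(depth=0.75)

# A trivial "retriever" finds the needle with no holistic understanding:
answer = [s for s in haystack.split(". ") if "passcode" in s]
print(answer)
```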

- Coding scenarios: Filtering relevant repository context - Sorting contextual information by importance - Avoiding computational waste during inference

Code and Tool Use Approach

- Tool use is fundamentally about helping models interact effectively with structured data - Recommend continuing to use structured data APIs and languages instead of flattening everything into an LLM context - Text-to-SQL and backend SQL calls are particularly useful for practical applications
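The text-to-SQL pattern can be sketched with sqlite3 standing in for a warehouse; the "generated" query is hard-coded where a real LLM call would go:

```python
import sqlite3

# Sketch of the text-to-SQL pattern: structured data stays behind a SQL
# engine, and the model's only job is to produce the query. The
# "generated" query here is hard-coded where a real LLM call would go.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])

user_question = "Total sales by region?"
generated_sql = ("SELECT region, SUM(amount) FROM sales "
                 "GROUP BY region ORDER BY region")  # stand-in for an LLM call

rows = conn.execute(generated_sql).fetchall()
print(rows)  # [('east', 150.0), ('west', 250.0)]
```

The aggregation happens in the engine, so the model never needs the table flattened into its context.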

- Frustration with LLMs treating complex systems (like programming languages) as simple token streams - Despite decades of understanding programming language structures, models are forced to relearn everything from scratch - Current approach loses nuanced structural knowledge

- Different tools suit different problems: SQL for tabular data, knowledge graphs for complex entity relationships - Acknowledge knowledge graphs have limitations in handling messy, ambiguous real-world relationships - No single universal data representation tool exists

Future Plans and Closing Thoughts

- Making AI capabilities practically useful for real-world workflows - Improving code generation, understanding, testing, and verification - Serving their 12,000 customers

- Writing and sharing more blog posts about their scientific work - Potentially releasing new models (including a teased small model called "Abra") - Continuing to develop internal product prototypes

- Upcoming release at the AI Engineer World's Fair (ai.engineer live stream) - Job opportunities for those interested in code reasoning, understanding hardware and model mechanics, and designing practical AI systems

- Encouragement for those feeling exhausted or overwhelmed in the AI field - Reminder that even in seemingly well-understood domains (like cluster setup or evaluations), there are still more insights to discover and substantial work left to do - Despite feeling like the AI field is crowded and resource-intensive, there remains enormous potential for impactful work, significant unexplored areas, and opportunities for fresh perspectives
