Latent Space: The AI Engineer Podcast

State of the Art: Training >70B LLMs on 10,000 H100 clusters

Overview

* Databricks' acquisition of Mosaic has led to significant developments including DBRX, an open large language model, as well as a text-to-image model trained exclusively on Shutterstock's dataset with full provenance transparency, addressing a critical industry concern about training data origins.

* Imbue is releasing three key resources: infrastructure training scripts for managing hardware failures, evaluation resources including cleaned benchmarks with 450,000 human judgments, and CARBS, a cost-aware hyperparameter optimizer for efficient experimentation across model scales.

* Imbue operates a massive 4,092-GPU H100 cluster with custom infrastructure, taking a "full stack approach" by co-designing hardware with providers and developing lightweight, debuggable tools rather than relying on complex cloud services.

* Both companies emphasize practical AI evaluation over abstract reasoning benchmarks, focusing on code understanding, retrieval-augmented generation, and capabilities that translate to real-world utility for enterprise customers.

* Future work will focus on code capabilities as a versatile "god tool" for problem-solving, with upcoming releases including a small model called "Abra" and continued development of internal product prototypes.

Content

Databricks/Mosaic Acquisition and Recent Developments

- Developed in collaboration with Shutterstock - Trained exclusively on Shutterstock's dataset with known provenance of every image - Designed with enterprise customer transparency in mind - Features a new dinosaur mascot and plush toy to make the model name more memorable

Imbue's Three Main Categories of Releases

1. Infrastructure and Training Scripts - Scripts for managing hardware and hardware failures - Enables more efficient and stable model training - Addresses challenges of training foundation models, especially for smaller companies

2. Evaluation Resources - New benchmark for code understanding and reasoning - Cleaned versions of 11 open-source benchmarks - 450,000 human judgments about ambiguity and question quality - Aims to prevent data contamination and improve evaluation accuracy

3. CARBS (cost-aware hyperparameter optimizer) - Helps experiment at smaller scales and scale up precisely - Enables tuning of hyperparameters across different model sizes - Allows more efficient learning of scaling laws

Technical Infrastructure and Challenges

- InfiniBand cables stolen from data center - GPU memory correction issues causing job delays - Potential GPU computational errors (incorrect math calculations)

- Expected failures when bringing machines online, especially with early hardware builds - Worked closely with manufacturers to identify and fix firmware-level issues - At large scale, even rare problems (1 in 1000) become likely to occur

- Cloud providers offer limited visibility and debugging capabilities - Direct relationship with hardware manufacturers allows for collaborative troubleshooting, custom firmware updates, and greater control

Infrastructure Management Insights

- Approximately 3% of machines are expected to break every week - This implies a potential full machine turnover within a year
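A quick back-of-the-envelope check of that turnover claim (assuming, for illustration only, that weekly failures are independent):

```python
# Back-of-the-envelope: annualized effect of a ~3% weekly failure rate.
# Assumes failures are independent week to week (an illustration only).

weekly_failure_rate = 0.03
weeks_per_year = 52

# Expected failure events per machine per year.
expected_failures = weekly_failure_rate * weeks_per_year

# Probability a given machine survives the whole year without failing.
survival = (1 - weekly_failure_rate) ** weeks_per_year

print(f"expected failure events per machine-year: {expected_failures:.2f}")
print(f"chance a machine survives the year untouched: {survival:.1%}")
```

With roughly 1.56 expected failure events per machine-year, cumulative failures exceed the fleet size, which is the "full turnover" intuition above.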

- Examine every boot log line - Check if log lines are expected and in the correct order - Create a triage process for potential issues
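A minimal sketch of such a boot-log triage check; the expected lines here are hypothetical stand-ins, not Imbue's actual boot sequence:

```python
# Sketch: verify that expected boot-log lines appear in the correct order.
# EXPECTED is hypothetical; a real list would come from a known-good boot.

EXPECTED = [
    "BIOS POST complete",
    "NIC firmware loaded",
    "InfiniBand link up",
    "GPU driver initialized",
]

def triage_boot_log(log_lines):
    """Return a list of problems: missing or out-of-order expected lines."""
    problems = []
    pos = 0  # index of the next EXPECTED line we are waiting for
    for line in log_lines:
        if pos < len(EXPECTED) and EXPECTED[pos] in line:
            pos += 1
    for missing in EXPECTED[pos:]:
        problems.append(f"missing or out of order: {missing}")
    return problems

healthy = ["BIOS POST complete", "NIC firmware loaded",
           "InfiniBand link up", "GPU driver initialized"]
broken = ["BIOS POST complete", "GPU driver initialized"]

print(triage_boot_log(healthy))  # []
print(triage_boot_log(broken))   # flags everything after the break
```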

Hardware Setup and Debugging Challenges

- Unusual performance patterns observed during training, such as gradual decline in Model FLOPs Utilization (MFU) - Unexpected performance degradation - Root causes often traced to technical issues like memory fragmentation, garbage collection timing, and CPU throttling

- Emphasized importance of having good diagnostic tools - Ability to trace and identify performance bottlenecks - Potential solutions include turning off garbage collection, manually scheduling it, and monitoring CPU and heat metrics
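The garbage-collection mitigation can be sketched as follows: disable Python's automatic collector and trigger it manually at fixed step boundaries, so pauses land at the same point on every rank (a common pattern, not necessarily Imbue's exact code):

```python
import gc
import time

# Sketch: turn off Python's automatic garbage collector during the
# training loop and collect manually between steps, so GC pauses happen
# at a predictable, synchronized point instead of stalling collectives.

gc.disable()

def train_step(step):
    # Placeholder for the real forward/backward/optimizer work.
    time.sleep(0.001)

for step in range(100):
    train_step(step)
    if step % 50 == 0:
        # All ranks collect at the same step, avoiding a straggler
        # whose GC pause would stall all-reduce for everyone else.
        gc.collect()

gc.enable()
```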

- Leveraged libraries and implementations from NVIDIA's Megatron and DeepSpeed - Appreciated existing open-source tuning examples - Noted lack of standardized file system solutions in the ecosystem

Infrastructure and Tooling Philosophy

- Created a local, simple file storage solution instead of using complex cloud services - Developed a custom data loader with standardization across different infrastructure - Use Kubernetes as a hardware abstraction layer - Built a custom, simpler alternative to Kubernetes tailored for running experiments

- Kraken (from Uber): Distributed Docker registry using BitTorrent for efficient image transfer - Advantages: Fast, robust, and efficient image distribution across multiple machines

Advanced Infrastructure Challenges

- Network bandwidth limitations - Data transmission complexities - Parallelism strategies (FSDP, HSDP)

- Google's TPUs have different compute-to-bandwidth ratios compared to commodity hardware - Google uses unique network architectures such as three-dimensional torus topologies with ring reductions - NVIDIA's upcoming hardware (GH200, GB200) introduces hybrid networking with NVLink-connected GPU blocks
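The ring reductions used on torus networks can be illustrated with a tiny simulation of ring all-reduce: each of N workers owns one chunk of the gradient, and after a reduce-scatter phase plus an all-gather phase (2(N-1) steps total) every worker holds the full sum. This is a schematic, not any vendor's implementation:

```python
# Schematic ring all-reduce over N simulated workers. Worker w starts
# with gradient vector [w+1]*N; afterwards every worker holds the sum.

N = 4
grads = [[float(w + 1)] * N for w in range(N)]
total = float(sum(range(1, N + 1)))  # 10.0, the target value in every slot

# Phase 1, reduce-scatter: at step s, worker w sends chunk (w - s) % N
# to neighbor w+1, which accumulates it. After N-1 steps, worker w holds
# the fully reduced value for chunk (w + 1) % N.
for s in range(N - 1):
    for w in range(N):
        c = (w - s) % N
        grads[(w + 1) % N][c] += grads[w][c]

# Phase 2, all-gather: at step s, worker w forwards its completed chunk
# (w + 1 - s) % N to neighbor w+1, which overwrites its stale copy.
for s in range(N - 1):
    for w in range(N):
        c = (w + 1 - s) % N
        grads[(w + 1) % N][c] = grads[w][c]

print(f"all {N} workers hold the full sum {total} in every chunk")
```

Each worker only ever talks to one neighbor, which is why this pattern suits ring- and torus-shaped networks with limited per-link bandwidth.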

- Larger systems experience more frequent hardware failures - Need for built-in redundancy becomes critical - Restarting becomes less feasible with massive infrastructure

Current Model Focus

- Developing a text-only model - Prioritizing core reasoning and coding capabilities - Not currently pursuing vision/multimodal models - Open to partnering with other teams developing image and multimodal models - Working across the AI stack (infrastructure, pre-training, RL, fine-tuning, products)

CARBS and Model Scaling Approach

- A hyperparameter tuning approach that considers computational cost - Unlike traditional methods, CARBS models the expense of different configuration samples - Allows exploration of performance across varying data and compute scales - Can test model performance with partial data (e.g., 1/10th or 1/100th of full dataset) - Aims to understand performance scaling with increased data and computational resources
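One piece of this, fitting a scaling curve from cheap partial-data runs, can be sketched with a simple log-log power-law fit (the loss numbers are synthetic, and CARBS itself uses cost-aware Bayesian search rather than this linear fit):

```python
import math

# Sketch: fit a power law  loss ≈ a * fraction^(-b)  from runs at small
# data fractions (1/100th, 1/10th, full), then extrapolate. The data
# points are synthetic; this least-squares fit only approximates what a
# cost-aware Bayesian optimizer like CARBS does.

runs = [(0.01, 4.20), (0.1, 3.10), (1.0, 2.29)]  # (data fraction, loss)

xs = [math.log(f) for f, _ in runs]
ys = [math.log(l) for _, l in runs]
n = len(runs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
a = math.exp(my - slope * mx)  # coefficient
b = -slope                     # exponent

print(f"fitted: loss ≈ {a:.2f} * fraction^(-{b:.3f})")
print(f"predicted loss at 10x the full dataset: {a * 10 ** (-b):.2f}")
```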

- Discussing how network performance changes as models scale up - Exploring scaling laws for various parameters like number of layers, learning rate, and regularization techniques - Used a custom tokenizer and investigated how to scale it effectively

- Most researchers typically use loss as an evaluation metric - Loss provides precise, fine-grained differences between hyperparameters - This team developed a more nuanced approach focused on perplexity for multiple-choice questions - Used a held-out evaluation dataset independent of training data
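The perplexity-based scoring can be sketched as follows; the per-token log-probabilities are hard-coded stand-ins for real model outputs:

```python
import math

# Sketch: score a multiple-choice question by the perplexity the model
# assigns to each candidate answer, picking the lowest. The per-token
# log-probs below are hard-coded stand-ins for real model outputs.

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-prob per token)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probs for each answer option.
options = {
    "A": [-0.2, -0.4, -0.3],   # model finds this continuation likely
    "B": [-1.5, -2.0, -1.1],
    "C": [-2.2, -1.8, -2.5],
}

scores = {k: perplexity(v) for k, v in options.items()}
choice = min(scores, key=scores.get)
print(choice)  # "A": the lowest-perplexity option wins
```

Unlike a raw accuracy bit per question, these continuous scores expose fine-grained differences between hyperparameter settings.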

- Investigated changing data mix during training - Hoped to see models progressively learn more complex skills - Experimental results showed only tiny performance improvements from changing data mix - Not considered a promising research direction

Research Insights and Evaluation Challenges

- Key discussion around the concept of "emergence" in language models - Referenced a paper suggesting emergent behavior might be an artifact of evaluation metrics - Accuracy improvements often appear more dramatic when viewed on a logarithmic scale - Models can appear to improve by becoming less confident in incorrect predictions
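The metric-artifact point can be illustrated with synthetic numbers: if the per-token probability of the correct answer improves smoothly with scale, exact-match accuracy on a 10-token answer still looks like a sudden jump:

```python
import math

# Synthetic illustration: exact-match on a k-token answer appears
# "emergent" even when per-token probability improves smoothly.

k = 10  # tokens in the target answer
per_token_p = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95]  # smooth improvement with scale

exact_match = [p ** k for p in per_token_p]       # looks like a sudden jump
avg_logprob = [math.log(p) for p in per_token_p]  # improves steadily

for p, em, lp in zip(per_token_p, exact_match, avg_logprob):
    print(f"p={p:.2f}  exact-match={em:.4f}  logprob/token={lp:.3f}")
```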

- Importance of careful metric selection - Understanding log-scale representations - Checking for potential test set memorization - Suggested randomizing multiple-choice question order as a way to test model genuine understanding
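The suggested shuffling check can be sketched with a stand-in "model" that has memorized answer positions rather than content:

```python
import random

# Sketch of the memorization check: shuffle the order of multiple-choice
# options and re-evaluate. A model that memorized answer positions (here,
# a stand-in that always picks the second option) collapses to chance.

questions = [
    {"options": ["red", "green", "blue", "yellow"], "answer": "green"}
    for _ in range(1000)
]  # the correct answer always sits at position B before shuffling

def position_biased_model(options):
    return options[1]  # always "picks B", regardless of content

def evaluate(shuffle, seed=0):
    random.seed(seed)
    correct = 0
    for q in questions:
        opts = list(q["options"])
        if shuffle:
            random.shuffle(opts)
        correct += position_biased_model(opts) == q["answer"]
    return correct / len(questions)

print(f"unshuffled accuracy: {evaluate(False):.2f}")  # 1.00
print(f"shuffled accuracy:   {evaluate(True):.2f}")   # ~0.25, i.e. chance
```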

Evaluation Methodology and Dataset Quality

- Carefully examined existing datasets for quality and coherence - Identified many datasets have messy, potentially incoherent examples - Recognized that dataset creation often involves human errors and ambiguities - Developed a detailed process to assess what constitutes a "good" question and answer

- Reproduced 500-1000 examples for each dataset - Ensured training data was completely separate from testing data - Aimed to increase confidence in model performance measurements

- Many benchmark performance metrics are based on ambiguous or nonsensical questions - When using clear, unambiguous examples, model performance tends to converge - Noted interesting trend in ethics datasets where models are overly cautious to avoid potential controversies

Code Understanding Evaluation

- Higher-level evaluations (like assessing code architecture) become increasingly difficult - Realistic code tasks reveal nuanced challenges in benchmark testing - Example: Some benchmark solutions pass tests but might not represent optimal real-world implementations

- Mentioned SWE-bench as a new, more challenging code-related dataset - Discussed bug-fixing tasks as an example of complex evaluation scenarios - Referenced the AgentBench paper, highlighting limitations in current benchmark methodologies

Philosophical Approach to Evaluation

- Current evaluation benchmarks are consistently disappointing - Despite imperfections, benchmarks provide useful progress indicators - Models seem to be incrementally improving year-to-year

- Successful researchers must be comfortable operating in an inherently broken system - Ability to make progress despite incomplete information is crucial - Finding a balanced approach between perfectionism and chaos is key

Perspectives on Abstract Reasoning Benchmarks

- ARC-AGI is described as attempting to measure reasoning through an abstract IQ test - The speakers are cautious about over-emphasizing abstract reasoning benchmarks - They argue that such tests may not directly translate to real-world utility

- Prioritize evaluations that test basic model functionality - Prefer direct, quick-answer assessments without complex reasoning chains - Skeptical of benchmarks that can be optimized without improving general capabilities

- Databricks' customers are more interested in practical AI applications - Real-world use cases are more valuable than abstract reasoning tests - Example of a useful AI: a model that can converse about data and run SQL queries

Retrieval-Augmented Generation and Long Context

- Retrieval-augmented generation is considered "the world's simplest agent" - Allows models to decide when to retrieve data from context/database - Represents an early form of agent-like capability
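Such a retrieval loop can be sketched in a few lines; the model policy and document store here are hard-coded stand-ins:

```python
# Sketch of "the world's simplest agent": a loop where the model either
# answers directly or asks to retrieve a document first. The model
# policy and document store are hard-coded stand-ins.

DOCS = {"q3_revenue": "Q3 revenue was $12M."}

def model(question, context):
    # Stand-in policy: retrieve once if we have no context, then answer.
    if context is None and "revenue" in question:
        return ("RETRIEVE", "q3_revenue")
    return ("ANSWER", f"Based on: {context or 'prior knowledge'}")

def simplest_agent(question, max_steps=3):
    context = None
    for _ in range(max_steps):
        action, payload = model(question, context)
        if action == "RETRIEVE":
            context = DOCS.get(payload, "")
        else:
            return payload
    return "gave up"

print(simplest_agent("What was Q3 revenue?"))
```

The agent-like part is only the decision of *when* to retrieve; everything else is ordinary generation.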

- Long contexts are not inherently problematic - Potential benefits include thousand-shot tasks as alternative to fine-tuning, pulling large amounts of data into context - Inevitable in multimodal scenarios

- Annotating long context is extremely difficult and expensive - Requires human reading of massive token sets - "Needle in a haystack" tests are problematic because they don't measure holistic context usage and can be gamed
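Constructing a needle-in-a-haystack probe makes the gameability concern concrete: the needle is lexically unlike its surroundings, so a trivial pattern match finds it without any holistic use of the context (filler and needle text are invented for illustration):

```python
import random

# Sketch: build a needle-in-a-haystack probe. Filler sentences pad the
# context; one distinctive "needle" fact is buried at a chosen depth.
# Because the needle is lexically unlike its surroundings, even a dumb
# pattern match passes, without using the rest of the context at all.

FILLER = "The sky was clear and the market was quiet that day. "
NEEDLE = "The secret passcode is 7319. "

def build_haystack(n_sentences=200, depth=0.5, seed=0):
    random.seed(seed)
    sentences = [FILLER] * n_sentences
    sentences.insert(int(n_sentences * depth), NEEDLE)
    return "".join(sentences)

haystack = build_haystack(depth=0.75)

# A trivial "retriever" finds the needle with no holistic understanding:
answer = [s for s in haystack.split(". ") if "passcode" in s]
print(answer)
```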

- Coding scenarios: Filtering relevant repository context - Sorting contextual information by importance - Avoiding computational waste during inference

Code and Tool Use Approach

- Tool use is fundamentally about helping models interact effectively with structured data - Recommend continuing to use structured data APIs and languages instead of flattening everything into an LLM context - Text-to-SQL and backend SQL calls are particularly useful for practical applications
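The text-to-SQL pattern can be sketched with sqlite3 standing in for a warehouse; the "generated" query is hard-coded where a real LLM call would go:

```python
import sqlite3

# Sketch of the text-to-SQL pattern: structured data stays behind a SQL
# engine, and the model's only job is to produce the query. The
# "generated" query here is hard-coded where a real LLM call would go.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])

user_question = "Total sales by region?"
generated_sql = ("SELECT region, SUM(amount) FROM sales "
                 "GROUP BY region ORDER BY region")  # stand-in for an LLM call

rows = conn.execute(generated_sql).fetchall()
print(rows)  # [('east', 150.0), ('west', 250.0)]
```

The aggregation happens in the engine, so the model never needs the table flattened into its context.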

- Frustration with LLMs treating complex systems (like programming languages) as simple token streams - Despite decades of understanding programming language structures, models are forced to relearn everything from scratch - Current approach loses nuanced structural knowledge

- Different tools suit different problems: SQL for tabular data, knowledge graphs for complex entity relationships - Acknowledge knowledge graphs have limitations in handling messy, ambiguous real-world relationships - No single universal data representation tool exists

Future Plans and Closing Thoughts

- Making AI capabilities practically useful for real-world workflows - Improving code generation, understanding, testing, and verification - Serving their 12,000 customers

- Writing and sharing more blog posts about their scientific work - Potentially releasing new models (including a teased small model called "Abra") - Continuing to develop internal product prototypes

- Upcoming release at the AI Engineer World's Fair (ai.engineer live stream) - Job opportunities for those interested in code reasoning, understanding hardware and model mechanics, and designing practical AI systems

- Encouragement for those feeling exhausted or overwhelmed in the AI field - Reminder that even in seemingly well-understood domains (like cluster setup or evaluations), there are still more insights to discover and substantial work left to do - Despite feeling like the AI field is crowded and resource-intensive, there remains enormous potential for impactful work, significant unexplored areas, and opportunities for fresh perspectives
