Overview
* Databricks' acquisition of Mosaic has led to significant developments, including a text-to-image model built with Shutterstock and trained exclusively on Shutterstock's dataset with full provenance transparency, addressing a critical industry concern about training data origins.
* Imbue is releasing three key resources: infrastructure training scripts for managing hardware failures, evaluation resources including cleaned benchmarks with 450,000 human judgments, and CARBS (Cost-Aware Pareto-Region Bayesian Search), a hyperparameter optimizer for efficient experimentation across model scales.
* The teams operate a massive 4,092 H100 GPU cluster with custom infrastructure, taking a "full stack approach" by co-designing hardware with providers and developing lightweight, debuggable tools rather than relying on complex cloud services.
* Both companies emphasize practical AI evaluation over abstract reasoning benchmarks, focusing on code understanding, retrieval-augmented generation, and capabilities that translate to real-world utility for enterprise customers.
* Future work will focus on code capabilities as a versatile "god tool" for problem-solving, with upcoming releases including a small model called "Abra" and continued development of internal product prototypes.
Content
Databricks/Mosaic Acquisition and Recent Developments
- Jonathan Frankle is now Chief AI Scientist at Databricks following the Mosaic acquisition
- Most recent significant announcement is a new text-to-image model
- Developed in collaboration with Shutterstock
- Not to be confused with DBRX, Databricks' open mixture-of-experts LLM
- Trained exclusively on Shutterstock's dataset with known provenance of every image
- Designed with enterprise customer transparency in mind
- Features a new dinosaur mascot and plush toy to make the model name more memorable
- Industry context: Shutterstock's dataset is highly valuable, with multiple major tech companies (OpenAI, Google, Meta, Apple) reportedly using Shutterstock data
- The model addresses concerns about image model training data transparency
Imbue's Three Main Categories of Releases
- Josh Albrecht, CTO of Imbue, introduces three main categories of releases:
1. Infrastructure and Training Scripts
- Scripts for managing hardware and hardware failures
- Enables more efficient and stable model training
- Addresses challenges of training foundation models, especially for smaller companies
2. Evaluation Resources
- New benchmark for code reasoning understanding
- Cleaned versions of 11 open-source benchmarks
- 450,000 human judgments about ambiguity and question quality
- Aims to prevent data contamination and improve evaluation accuracy
3. CARBS (Cost-Aware Pareto-Region Bayesian Search), a hyperparameter optimizer
- Helps experiment at smaller scales and scale up precisely
- Enables tuning of hyperparameters across different model sizes
- Allows more efficient learning of scaling laws
- The speakers emphasize the critical but often overlooked challenges of cluster management, software deployment, fault tolerance during training, and GPU communication performance
- Context includes a brief mention of DBRX (132 billion total parameters, 36 billion active parameters, trained on 12 trillion tokens)
- The release focuses on sharing tools and insights, not model weights
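The cost-aware search that CARBS performs can be illustrated with a toy sketch. Everything below is hypothetical: the `train_proxy` objective and the parameters-times-tokens cost model are stand-ins, and real CARBS uses a Bayesian surrogate rather than the random sampling shown here. The sketch only demonstrates the core idea of keeping the Pareto front of cost versus loss across scales:

```python
import math
import random

def train_proxy(width: int, lr: float, tokens: int) -> float:
    """Stand-in for a real training run: returns a synthetic validation
    loss that improves with width and tokens and has a sweet spot in lr."""
    return (5.0 / math.log2(width)
            + 50.0 / math.sqrt(tokens)
            + abs(math.log10(lr) - math.log10(3e-4)))

def cost(width: int, tokens: int) -> float:
    """Rough compute-cost proxy: parameter count (~width^2) times tokens."""
    return width * width * tokens

def pareto_front(observations):
    """Keep the (cost, loss, config) points that no cheaper run beats."""
    front = []
    for c, l, cfg in sorted(observations, key=lambda t: (t[0], t[1])):
        if not front or l < front[-1][1]:
            front.append((c, l, cfg))
    return front

def cost_aware_search(n_samples: int = 200, seed: int = 0):
    """Sample configurations at many scales and return the cost/loss
    Pareto front -- the cheap-but-good runs used to extrapolate upward."""
    rng = random.Random(seed)
    obs = []
    for _ in range(n_samples):
        width = rng.choice([64, 128, 256, 512])
        lr = 10 ** rng.uniform(-5, -2)
        tokens = rng.choice([10_000, 100_000, 1_000_000])
        obs.append((cost(width, tokens),
                    train_proxy(width, lr, tokens),
                    {"width": width, "lr": lr, "tokens": tokens}))
    return pareto_front(obs)
```

Along the returned front, cost rises while loss falls; fitting a curve through those points is one way to estimate scaling behavior before committing to a full-scale run.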
Technical Infrastructure and Challenges
- Discussing a massive GPU cluster with 4,092 H100 GPUs across 511 computers
- The cluster uses a non-standard three-tier network architecture (typically clusters are two-tier with around 1,024 GPUs)
- Worked closely with Voltage Park to design and set up the infrastructure
- Experienced numerous technical difficulties, including:
- InfiniBand cables stolen from data center
- GPU memory correction issues causing job delays
- Potential GPU computational errors (incorrect math calculations)
- Took a "full stack approach" by co-designing hardware with providers like Dell and NVIDIA
- Used Metal as a Service to automate operating system installation across hundreds of machines
- Each machine costs hundreds of thousands of dollars and has multiple networking interfaces
- Challenges in hardware deployment:
- Expected failures when bringing machines online, especially with early hardware builds
- Worked closely with manufacturers to identify and fix firmware-level issues
- At large scale, even rare problems (1 in 1000) become likely to occur
- Motivation for custom approach rather than using cloud providers:
- Cloud providers offer limited visibility and debugging capabilities
- Direct relationship with hardware manufacturers allows for collaborative troubleshooting, custom firmware updates, and greater control
Infrastructure Management Insights
- Most AI companies rely on cloud providers to handle infrastructure complexities
- Direct engagement with hardware providers (NVIDIA, Dell) can be more efficient than multi-layered communication
- Infrastructure management involves intricate details like power contracts, InfiniBand cable quality, firmware updates, and physical security
- Reliability expectations:
- Approximately 3% of machines are expected to break every week
- This implies a potential full machine turnover within a year
- The team has achieved a lower than expected failure rate due to proactive root cause analysis
- Developed detailed health checks that:
- Examine every boot log line
- Check if log lines are expected and in the correct order
- Create a triage process for potential issues
- Infrastructure team is very small (3-6 people) but accomplishes significant work through high skill level
- Relies on collaboration with vendors like Dell, H5, and NVIDIA
- Emphasizes learning and documenting processes for future reference
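The boot-log checks described above might look something like the following sketch. The `EXPECTED` patterns are invented for illustration (a real deployment would carry hundreds of host-specific patterns); the point is flagging unexpected lines, out-of-order lines, and missing lines for triage:

```python
import re

# Hypothetical patterns for boot-log lines we expect, in the order
# we expect them to appear.
EXPECTED = [
    r"BIOS version .*",
    r"InfiniBand link .* up",
    r"GPU \d+ initialized",
    r"NCCL test passed",
]

def check_boot_log(lines):
    """Examine every boot-log line and return a triage list of
    (line number, text, kind) for anything that deviates."""
    issues = []
    seen = []  # indices of expected patterns, in encounter order
    for lineno, line in enumerate(lines, 1):
        idx = next((i for i, pat in enumerate(EXPECTED)
                    if re.fullmatch(pat, line)), None)
        if idx is None:
            issues.append((lineno, line, "unexpected line"))
        else:
            if seen and idx < seen[-1]:
                issues.append((lineno, line, "out of order"))
            seen.append(idx)
    for i, pat in enumerate(EXPECTED):
        if i not in seen:
            issues.append((None, pat, "missing expected line"))
    return issues
```

A healthy boot produces an empty list; anything else feeds the triage queue rather than letting a marginal machine join the cluster.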
Hardware Setup and Debugging Challenges
- Massive infrastructure setup involving 12,000 cables (24,000 connection points)
- Significant complexity in hardware installation and maintenance
- Collaborative effort involving teams from Dell, NVIDIA, and H5
- Technical performance and debugging insights:
- Unusual performance patterns observed during training, such as gradual decline in model FLOPs utilization (MFU)
- Unexpected performance degradation
- Root causes often traced to technical issues like memory fragmentation, garbage collection timing, and CPU throttling
- Emphasized importance of having good diagnostic tools
- Ability to trace and identify performance bottlenecks
- Potential solutions include turning off garbage collection, manually scheduling it, and monitoring CPU and heat metrics
- Open source contributions:
- Leveraged libraries and implementations from NVIDIA's Megatron and DeepSpeed
- Appreciated existing open-source tuning examples
- Noted lack of standardized file system solutions in the ecosystem
Infrastructure and Tooling Philosophy
- Preference for simple, lightweight tools that are easy to debug
- Avoid complex infrastructure with multiple layers of abstraction
- Use basic tools like Bash, Python, SSH, and Docker
- Aim to keep systems straightforward to minimize maintenance overhead
- Specific infrastructure approaches:
- Created a local, simple file storage solution instead of using complex cloud services
- Developed a custom data loader with standardization across different infrastructure
- Use Kubernetes as a hardware abstraction layer
- Built a custom, simpler alternative to Kubernetes tailored for running experiments
- Kraken (from Uber): Distributed Docker registry using BitTorrent for efficient image transfer
- Advantages: Fast, robust, and efficient image distribution across multiple machines
- Currently running on 6-7 different cloud providers
- Standardize on commodity infrastructure where possible
- Design infrastructure with failure and infrastructure upgrades as expected scenarios
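In the spirit of the "basic tools" philosophy above (Python plus SSH, with failure as an expected scenario), a fleet-wide command fan-out can be a few dozen lines. This is a generic sketch, not Imbue's actual tooling; the ssh invocation and host names are illustrative:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_on_host(host: str, command: str, timeout: int = 30):
    """Run `command` on `host` over ssh; return (host, ok, output).
    At fleet scale failures are expected, so they are data, not exceptions."""
    try:
        proc = subprocess.run(
            ["ssh", "-o", "ConnectTimeout=5", host, command],
            capture_output=True, text=True, timeout=timeout)
        return host, proc.returncode == 0, proc.stdout + proc.stderr
    except subprocess.TimeoutExpired:
        return host, False, "timed out"

def fan_out(hosts, command, runner=run_on_host, max_workers=64):
    """Run `command` on every host in parallel; return maps of
    healthy and failed hosts so broken machines can be triaged."""
    ok, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for host, success, output in pool.map(
                lambda h: runner(h, command), hosts):
            (ok if success else failed)[host] = output
    return ok, failed
```

The `runner` parameter makes the fan-out testable without real machines, and keeping the whole thing in plain Python means any engineer can debug it with a stack trace rather than a control-plane dashboard.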
Advanced Infrastructure Challenges
- The speaker emphasizes avoiding unnecessary complexity in deep learning projects, advising to use only the minimum tools and strategies required
- Key infrastructure challenges include:
- Network bandwidth limitations
- Data transmission complexities
- Parallelism strategies (FSDP, HSDP)
- Comparative infrastructure insights:
- Google's TPUs have different compute-to-bandwidth ratios compared to commodity hardware
- Google uses unique network architectures like three-dimensional torus topologies with ring reductions
- NVIDIA's upcoming hardware (GH200, GB200) introduces hybrid networking with GPU blocks
- Scaling challenges emerge as model sizes increase:
- Larger systems experience more frequent hardware failures
- Need for built-in redundancy becomes critical
- Restarting becomes less feasible with massive infrastructure
Storage Challenges and Current Focus
- Large-scale AI training encounters significant storage challenges, especially with multi-petabyte datasets
- Data center storage limitations can force data to be stored in different regions
- Cheap, fast storage is crucial for effective checkpointing and training
- Developing a text-only model
- Prioritizing core reasoning and coding capabilities
- Not currently pursuing vision/multimodal models
- Open to partnering with other teams developing image and multimodal models
- Working across the AI stack (infrastructure, pre-training, RL, fine-tuning, products)
CARBS and Model Scaling Approach
- CARBS (Cost-Aware Pareto-Region Bayesian Search):
- A hyperparameter tuning approach that considers computational cost
- Unlike traditional methods, CARBS models the expense of different configuration samples
- Allows exploration of performance across varying data and compute scales
- Can test model performance with partial data (e.g., 1/10th or 1/100th of full dataset)
- Aims to understand performance scaling with increased data and computational resources
- Model scaling and performance:
- Discussing how network performance changes as models scale up
- Exploring scaling laws for various parameters like number of layers, learning rate, and regularization techniques
- Used a custom tokenizer and investigated how to scale it effectively
- Most researchers typically use loss as an evaluation metric
- Loss provides precise, fine-grained differences between hyperparameters
- This team developed a more nuanced approach focused on perplexity for multiple-choice questions
- Used a held-out evaluation dataset independent of training data
- Investigated changing data mix during training
- Hoped to see models progressively learn more complex skills
- Experimental results showed only tiny performance improvements from changing data mix
- Not considered a promising research direction
Research Insights and Evaluation Challenges
- Researchers are exploring various techniques to improve language models, including data set modifications and parameter tuning
- Different labs have unique approaches to mitigating model challenges, leading to seemingly contradictory but actually complementary research
- Emergence and evaluation metrics:
- Key discussion around the concept of "emergence" in language models
- Referenced a paper suggesting emergent behavior might be an artifact of evaluation metrics
- Accuracy improvements often appear more dramatic when viewed on a logarithmic scale
- Models can appear to improve by becoming less confident in incorrect predictions
- Importance of careful metric selection
- Understanding log-scale representations
- Checking for potential test set memorization
- Suggested randomizing the order of multiple-choice options as a way to test models' genuine understanding
Evaluation Methodology and Dataset Quality
- The speakers took a meticulous approach to evaluating natural language understanding and reasoning datasets:
- Carefully examined existing datasets for quality and coherence
- Identified many datasets have messy, potentially incoherent examples
- Recognized that dataset creation often involves human errors and ambiguities
- Developed a detailed process to assess what constitutes a "good" question and answer
- Reproduced 500-1000 examples for each dataset
- Ensured training data was completely separate from testing data
- Aimed to increase confidence in model performance measurements
- Interesting observations about dataset performance:
- Many benchmark performance metrics are based on ambiguous or nonsensical questions
- When using clear, unambiguous examples, model performance tends to converge
- Noted interesting trend in ethics datasets where models are overly cautious to avoid potential controversies
Code Understanding Evaluation
- The discussion focuses on new evaluation methods for AI models, particularly in code understanding
- They are releasing a code understanding evaluation benchmark that can generate infinite data programmatically
- Current focus is on low-level code understanding (variable-level context)
- Goal is to enable smaller scale models to perform code-related tasks
- They've created an internal version of the MBPP dataset, carefully reviewing each example to remove ambiguity
- Challenges in advanced evaluations:
- Higher-level evaluations (like assessing code architecture) become increasingly difficult
- Realistic code tasks reveal nuanced challenges in benchmark testing
- Example: Some benchmark solutions pass tests but might not represent optimal real-world implementations
- Mentioned SWE-bench as a new, more challenging code-related dataset
- Discussed bug-fixing tasks as an example of complex evaluation scenarios
- Referenced AgentBench paper, highlighting limitations in current benchmark methodologies
Philosophical Approach to Evaluation
- Evaluation of AI models is extremely challenging and complex
- Imperfection is inherent in deep learning and model assessment
- Current evaluation benchmarks are consistently disappointing
- Despite imperfections, benchmarks provide useful progress indicators
- Models seem to be incrementally improving year-to-year
- Philosophical approach to imperfection:
- Successful researchers must be comfortable operating in an inherently broken system
- Ability to make progress despite incomplete information is crucial
- Finding a balanced approach between perfectionism and chaos is key
- More valuable to have a model that can communicate its uncertainties
- Preference for a model that acknowledges potential errors over one that confidently provides incorrect solutions
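One common way to quantify whether a model "acknowledges potential errors" is calibration: does its stated confidence match its actual accuracy? The speakers don't prescribe a metric; the sketch below is just the standard expected calibration error (ECE), with the confidence/correctness data assumed to come from evaluation runs:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence and compare each bin's
    average confidence with its actual accuracy; a well-calibrated
    model has a small weighted gap (standard ECE)."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += len(b) / total * abs(avg_conf - accuracy)
    return ece
```

A model that is 90% confident and right 90% of the time scores near zero; a model that is always certain but often wrong scores high, which is exactly the "confidently incorrect" failure mode described above.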
Perspectives on Abstract Reasoning Benchmarks
- The conversation focuses on AI evaluation methods, specifically discussing ARC-AGI and abstract reasoning benchmarks
- Perspectives on ARC-AGI and abstract reasoning:
- ARC-AGI is described as attempting to measure reasoning through an abstract IQ test
- The speakers are cautious about over-emphasizing abstract reasoning benchmarks
- They argue that such tests may not directly translate to real-world utility
- Prioritize evaluations that test basic model functionality
- Prefer direct, quick-answer assessments without complex reasoning chains
- Skeptical of benchmarks that can be optimized without improving general capabilities
- Practical considerations:
- Databricks' customers are more interested in practical AI applications
- Real-world use cases are more valuable than abstract reasoning tests
- Example of a useful AI: a model that can converse about data and run SQL queries
Retrieval-Augmented Generation and Long Context
- Retrieval-Augmented Generation (RAG):
- Considered "the world's simplest agent"
- Allows models to decide when to retrieve data from context/database
- Represents an early form of agent-like capability
- Long context considerations:
- Long contexts are not inherently problematic
- Potential benefits include thousand-shot tasks as alternative to fine-tuning, pulling large amounts of data into context
- Inevitable in multimodal scenarios
- Challenges with long context evaluation:
- Annotating long context is extremely difficult and expensive
- Requires human reading of massive token sets
- "Needle in a haystack" tests are problematic because they don't measure holistic context usage and can be gamed
- Practical long context applications:
- Coding scenarios: Filtering relevant repository context
- Sorting contextual information by importance
- Avoiding computational waste during inference
Code and Tool Use Approach
- Focus on robust code writing, execution, and debugging rather than hard-coded agents with limited tools
- Believe that improving code capabilities will dramatically expand potential actions and API interactions
- View code as a versatile "god tool" for solving complex problems
- Structured data interaction:
- Tool use is fundamentally about helping models interact effectively with structured data
- Recommend continuing to use structured data APIs and languages instead of flattening everything into an LLM context
- Text-to-SQL and backend SQL calls are particularly useful for practical applications
- Challenges with current LLM approaches:
- Frustration with LLMs treating complex systems (like programming languages) as simple token streams
- Despite decades of understanding programming language structures, models are forced to relearn everything from scratch
- Current approach loses nuanced structural knowledge
- Data representation considerations:
- Different tools suit different problems: SQL for tabular data, knowledge graphs for complex entity relationships
- Acknowledge knowledge graphs have limitations in handling messy, ambiguous real-world relationships
- No single universal data representation tool exists
Future Plans and Closing Thoughts
- Databricks team is focused on:
- Making AI capabilities practically useful for real-world workflows
- Improving code generation, understanding, testing, and verification
- Serving their 12,000 customers
- Writing and sharing more blog posts about their scientific work
- Potentially releasing new models (including a teased small model called "Abra")
- Continuing to develop internal product prototypes
- Calls to action and opportunities:
- Upcoming release at AI Engineer World's Fair (ai.engineer live stream)
- Job opportunities for those interested in code reasoning, understanding hardware and model mechanics, and designing practical AI systems
- Encouragement for those feeling exhausted or overwhelmed in the AI field
- Reminder that even in seemingly well-understood domains (like cluster setup or evaluations), there are still more insights to discover and substantial work left to do
- Despite feeling like the AI field is crowded and resource-intensive, there remains enormous potential for impactful work, significant unexplored areas, and opportunities for fresh perspectives