
Latent Space: The AI Engineer Podcast

How to train a Million Context LLM — with Mark Huang of Gradient.ai

Overview

  • Long context capabilities represent a significant frontier in AI development, with Gradient extending Llama 3's context window from 8,000 to 1 million tokens through techniques like RoPE scaling and specialized attention mechanisms, enabling applications from code repositories to finance.
  • The technical approach to context extension involves careful consideration of positional encodings, curriculum learning (progressively increasing context length), and data quality, with implementation challenges including GPU memory bandwidth utilization and network topology optimization.
  • Evaluation of long-context models requires sophisticated benchmarking beyond simple "needle in a haystack" tests, including multiple retrievals, variable tracking, and summary statistics generation, with performance degradation becoming apparent at extremely large contexts (4M+ tokens).
  • Multimodality is emerging as the next critical frontier in AI development, with early fusion models showing promise for integrating videos, images, and text in ways that provide genuine user value rather than just technical complexity.
  • The AI research landscape has experienced a 10x information explosion, requiring careful filtering strategies that prioritize practical applications, with Twitter and hands-on product testing proving more valuable for staying current than traditional academic conferences.

Content

Background and Professional Journey

  • Mark Huang is a former quantitative finance professional who transitioned to tech
  • Worked as lead data scientist at Box and staff ML scientist at Splunk
  • Moved from finance to tech to gain more experience with big data and machine learning at scale
  • Notes a trend of finance professionals moving into tech and AI
  • Sees current AI landscape as similar to previous "trading wars" in terms of talent competition
  • Feels empowered by OpenAI's developments to create impactful products

Gradient - Company Overview and Formation

  • Gradient is a full-stack AI platform
  • Goal: Enable enterprises to transition from traditional RPA (Robotic Process Automation) to more autonomous, "agentic" workflows
  • Aims to create a horizontal platform for AI workforce transformation
  • Formed a team with Chris Chang (former Meta/Google/Netflix engineer)
  • Motivated by challenges in enterprise ML platforms, particularly frequent workflow migrations
  • Goal was to reduce operational friction in shipping workloads

Agent Definition and Perspective

  • Mark defines an agent beyond just non-deterministic execution
  • Focuses on marginal improvements in probability of success at each workflow stage
  • Acknowledges "agent" is an overloaded term in current AI landscape
  • Emphasizes statistical approach to measuring agent effectiveness

Core Technical Vision

  • Focus on developing systems that can handle "out of domain" problems
  • Emphasize machine learning as a continuous learning process
  • Desire for AI systems that grow and adapt alongside users
  • Viewed the project as part of broader "meta learning" workflow
  • Interested in adaptable AI systems that can generalize across different domains

Long Context Learning Project

  • Chose to extend Llama 3's context length
  • Motivated by existing models' short context windows (8,000 tokens)
  • Inspired by Google's Gemini with 1 million token context length
  • Viewed language models as "compression algorithms"
  • Aimed to push context length boundaries

Computational Considerations

  • Acknowledged significant computational requirements
  • Worked with Crusoe (computational infrastructure provider) to facilitate the project
  • Recognized not everyone can easily undertake such computational challenges
  • Discussed GPU cloud providers and their collaboration to scale up computational resources using L40s GPU instances
  • Combined flash attention and ring attention for training
  • Ring attention is primarily about better GPU memory bandwidth utilization
  • Evaluated multiple implementation approaches for ring attention
  • Original JAX implementation was not GPU-friendly
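The memory motivation for ring attention can be made concrete with a back-of-envelope calculation. The sketch below is illustrative only (head count and 2-byte bf16 elements are assumptions, not figures from the episode): naively materializing the attention score matrix at 1M tokens is physically impossible on a single GPU, which is why ring attention shards the sequence across devices and streams key/value blocks around the ring, while flash attention avoids materializing the matrix at all.

```python
# Back-of-envelope memory for the naive attention score matrix, illustrating
# why long-context training needs ring attention (sequence sharded across
# GPUs) combined with flash attention (scores never fully materialized).
def attention_score_bytes(seq_len: int, n_heads: int, bytes_per_elem: int = 2) -> int:
    # one (seq_len x seq_len) score matrix per head, stored in bf16 (2 bytes)
    return n_heads * seq_len * seq_len * bytes_per_elem

# assumed Llama-3-8B-style config with 32 attention heads
full = attention_score_bytes(1_000_000, 32)
print(f"naive scores at 1M tokens: {full / 1e12:.0f} TB")  # 64 TB -- infeasible

# with the sequence sharded across, say, 64 devices, each device only ever
# holds a 1M/64-token slice of queries against streamed key/value blocks
per_device = attention_score_bytes(1_000_000 // 64, 32)
print(f"per-device slice across 64 GPUs: {per_device / 1e9:.0f} GB")  # ~16 GB
```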

Technical Approach to Context Length Extension

  • Self-attention has quadratic memory scaling, making longer context sequences computationally expensive
  • Ongoing research about the best approach to training long context models
  • Curriculum learning (progressively increasing context length) may perform better than training on maximum context length from the start
  • Meta research suggests incrementally increasing context length can improve model performance
  • Data quality is crucial when extending context length
  • Models need good perplexity scores before context length extension
  • The "theta" parameter plays a significant role in determining how far a context can be extended
  • Positional encodings and rope scaling are important technical mechanisms for context extension
  • Practical takeaway: With a 4k context model, you can potentially progressively increase context length if the model shows good initial performance
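The curriculum idea above can be sketched as a simple stage schedule. This is a hypothetical illustration, not Gradient's actual recipe: the doubling factor and the 4k starting point are assumptions, and in practice each stage fine-tunes the checkpoint from the previous stage on progressively longer sequences.

```python
# Hypothetical curriculum schedule for progressive context extension:
# double the training context each stage from a 4k base up to ~1M tokens.
def context_curriculum(start: int = 4096, target: int = 1_048_576) -> list[int]:
    stages, ctx = [], start
    while ctx < target:
        ctx *= 2
        stages.append(ctx)
    return stages

# each listed stage would fine-tune the checkpoint from the previous one
print(context_curriculum())
```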

Technical Details on Model Embedding and Scaling

  • Focus on embedding mechanisms, particularly positional encoding techniques
  • Theta scaling described as an empirical method for adjusting embedding distributions
  • Goal is to achieve interpolation rather than extrapolation in model context
  • Approach was developed incrementally, starting at 256 tokens and scaling up
  • Most current architectures are using RoPE (Rotary Positional Embedding) scaling
  • Alibi is less commonly used in recent models
  • YARN can be used alongside RoPE scaling
  • PoSE (a LoRA-based approach) shows some limitations in very long context scenarios
  • The scaling approach is empirical rather than mathematically proven
  • Scaling laws are observed but not guaranteed to continue consistently
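The effect of raising RoPE's theta base can be shown directly from the rotary frequency formula. The sketch below uses illustrative values (head dimension 128; 10,000 as a common default base and 500,000 as a raised base are assumptions for demonstration): a larger theta lowers every rotary frequency, stretching each wavelength so that positions far beyond the original window land inside wavelengths the model has already seen, i.e. interpolation rather than extrapolation.

```python
# RoPE assigns each pair of head dimensions a rotary frequency
# freq_i = theta ** (-2i / dim); raising theta lowers the frequencies.
def rope_frequencies(dim: int, theta: float) -> list[float]:
    return [theta ** (-2 * i / dim) for i in range(dim // 2)]

base = rope_frequencies(128, 10_000.0)     # illustrative default theta
scaled = rope_frequencies(128, 500_000.0)  # illustrative raised theta

# every frequency drops (or stays equal for i=0), so rotary wavelengths
# stretch and extended positions interpolate instead of extrapolating
assert all(s <= b for s, b in zip(scaled, base))
print(f"highest-frequency wavelength stretched by {base[-1] / scaled[-1]:.1f}x")
```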

Implementation Details

  • Discussed an open-source PyTorch implementation by John Payne for context extension
  • Preferred PyTorch over Jax for implementation
  • Adapted the implementation for their specific cluster network topology

Dataset and Training Approach

  • Conducted two-stage model updates:
    - Initial pre-training stage using the SlimPajama dataset
    - Chat fine-tuning stage using UltraChat or derivatives
  • Focused on dataset considerations:
    - Avoiding token truncation
    - Ensuring content diversity
    - Using embeddings for pre-filtering
  • Challenges in model training:
    - Difficulty in injecting truly new knowledge into large language models
    - Models now trained on double-digit trillions of tokens
    - Challenge of maintaining existing capabilities while introducing new information
    - Limited empirical research on expanding model decision boundaries
  • Cautious about assuming small token additions can significantly alter model knowledge
  • Referenced a Llama 2 example where further training potentially degraded language capabilities
  • Emphasized the importance of maintaining model flexibility and generalizability
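One of the dataset considerations above, avoiding token truncation, can be sketched as a packing problem: fill each long-context training window with whole documents rather than cutting any document mid-sequence. This greedy first-fit version is a hypothetical illustration, not Gradient's pipeline; real pipelines also handle tokenization, separators, and oversized documents.

```python
# Greedy first-fit packing: place whole documents into training windows so
# that no document is truncated (a sketch of the "avoid token truncation"
# consideration; real pipelines are more involved).
def pack_documents(doc_lengths: list[int], window: int) -> list[list[int]]:
    bins: list[tuple[int, list[int]]] = []  # (tokens used, doc indices)
    for idx, length in enumerate(doc_lengths):
        if length > window:
            continue  # oversized docs need separate handling, not truncation
        for i, (used, docs) in enumerate(bins):
            if used + length <= window:   # first window with room
                bins[i] = (used + length, docs + [idx])
                break
        else:
            bins.append((length, [idx]))  # open a new window
    return [docs for _, docs in bins]

print(pack_documents([6000, 3000, 5000, 2000], window=8192))  # [[0, 3], [1, 2]]
```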

Advanced Training Techniques

  • Discussing challenges of model training, particularly avoiding overfitting to specific data types
  • Proposing multi-stage training with mixed data sources to prevent deviation
  • Suggesting potential improvements to loss functions to manage data overfitting
  • Using GPT-4 to rephrase and generate new training data tokens
  • Injecting out-of-domain, lower-probability data instances
  • Recognizing data pipeline creation as a potentially significant part of model development
  • Synthetic data generation can represent 25-50% of a dataset
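Hitting a target synthetic-data fraction like the 25-50% band above is a simple mixing calculation. The sketch below is a hypothetical helper, not part of any described pipeline; it treats "synthetic" examples (e.g. GPT-4 rephrasings) as an already-generated pool and samples enough of them to reach the requested fraction of the final mix.

```python
import random

# Hypothetical sketch: blend synthetic examples (e.g. GPT-4-rephrased data)
# into a real dataset so they make up `synth_frac` of the final mix.
def mix_datasets(real: list, synthetic: list, synth_frac: float, seed: int = 0) -> list:
    assert 0.0 <= synth_frac < 1.0
    # n_synth / (n_real + n_synth) == synth_frac  =>  solve for n_synth
    n_synth = min(len(synthetic), int(len(real) * synth_frac / (1.0 - synth_frac)))
    mixed = real + synthetic[:n_synth]
    random.Random(seed).shuffle(mixed)  # fixed seed for reproducibility
    return mixed

mixed = mix_datasets(real=["r"] * 75, synthetic=["s"] * 100, synth_frac=0.25)
print(len(mixed), mixed.count("s") / len(mixed))  # 100 0.25
```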

Model Adaptation Techniques

  • Discussing LoRA (Low-Rank Adaptation) techniques for language models
  • Exploring model "alchemy" - mixing LoRA adapters and model merging
  • Comparing LoRA adaptations across different domains (language models vs. stable diffusion)
  • Techniques for merging machine learning models, particularly using LoRA layers
  • Observations that model merging can be effective for stylistic tasks but may struggle with more complex abilities
  • Merging techniques are seen as potentially "polluting" leaderboards by allowing strategic model combinations
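The basic mechanics of the LoRA mixing discussed above reduce to one equation: the adapted weight is W' = W + scale * (B @ A), where B @ A is the low-rank update, so adapters can be folded into a base weight (and, in "soup"-style merging, averaged first). The toy example below uses plain Python lists and made-up 2x2 matrices purely for illustration.

```python
# Minimal sketch of folding a LoRA adapter into a base weight:
# W' = W + scale * (B @ A), with B (d x r) and A (r x d) low-rank factors.
def matmul(B, A):
    return [[sum(b * a for b, a in zip(row, col)) for col in zip(*A)] for row in B]

def merge_lora(W, B, A, scale=1.0):
    delta = matmul(B, A)  # the low-rank update the adapter learned
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]  # toy 2x2 base weight
B = [[1.0], [0.0]]            # 2x1 factor (rank 1)
A = [[0.5, 0.5]]              # 1x2 factor
print(merge_lora(W, B, A))    # [[1.5, 0.5], [0.0, 1.0]]
```

Averaging the (B @ A) updates of two adapters before folding them in is one simple form of the merging that, as noted above, works better for style than for complex abilities.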

Evaluation Challenges

  • Evaluating model performance is complex due to the high-dimensional, sparse nature of assessments
  • Multiple evaluations provide insights but not a complete picture
  • Highlighting difficulties in evaluating complex, advanced AI tasks
  • Referencing Gemini 1.5 paper's sophisticated evaluation methods (e.g., hiring experts, testing rare language translation)
  • Noting limitations for smaller organizations in conducting comprehensive model assessments

Long Context Benchmarking

  • "Needle in a haystack" benchmark is considered a primitive test for:
    - Language understanding
    - Instruction following
    - Differentiating context and instructions
    - Preventing hallucinations
  • The RULER benchmark includes four key evaluation types:
    1. Multiple needle retrieval
    2. Multi-value, multi-query evaluation
    3. Variable tracking across context
    4. Creating summary statistics (e.g., word counting)
  • Benchmarks are evolving to more comprehensively test models' contextual understanding
  • Growing interest in evaluating models' ability to handle and process long, complex contexts
  • Current benchmarks aim to push beyond simple retrieval methods
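The "needle in a haystack" setup above is mechanically simple, which is why it is considered primitive. The harness below is a hypothetical character-level sketch (real harnesses work in tokens and call an actual model): bury a fact at a chosen depth in filler text, ask for it, and check the answer.

```python
# Hypothetical needle-in-a-haystack probe: plant a fact at a chosen relative
# depth of filler context, then check whether a model's answer recovers it.
FILLER = "The grass is green. The sky is blue. "
NEEDLE = "The magic number is 42."

def build_prompt(context_chars: int, depth: float) -> str:
    haystack = (FILLER * (context_chars // len(FILLER) + 1))[:context_chars]
    pos = int(len(haystack) * depth)  # depth 0.0 = start, 1.0 = end
    doc = haystack[:pos] + " " + NEEDLE + " " + haystack[pos:]
    return doc + "\n\nQuestion: What is the magic number?"

def score(model_answer: str) -> bool:
    return "42" in model_answer  # crude exact-match scoring

prompt = build_prompt(context_chars=2000, depth=0.5)
print(NEEDLE in prompt, score("I believe the magic number is 42."))  # True True
```

Sweeping `context_chars` and `depth` produces the familiar retrieval heatmap; the RULER-style evaluations above extend this with multiple needles, variable tracking, and aggregation.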

Context Window Scaling and Challenges

  • Discussed challenges with context retrieval across multiple documents
  • Expanded from 1 million to 4 million token context, noting performance degradation
  • Encountered technical challenges like floating point precision and potential activation clamping
  • Questioned the value of progressively larger context windows (1M, 2M, 10M, 100M tokens)
  • 1 million tokens seemed like a breakthrough milestone
  • 4 million tokens viewed as an incremental improvement
  • Technical challenges include:
    - Managing "state" in interactive contexts
    - Dealing with retrieval brittleness
    - Preventing computational issues like exploding/vanishing gradients
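The floating-point precision issue mentioned above can be made concrete with bfloat16, the format commonly used for training: its 8-bit significand means integers above 256 are no longer exactly representable, so large intermediate values quietly round. The snippet simulates bf16 by truncating a float32 to its top 16 bits (a simplification of real round-to-nearest conversion, used here only as an illustration).

```python
import struct

# Simulate bfloat16 by keeping only the top 16 bits of a float32
# (sign + 8-bit exponent + 7-bit stored mantissa). Real conversions
# round-to-nearest; truncation is close enough to illustrate the point.
def to_bf16(x: float) -> float:
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

print(to_bf16(256.0))  # 256.0 -- exactly representable
print(to_bf16(257.0))  # 256.0 -- 257 needs 9 significant bits, so it rounds
```

At multi-million-token scales, accumulated sums and large activations live exactly in this low-precision regime, which is one reason clamping and precision issues surface there.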

Potential Use Cases for Large Context Windows

  • Code repositories (entire repo context)
  • Session-based state management
  • Finance sector applications with evolving concepts
  • Challenges in maintaining context across long interactions
  • Potential strategies:
    - Retrieval augmented generation
    - Hierarchical recursive summarization
    - Iterative, agentic approaches
    - Targeted generation techniques
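Of the strategies above, hierarchical recursive summarization has a particularly simple skeleton: summarize fixed-size chunks, then recursively summarize the concatenated summaries until the result fits a budget. The sketch below is a hypothetical illustration; the `summarize` callable stands in for a real LLM call, and the toy version here just keeps each chunk's first sentence to show the recursion.

```python
# Skeleton of hierarchical recursive summarization: chunk, summarize each
# chunk, then recurse on the joined summaries until one chunk remains.
# `summarize` is a stand-in for an LLM call and must shrink its input,
# otherwise the recursion would not terminate.
def hierarchical_summary(text: str, summarize, chunk_size: int = 1000) -> str:
    if len(text) <= chunk_size:
        return summarize(text)
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    merged = " ".join(summarize(c) for c in chunks)
    return hierarchical_summary(merged, summarize, chunk_size)

# toy "summarizer": keep each chunk's first sentence (illustration only)
toy = lambda t: t.split(". ")[0][:120]
doc = "First point of section. Supporting detail follows here. " * 200
print(len(doc), "->", len(hierarchical_summary(doc, toy)))
```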

Future of AI and Multimodality

  • Long context is increasingly important for grounding language models
  • Multimodality is seen as pivotal, especially for integrating:
    - Videos
    - Images
    - Charts
    - Medical images with text
  • Reference to Meta's Chameleon paper highlighting early fusion and multimodal training
  • Discussion of Sam Altman's perspective that AI models will be 10X better in coming years
  • Emphasis on building technology that provides real user value, not just technological complexity
  • The "10X direction" in AI keeps shifting (e.g., from reasoning to multimodality)
  • Current focus seems to be on multimodal capabilities and integration
  • Importance of listening to user needs and building practical, valuable technologies
  • Discussed early fusion models like Chameleon vs. late fusion models
  • Mentioned potential for retroactively integrating images into text encoders

Research and Learning Routine

  • Follows AI news primarily through Twitter, which provides faster updates than academic conferences
  • Monitors specific researchers like Armen from Meta for cutting-edge insights
  • Uses AI tools to search for latest papers on specific topics
  • Actively tries out new AI products to understand underlying research and techniques
  • Academic conferences like ICLR/ICML are often 6 months behind current research
  • Discord is valuable for seeing practical implementations and dataset discussions
  • Trying out products helps understand compressed research and state-of-the-art techniques

Information Filtering and Research Focus

  • Discusses filtering and identifying valuable information in the AI/tech space
  • Focuses on prioritizing useful research and techniques that are practically applicable
  • Areas of interest include:
    - Evaluations
    - Post-training techniques
    - Synthetic data construction
  • Maintains a mental "cache" of existing state-of-the-art knowledge
  • Distinguishes between recycled empirical studies and truly insightful research
  • Highlighted the DeepSeek paper with multi-latent attention as an unexpected novel contribution

Challenges and Collaboration

  • Extremely noisy information space in AI research
  • Massive influx of information (10x explosion)
  • Risk of getting lost in unproductive research rabbit holes
  • Expressed openness to community collaboration on AI projects
  • Suggested that open-source collaboration could be a potential avenue for further research
  • Seeking collaboration on:
    - Long context evaluations
    - Constructing multi-modal datasets (e.g., combining video and text)
  • Encouraging community contribution to developing comprehensive training datasets
  • Consults subject matter experts and discusses potential innovations with professional network
