
Latent Space: The AI Engineer Podcast

2024 in Post-Transformers Architectures (State Space Models, RWKV) [LS Live @ NeurIPS]

Overview

  • Alternative transformer architectures are addressing the quadratic scaling problem of attention mechanisms, with innovations like State Space Models and Linear Attention achieving N log N compute complexity for long sequences while maintaining quality.
  • The RWKV (Receptance Weighted Key Value) architecture demonstrates that efficient models can match much larger ones at significantly lower computational cost, enabling AI deployment on low-power devices.
  • Research suggests that hybrid architectures combining traditional transformers with state-based models (like RWKV and Mamba) unexpectedly outperform baseline models, opening new avenues for model design that balances efficiency and capability.
  • The future of AI may not require ever-increasing context lengths, as fixed-state approaches with efficient reasoning capabilities could replace resource-intensive models for most enterprise applications while enabling new capabilities like real-time video generation.
  • Hardware-model co-design is emerging as a critical strategy, with innovations like the ThunderKittens CUDA library optimizing for specific hardware capabilities to maximize the performance of these alternative architectures.

Content

Introduction and Context

  • The discussion takes place at Latent Space Live mini-conference at NeurIPS 2024 in Vancouver
  • Speakers: Dan Fu (Together AI, incoming UCSD faculty) and Eugene Cheah (Featherless AI CEO)
  • Topic: Alternative transformer architectures and scaling challenges in AI models

Scaling Challenges in AI Models

  • Models have scaled dramatically in:
    - Parameter size
    - Context length
    - Compute requirements (both training and inference)
  • Core technical challenge: Attention mechanisms scale quadratically with context length
  • Current approach requires increasingly large data centers and computational resources
  • Key research question: Can sequence models be scaled more efficiently?
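The quadratic cost of attention is easy to see with back-of-envelope arithmetic. A minimal sketch (values and function name are illustrative, not from the talk):

```python
# Rough illustration: attention materializes an N x N score matrix per head,
# so doubling the context length quadruples attention compute and memory.
def attention_score_elements(n_tokens: int) -> int:
    """Number of pairwise attention scores for one head."""
    return n_tokens * n_tokens

print(attention_score_elements(4_096))  # 16,777,216
print(attention_score_elements(8_192))  # 67,108,864 -> 4x the work for 2x context
```

This is why scaling context in standard transformers translates directly into larger data centers rather than proportionally larger budgets.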

Evolution of Attention Mechanisms

  • Standard attention is quadratic: each token is compared to every other token
  • This becomes computationally expensive for large inputs (like entire books)
  • Linear Attention (circa 2020):
    - Attempted to remove the softmax non-linearity
    - Aimed to avoid the quadratic computational bottleneck
    - Initially faced challenges with result quality and hardware inefficiency
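The core linear-attention trick is that once the softmax is replaced by a feature map, matrix multiplication can be reassociated to avoid the N x N score matrix entirely. A minimal sketch (the feature map `phi` and all sizes are toy assumptions, not the specific method from any paper discussed):

```python
import numpy as np

# Toy linear-attention reordering: with a kernel feature map phi in place of
# softmax, (phi(Q) phi(K)^T) V can be computed as phi(Q) (phi(K)^T V).
rng = np.random.default_rng(0)
N, d = 512, 64
Q, K, V = rng.normal(size=(3, N, d))
phi = lambda x: np.maximum(x, 0) + 1e-6  # placeholder positive feature map

# Quadratic form: materializes an N x N matrix -> O(N^2 d)
scores = phi(Q) @ phi(K).T               # (N, N)
out_quadratic = scores @ V

# Linear form: compute the (d, d) summary first -> O(N d^2), no N x N matrix
kv = phi(K).T @ V                         # (d, d)
out_linear = phi(Q) @ kv

assert np.allclose(out_quadratic, out_linear)
```

Real linear-attention variants also normalize the outputs per row; the sketch omits that to keep the reassociation visible.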

  • State Space Models (2022):
    - Breakthrough in post-transformer architectures
    - Incorporated signal processing and dynamical systems modeling techniques
    - Brought two significant innovations:
      1. Improved model quality using sophisticated recurrent update models
      2. More efficient computation methods
    - Showed improvements in long sequence evaluations, time series analysis, and the Long Range Arena benchmark
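At their core, state space models run a linear recurrence over a fixed-size hidden state, so memory does not grow with sequence length. A minimal discretized sketch (all shapes and matrices are toy values):

```python
import numpy as np

# Toy discretized state space recurrence:
#   h_t = A h_{t-1} + B x_t ;   y_t = C h_t
# The state h has fixed size regardless of how long the sequence is.
rng = np.random.default_rng(0)
d_state, seq_len = 8, 100
A = np.eye(d_state) * 0.9           # stable recurrent update
B = rng.normal(size=(d_state, 1))
C = rng.normal(size=(1, d_state))
x = rng.normal(size=seq_len)

h = np.zeros((d_state, 1))
ys = []
for t in range(seq_len):
    h = A @ h + B * x[t]            # constant-memory state update
    ys.append(float(C @ h))
print(len(ys))  # 100 outputs, one per input token
```

The modeling advances in this line of work are largely about how A, B, and C are parameterized and discretized, which the sketch deliberately glosses over.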

Advances in Non-Transformer Sequence Models (2022-2023)

  • Key Breakthrough Ideas:
    - Reformulating recurrent models as convolutions
    - Computing convolutions efficiently using the Fast Fourier Transform (FFT)
    - Achieving N log N compute complexity for sequence length N
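The recurrence-as-convolution equivalence can be checked directly: a linear recurrence with input-independent dynamics unrolls into a causal convolution, which the FFT computes in O(N log N). A toy sketch with a scalar state:

```python
import numpy as np

# A linear recurrence h_t = a h_{t-1} + x_t unrolls to y_t = sum_s a^(t-s) x_s,
# i.e. a causal convolution with kernel k_t = a^t (scalar toy case, B = C = 1).
N = 256
a = 0.9
x = np.random.default_rng(0).normal(size=N)
k = a ** np.arange(N)

# Direct recurrence, O(N) sequential steps
h, y_rec = 0.0, []
for t in range(N):
    h = a * h + x[t]
    y_rec.append(h)
y_rec = np.array(y_rec)

# Same result via FFT convolution, zero-padded to 2N to avoid circular wrap
L = 2 * N
y_fft = np.fft.irfft(np.fft.rfft(k, L) * np.fft.rfft(x, L), L)[:N]

assert np.allclose(y_rec, y_fft)
```

The FFT form is what makes these models parallelizable over the sequence at training time, instead of stepping token by token.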

  • Performance Optimization:
    - Development of specialized kernels (e.g., FlashFFTConv)
    - Focus on improving wall clock speed and hardware efficiency
    - Creating kernels optimized for modern GPU hardware

  • Selection Mechanisms in Sequence Models:
    - Introducing element-wise gates to improve model performance
    - Examples: H3 (Hungry Hungry Hippos), Hyena models, Mamba
    - Making state space model matrices data-dependent
    - Allowing better selection of information from hidden states
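The "selection" idea is that the state update depends on the current input, so the model can choose what to keep or forget. A rough sketch in the spirit of these gated/selective models (all weights and the exact update rule are illustrative, not the actual Mamba or H3 equations):

```python
import numpy as np

# Toy data-dependent gate: the decay applied to the hidden state is a
# function of the current input rather than a fixed constant.
rng = np.random.default_rng(0)
d = 4
W_gate = rng.normal(size=(d, d))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_step(h, x):
    g = sigmoid(W_gate @ x)         # element-wise gate, computed from the input
    return g * h + (1.0 - g) * x    # gated blend of old state and new input

h = np.zeros(d)
for x in rng.normal(size=(10, d)):
    h = selective_step(h, x)
print(h.shape)  # state stays fixed-size regardless of sequence length
```

Making the gate input-dependent is what breaks the pure convolutional form, which is why selective models needed new scan-based kernels rather than the FFT trick.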

  • Linear Attention Developments:
    - Revival of linear attention techniques
    - Example: the Based model, using a Taylor approximation of softmax
    - Improving the memory recall vs. recurrent state size trade-off
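The Taylor-approximation idea is that `exp(s)` in softmax attention can be replaced by a low-order polynomial, which admits a finite feature map and hence a fixed-size recurrent state. A minimal numeric sketch (second-order expansion; this illustrates the approximation, not the full architecture):

```python
import numpy as np

# Second-order Taylor series of exp around 0: exp(s) ~ 1 + s + s^2/2.
# Polynomial kernels like this can be written as finite feature maps,
# which is what makes the resulting attention linear in sequence length.
def taylor_exp(s):
    return 1.0 + s + 0.5 * s * s

for s in [0.0, 0.25, 0.5]:
    print(s, np.exp(s), taylor_exp(s))
```

The approximation is tight for small attention scores and degrades for large ones, which is exactly the recall-vs-state-size trade-off mentioned above.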

Emerging Model Architectures

  • Non-transformer architectures showing promise across various applications
  • Notable recent developments:
    - Jamba: A hybrid Mixture of Experts (MoE) model
    - SANA: A diffusion model using linear attention
    - Gated State Space Models (SSMs) applied to DNA modeling

RWKV Architecture Deep Dive

  • RWKV (Receptance Weighted Key Value) aims to:
    - Lower computational costs
    - Enable running models on low-power devices like Raspberry Pis
    - Break traditional RNN token flow dependencies

  • Key architectural innovations:
    - Two main blocks: "time mix" and "channel mix"
    - Time mix handles long-term memory states
    - Channel mix handles shorter-term token interactions
    - Iterative development (currently at version 7, codenamed "Goose")
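A very rough sketch of the channel-mix idea: each token is interpolated with the previous token ("token shift"), so the block sees short-range context without any attention. All weights are toy values, and the real RWKV block also includes a receptance gate that this sketch omits:

```python
import numpy as np

# Simplified RWKV-style "channel mix" (toy weights, receptance gate omitted).
rng = np.random.default_rng(0)
d = 8
W_k = rng.normal(size=(d, d)) * 0.1
W_v = rng.normal(size=(d, d)) * 0.1
mix = rng.uniform(size=d)                  # learned interpolation weights

def channel_mix(x_t, x_prev):
    xk = mix * x_t + (1.0 - mix) * x_prev  # token shift: blend with prior token
    k = np.maximum(W_k @ xk, 0.0) ** 2     # squared ReLU, as used in RWKV
    return W_v @ k

x_prev = np.zeros(d)
for x_t in rng.normal(size=(5, d)):
    y = channel_mix(x_t, x_prev)
    x_prev = x_t
print(y.shape)
```

The time-mix block plays the complementary role, maintaining a decaying recurrent state for long-range information.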

  • QRWKV6 Development:
    - Converted a 32B parameter model by:
      * Freezing feedforward layers
      * Removing original attention layers
      * Replacing them with RWKV linear layers
    - Trained in stages with a custom learning rate schedule
    - Achieved performance on par with the original Qwen 32B model after just a few hours of training
    - MMLU score dropped to 76%, which was expected due to "brain damage" to the feedforward network
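The conversion recipe can be sketched in pseudocode-style Python (the parameter names and dict-based "model" are invented for illustration; the actual pipeline operates on real checkpoint weights):

```python
# Hypothetical sketch of the staged conversion: freeze the feedforward
# weights to preserve their knowledge, drop the attention layers, and
# train only the newly inserted RWKV-style linear layers.
model = {
    "ffn.0.weight": {"trainable": True},
    "attn.0.weight": {"trainable": True},
}

# Stage 1: freeze feedforward layers
for name, p in model.items():
    if name.startswith("ffn."):
        p["trainable"] = False

# Stage 2: remove attention layers and add fresh RWKV linear layers
model = {k: v for k, v in model.items() if not k.startswith("attn.")}
model["rwkv.0.weight"] = {"trainable": True}

print(sorted(model))
```

Because only the new layers are trained, the process needs hours rather than a full pretraining run, at the cost of the quality drop noted above.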

Hybrid Model Insights

  • Currently converting SMTP model to test attention mechanics
  • Exploring hybrid architectures with state-based models (RWKV, Mamba)
  • Surprisingly, hybrid models outperform baseline models, though reasons are not fully understood
  • Conversion process may not be exclusive to RWKV and could work with other models

Hardware and Model Co-design

  • Developed ThunderKittens, a CUDA library focused on matrix-based compute primitives
  • Goal is to simplify CUDA development and optimize for hardware features like the H100's warp group matrix multiply (WGMMA) instructions
  • Emphasizes designing model architectures with specific hardware capabilities in mind

Context and Memory in AI Models

  • Exploring how models might handle extensive context without quadratic computational complexity
  • Comparing AI memory to human memory (humans don't remember everything perfectly)
  • Examining models like RWKV, state space, and XLSTM with potential for "infinite context"
  • Proposing a fixed state size approach instead of exponentially expanding computational costs
  • Noting state size limitations (e.g., RWKV currently at 14MB, potentially expanding to 400MB)
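The memory argument for fixed-state models is concrete: a transformer's KV cache grows linearly with context, while a recurrent state does not. A back-of-envelope sketch (layer counts, head sizes, and fp16 storage are assumed values for a generic large model, not a specific one from the talk):

```python
# Rough KV-cache size for a transformer (assumed 32 layers, 32 heads,
# head dim 128, fp16): keys + values stored for every token in context.
def kv_cache_bytes(n_tokens, n_layers=32, n_heads=32, head_dim=128,
                   bytes_per_elem=2):
    return n_tokens * n_layers * 2 * n_heads * head_dim * bytes_per_elem

print(kv_cache_bytes(8_192) / 2**30)      # GiB at 8k context
print(kv_cache_bytes(1_000_000) / 2**30)  # GiB at 1M context
# A fixed recurrent state (e.g. the ~14 MB figure quoted above) stays
# constant no matter how long the input is.
```

Under these assumptions the cache grows from a few GiB at 8k tokens to hundreds of GiB at a million tokens, which is the VRAM bottleneck discussed in the next section.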

Practical Limitations of Long Context Models

  • Long context doesn't necessarily matter as much as people think
  • Very few people are actually using extremely long context prompts (2-3 million tokens)
  • Big tech labs (like Google, OpenAI) are driving long context research more than others
  • VRAM consumption is a significant bottleneck for training long context models
  • o1-style reasoning (spending more inference-time compute) might be more promising than simply increasing context length
  • Smaller, more efficient models that can reason effectively might be preferable to large, resource-intensive models

Future Directions

  • Interested in scaling to 128k context length
  • Potential to replace majority of current enterprise AI workloads
  • Exploring next-generation models beyond language, including real-time video generation
  • Developing models that can "watch forever" without becoming unstable
  • Creating evaluation metrics for long-context model performance
  • Focusing on model architectures that remain stable across varying context lengths

Applications and Strengths

  • These alternative models are particularly effective for time series data modeling
  • Especially strong in predicting future outcomes (e.g., weather forecasting)
  • Potentially superior to transformer architectures when controlled for parameter size
  • Promising for applications focused on predicting "what's next" rather than analyzing distant past events
