
Latent Space: The AI Engineer Podcast

2024 in Post-Transformers Architectures (State Space Models, RWKV) [LS Live @ NeurIPS]

Overview

  • Alternative transformer architectures are addressing the quadratic scaling problem of attention mechanisms, with innovations like State Space Models and Linear Attention achieving N log N compute complexity for long sequences while maintaining quality.
  • The RWKV (Receptance Weighted Key Value) architecture demonstrates that efficient models can match much larger ones at significantly lower computational cost, enabling AI deployment on low-power devices.
  • Research suggests that hybrid architectures combining traditional transformers with state-based models (like RWKV and Mamba) unexpectedly outperform baseline models, opening new avenues for model design that balances efficiency and capability.
  • The future of AI may not require ever-increasing context lengths, as fixed-state approaches with efficient reasoning capabilities could replace resource-intensive models for most enterprise applications while enabling new capabilities like real-time video generation.
  • Hardware-model co-design is emerging as a critical strategy, with innovations like the ThunderKittens CUDA library optimizing for specific hardware capabilities to maximize the performance of these alternative architectures.

Content

Introduction and Context

  • The discussion takes place at Latent Space Live mini-conference at NeurIPS 2024 in Vancouver
  • Speakers: Dan Fu (Together AI, incoming UCSD faculty) and Eugene Cheah (Featherless AI CEO)
  • Topic: Alternative transformer architectures and scaling challenges in AI models

Scaling Challenges in AI Models

  • Models have scaled dramatically in:
    - Parameter size
    - Context length
    - Compute requirements (both training and inference)
  • Core technical challenge: Attention mechanisms scale quadratically with context length
  • Current approach requires increasingly large data centers and computational resources
  • Key research question: Can sequence models be scaled more efficiently?
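The quadratic cost of attention is easy to see with back-of-envelope arithmetic. A minimal sketch (values and function name are illustrative, not from the talk):

```python
# Rough illustration: attention materializes an N x N score matrix per head,
# so doubling the context length quadruples attention compute and memory.
def attention_score_elements(n_tokens: int) -> int:
    """Number of pairwise attention scores for one head."""
    return n_tokens * n_tokens

print(attention_score_elements(4_096))  # 16,777,216
print(attention_score_elements(8_192))  # 67,108,864 -> 4x the work for 2x context
```

This is why scaling context in standard transformers translates directly into larger data centers rather than proportionally larger budgets.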

Evolution of Attention Mechanisms

  • Standard attention is quadratic: each token is compared to every other token
  • This becomes computationally expensive for large inputs (like entire books)
  • Linear Attention (circa 2020):
    - Attempted to remove the softmax non-linearity
    - Aimed to avoid the quadratic computational bottleneck
    - Initially faced challenges with result quality and hardware inefficiency
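The core linear-attention trick is that once the softmax is replaced by a feature map, matrix multiplication can be reassociated to avoid the N x N score matrix entirely. A minimal sketch (the feature map `phi` and all sizes are toy assumptions, not the specific method from any paper discussed):

```python
import numpy as np

# Toy linear-attention reordering: with a kernel feature map phi in place of
# softmax, (phi(Q) phi(K)^T) V can be computed as phi(Q) (phi(K)^T V).
rng = np.random.default_rng(0)
N, d = 512, 64
Q, K, V = rng.normal(size=(3, N, d))
phi = lambda x: np.maximum(x, 0) + 1e-6  # placeholder positive feature map

# Quadratic form: materializes an N x N matrix -> O(N^2 d)
scores = phi(Q) @ phi(K).T               # (N, N)
out_quadratic = scores @ V

# Linear form: compute the (d, d) summary first -> O(N d^2), no N x N matrix
kv = phi(K).T @ V                         # (d, d)
out_linear = phi(Q) @ kv

assert np.allclose(out_quadratic, out_linear)
```

Real linear-attention variants also normalize the outputs per row; the sketch omits that to keep the reassociation visible.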

  • State Space Models (2022):
    - Breakthrough in post-transformer architectures
    - Incorporated signal processing and dynamical systems modeling techniques
    - Brought two significant innovations:
      1. Improved model quality using sophisticated recurrent update models
      2. More efficient computation methods
    - Showed improvements in long sequence evaluations, time series analysis, and the Long Range Arena benchmark
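At their core, state space models run a linear recurrence over a fixed-size hidden state, so memory does not grow with sequence length. A minimal discretized sketch (all shapes and matrices are toy values):

```python
import numpy as np

# Toy discretized state space recurrence:
#   h_t = A h_{t-1} + B x_t ;   y_t = C h_t
# The state h has fixed size regardless of how long the sequence is.
rng = np.random.default_rng(0)
d_state, seq_len = 8, 100
A = np.eye(d_state) * 0.9           # stable recurrent update
B = rng.normal(size=(d_state, 1))
C = rng.normal(size=(1, d_state))
x = rng.normal(size=seq_len)

h = np.zeros((d_state, 1))
ys = []
for t in range(seq_len):
    h = A @ h + B * x[t]            # constant-memory state update
    ys.append(float(C @ h))
print(len(ys))  # 100 outputs, one per input token
```

The modeling advances in this line of work are largely about how A, B, and C are parameterized and discretized, which the sketch deliberately glosses over.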

Advances in Non-Transformer Sequence Models (2022-2023)

  • Key Breakthrough Ideas:
    - Reformulating recurrent models as convolutions
    - Computing convolutions efficiently using the Fast Fourier Transform (FFT)
    - Achieving N log N compute complexity for sequence length N
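The recurrence-as-convolution equivalence can be checked directly: a linear recurrence with input-independent dynamics unrolls into a causal convolution, which the FFT computes in O(N log N). A toy sketch with a scalar state:

```python
import numpy as np

# A linear recurrence h_t = a h_{t-1} + x_t unrolls to y_t = sum_s a^(t-s) x_s,
# i.e. a causal convolution with kernel k_t = a^t (scalar toy case, B = C = 1).
N = 256
a = 0.9
x = np.random.default_rng(0).normal(size=N)
k = a ** np.arange(N)

# Direct recurrence, O(N) sequential steps
h, y_rec = 0.0, []
for t in range(N):
    h = a * h + x[t]
    y_rec.append(h)
y_rec = np.array(y_rec)

# Same result via FFT convolution, zero-padded to 2N to avoid circular wrap
L = 2 * N
y_fft = np.fft.irfft(np.fft.rfft(k, L) * np.fft.rfft(x, L), L)[:N]

assert np.allclose(y_rec, y_fft)
```

The FFT form is what makes these models parallelizable over the sequence at training time, instead of stepping token by token.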

  • Performance Optimization:
    - Development of specialized kernels (e.g., FlashFFTConv)
    - Focus on improving wall clock speed and hardware efficiency
    - Creating kernels optimized for modern GPU hardware

  • Selection Mechanisms in Sequence Models:
    - Introducing element-wise gates to improve model performance
    - Examples: H3 (Hungry Hungry Hippos), Hyena models, Mamba
    - Making state space model matrices data-dependent
    - Allowing better selection of information from hidden states
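The "selection" idea is that the state update depends on the current input, so the model can choose what to keep or forget. A rough sketch in the spirit of these gated/selective models (all weights and the exact update rule are illustrative, not the actual Mamba or H3 equations):

```python
import numpy as np

# Toy data-dependent gate: the decay applied to the hidden state is a
# function of the current input rather than a fixed constant.
rng = np.random.default_rng(0)
d = 4
W_gate = rng.normal(size=(d, d))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_step(h, x):
    g = sigmoid(W_gate @ x)         # element-wise gate, computed from the input
    return g * h + (1.0 - g) * x    # gated blend of old state and new input

h = np.zeros(d)
for x in rng.normal(size=(10, d)):
    h = selective_step(h, x)
print(h.shape)  # state stays fixed-size regardless of sequence length
```

Making the gate input-dependent is what breaks the pure convolutional form, which is why selective models needed new scan-based kernels rather than the FFT trick.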

  • Linear Attention Developments:
    - Revival of linear attention techniques
    - Example: the Based model, using a Taylor approximation of softmax
    - Improving the memory recall vs. recurrent state size trade-off
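The Taylor-approximation idea is that `exp(s)` in softmax attention can be replaced by a low-order polynomial, which admits a finite feature map and hence a fixed-size recurrent state. A minimal numeric sketch (second-order expansion; this illustrates the approximation, not the full architecture):

```python
import numpy as np

# Second-order Taylor series of exp around 0: exp(s) ~ 1 + s + s^2/2.
# Polynomial kernels like this can be written as finite feature maps,
# which is what makes the resulting attention linear in sequence length.
def taylor_exp(s):
    return 1.0 + s + 0.5 * s * s

for s in [0.0, 0.25, 0.5]:
    print(s, np.exp(s), taylor_exp(s))
```

The approximation is tight for small attention scores and degrades for large ones, which is exactly the recall-vs-state-size trade-off mentioned above.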

Emerging Model Architectures

  • Non-transformer architectures showing promise across various applications
  • Notable recent developments:
    - Jamba: A hybrid Mixture of Experts (MoE) model
    - SANA: A diffusion model using linear attention
    - Gated State Space Models (SSMs) applied to DNA modeling

RWKV Architecture Deep Dive

  • RWKV (Receptance Weighted Key Value) aims to:
    - Lower computational costs
    - Enable running models on low-power devices like Raspberry Pis
    - Break traditional RNN token flow dependencies

  • Key architectural innovations:
    - Two main blocks: "time mix" and "channel mix"
    - Time mix handles long-term memory states
    - Channel mix handles shorter-term token interactions
    - Iterative development (currently at version 7, codenamed "Goose")
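A very rough sketch of the channel-mix idea: each token is interpolated with the previous token ("token shift"), so the block sees short-range context without any attention. All weights are toy values, and the real RWKV block also includes a receptance gate that this sketch omits:

```python
import numpy as np

# Simplified RWKV-style "channel mix" (toy weights, receptance gate omitted).
rng = np.random.default_rng(0)
d = 8
W_k = rng.normal(size=(d, d)) * 0.1
W_v = rng.normal(size=(d, d)) * 0.1
mix = rng.uniform(size=d)                  # learned interpolation weights

def channel_mix(x_t, x_prev):
    xk = mix * x_t + (1.0 - mix) * x_prev  # token shift: blend with prior token
    k = np.maximum(W_k @ xk, 0.0) ** 2     # squared ReLU, as used in RWKV
    return W_v @ k

x_prev = np.zeros(d)
for x_t in rng.normal(size=(5, d)):
    y = channel_mix(x_t, x_prev)
    x_prev = x_t
print(y.shape)
```

The time-mix block plays the complementary role, maintaining a decaying recurrent state for long-range information.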

  • QRWKV6 Development:
    - Converted a 32B parameter model by:
      * Freezing feedforward layers
      * Removing original attention layers
      * Replacing them with RWKV linear layers
    - Trained in stages with a custom learning rate schedule
    - Achieved performance on par with the original Qwen 32B model after just a few hours of training
    - MMLU score dropped to 76%, which was expected due to "brain damage" to the feedforward network
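The conversion recipe can be sketched in pseudocode-style Python (the parameter names and dict-based "model" are invented for illustration; the actual pipeline operates on real checkpoint weights):

```python
# Hypothetical sketch of the staged conversion: freeze the feedforward
# weights to preserve their knowledge, drop the attention layers, and
# train only the newly inserted RWKV-style linear layers.
model = {
    "ffn.0.weight": {"trainable": True},
    "attn.0.weight": {"trainable": True},
}

# Stage 1: freeze feedforward layers
for name, p in model.items():
    if name.startswith("ffn."):
        p["trainable"] = False

# Stage 2: remove attention layers and add fresh RWKV linear layers
model = {k: v for k, v in model.items() if not k.startswith("attn.")}
model["rwkv.0.weight"] = {"trainable": True}

print(sorted(model))
```

Because only the new layers are trained, the process needs hours rather than a full pretraining run, at the cost of the quality drop noted above.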

Hybrid Model Insights

  • Currently converting SMTP model to test attention mechanics
  • Exploring hybrid architectures with state-based models (RWKV, Mamba)
  • Surprisingly, hybrid models outperform baseline models, though reasons are not fully understood
  • Conversion process may not be exclusive to RWKV and could work with other models

Hardware and Model Co-design

  • Developed ThunderKittens, a CUDA library focused on matrix-based compute primitives
  • Goal is to simplify CUDA development and optimize for hardware features like the H100's warp group matrix multiply (WGMMA) instructions
  • Emphasizes designing model architectures with specific hardware capabilities in mind

Context and Memory in AI Models

  • Exploring how models might handle extensive context without quadratic computational complexity
  • Comparing AI memory to human memory (humans don't remember everything perfectly)
  • Examining models like RWKV, state space, and XLSTM with potential for "infinite context"
  • Proposing a fixed state size approach instead of exponentially expanding computational costs
  • Noting state size limitations (e.g., RWKV currently at 14MB, potentially expanding to 400MB)
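The memory argument for fixed-state models is concrete: a transformer's KV cache grows linearly with context, while a recurrent state does not. A back-of-envelope sketch (layer counts, head sizes, and fp16 storage are assumed values for a generic large model, not a specific one from the talk):

```python
# Rough KV-cache size for a transformer (assumed 32 layers, 32 heads,
# head dim 128, fp16): keys + values stored for every token in context.
def kv_cache_bytes(n_tokens, n_layers=32, n_heads=32, head_dim=128,
                   bytes_per_elem=2):
    return n_tokens * n_layers * 2 * n_heads * head_dim * bytes_per_elem

print(kv_cache_bytes(8_192) / 2**30)      # GiB at 8k context
print(kv_cache_bytes(1_000_000) / 2**30)  # GiB at 1M context
# A fixed recurrent state (e.g. the ~14 MB figure quoted above) stays
# constant no matter how long the input is.
```

Under these assumptions the cache grows from a few GiB at 8k tokens to hundreds of GiB at a million tokens, which is the VRAM bottleneck discussed in the next section.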

Practical Limitations of Long Context Models

  • Long context doesn't necessarily matter as much as people think
  • Very few people are actually using extremely long context prompts (2-3 million tokens)
  • Big tech labs (like Google, OpenAI) are driving long context research more than others
  • VRAM consumption is a significant bottleneck for training long context models
  • o1-style reasoning (spending more inference-time compute) might be more promising than simply increasing context length
  • Smaller, more efficient models that can reason effectively might be preferable to large, resource-intensive models

Future Directions

  • Interested in scaling to 128k context length
  • Potential to replace majority of current enterprise AI workloads
  • Exploring next-generation models beyond language, including real-time video generation
  • Developing models that can "watch forever" without becoming unstable
  • Creating evaluation metrics for long-context model performance
  • Focusing on model architectures that remain stable across varying context lengths

Applications and Strengths

  • These alternative models are particularly effective for time series data modeling
  • Especially strong in predicting future outcomes (e.g., weather forecasting)
  • Potentially superior to transformer architectures when controlled for parameter size
  • Promising for applications focused on predicting "what's next" rather than analyzing distant past events
