ICLR 2024 — Best Papers & Talks (ImageGen, Vision, Transformers, State Space Models) ft. Durk Kingma, Christian Szegedy, Ilya Sutskever
Overview
The Variational Autoencoder (VAE) paper by Kingma and Welling received ICLR's inaugural Test of Time Award, representing a breakthrough that bridged deep learning with probabilistic models through innovations like the reparameterization trick and amortized inference, enabling applications across recommender systems, healthcare, chemistry, and physics.
Recent text-to-image diffusion model advances include Stable Cascade's three-stage architecture that achieves 42x compression of image latents, significantly reducing computational requirements while improving quality, alongside new methods for understanding and controlling internal concept representations.
State Space Models and the Mamba architecture offer compelling alternatives to transformers for handling long contexts with sub-quadratic complexity, introducing data-dependent parameters that enable dynamic filtering and state resetting, showing superior performance in byte-level language modeling and diffusion models.
Research on extending context windows in language models has produced multiple effective approaches including LongLoRA's shifted sparse attention, YaRN's selective dimension interpolation, and LongRoPE's extended rotary position embeddings, enabling models to process sequences of 100,000+ tokens with minimal fine-tuning.
Training and inference optimization techniques like ZeRO++ and FastGen's KV cache compression are addressing computational bottlenecks, with innovations in quantization, gradient handling, and memory management achieving up to 40% memory reduction and 2x training speedups for large language models.
Content: ICLR Conference Coverage on Variational Autoencoders and Advanced ML Research
Podcast Context and Introduction
Latent Space podcast episode covering the International Conference on Learning Representations (ICLR) in Vienna
First of two episodes focusing on academic research presentations
Variational Autoencoders (VAEs) and Their Evolution
Historical Context and Development
Inaugural ICLR Test of Time Award went to Kingma and Welling for "Auto-encoding Variational Bayes" paper
Evolution of autoencoder technology:
- Basic autoencoders: Mapping high-dimensional signals to low-dimensional code space (Mark Kramer, 1991)
- Denoising autoencoders (2008): Adding noise to input and reconstructing
- Neural inpainting: Reconstructing missing image parts
- Helmholtz machine: Early predecessor with recognition and generative networks
VAE Technical Details
Maps input to a distribution instead of a fixed vector using two vectors:
- Mean of distribution
- Standard deviation of distribution
Loss function has two components:
- Reconstruction loss
- KL divergence (pushing latent distribution toward normal Gaussian)
Reparameterization trick:
- Solves gradient backpropagation problem in sampling nodes
- Splits sampling into learnable parameters (mu and sigma) and stochastic epsilon term
- Enables end-to-end training through gradient computation
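The reparameterization trick and the KL term above can be sketched in a few lines of NumPy; this is a minimal illustration of the mechanism, not Kingma and Welling's original code:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    The stochastic draw lives entirely in eps, so gradients can flow
    through the learnable parameters mu and log_var end to end.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) per example: the loss term
    that pushes the latent distribution toward a standard Gaussian."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

# Toy encoder outputs: batch of 4 examples, 2 latent dimensions.
mu = np.zeros((4, 2))
log_var = np.zeros((4, 2))
z = reparameterize(mu, log_var, np.random.default_rng(0))
print(z.shape)                             # (4, 2)
print(kl_to_standard_normal(mu, log_var))  # all zeros: q already matches the prior
```

In a full VAE this KL term is added to the reconstruction loss to form the evidence lower bound.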
VAE Significance and Innovation
Represents convergence of deep learning and probabilistic models
Introduced novel techniques:
- Amortized inference: inference model q(z|x) approximates true posterior p(z|x)
- Reparameterization trick
- Lower bound optimization for encoder and decoder
Developed concurrently by multiple groups (Kingma and Welling; Rezende et al.)
VAE Demonstrations and Challenges
Early demonstrations showed:
- Even distribution of data points in latent space
- Class-based clustering
- First color generative models
- Disentanglement of class labels from image style
Training challenges included:
- Reverse scale optimization
- Unstable targets when changing inference model
- Potential for posterior collapse
VAE Applications Across Domains
Recommender systems: Maps discrete user-item interactions to continuous latent space
Video compression: Natural mechanism for lossy compression
Healthcare: Dr. VAE maps gene expression data to track treatment effects
Chemistry: Bayesian optimization of molecules in continuous latent space
Genetics: Predicting disease-related protein amino acids
Astronomy: Reconstructing galaxies from gravitationally lensed images
High energy physics: Detecting anomalous particle collision events
Advanced Generative Models Research
Würstchen / Stable Cascade
Three-stage architecture for text-to-image diffusion models:
- Stage A: VQGAN providing low-level compression
- Stage B: Autoencoder with efficient encoder and powerful diffusion decoder
- Stage C: Text-conditional generation of highly compressed latents
Efficiency improvements:
- Reduces latent sequence length from 16,384 to 576, via a 42x overall spatial compression factor
- Enables faster training and inference
- Provides significant compute savings compared to previous models
Evaluation showed the Würstchen model outperforming Stable Diffusion in both automated metrics and subjective (human) evaluations
Diffusion Model Interpretability
Method to understand internal representations of concepts in diffusion models:
- Uses Stable Diffusion's CLIP vocabulary as feature prototype
- Trains lean MLP to map vocabulary tokens to coefficients
- Learns decomposition through linear combination of tokens
Reveals interesting connections models make:
- Sweet peppers generated as finger-shaped
- Camel connected to cashmere via texture and color
- Snake decomposed into host and gecko
- Models can interpolate between dual meanings (crane as bird/machine, bass as fish/guitar)
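The decomposition idea can be illustrated with a toy version: recover a concept vector as a linear combination over a small embedding vocabulary, then read off the largest coefficients. The paper trains a lean MLP against CLIP embeddings; ordinary least squares and random embeddings stand in for both here, so everything below is illustrative:

```python
import numpy as np

def decompose_concept(target, vocab_embeddings, top_k=2):
    """Approximate `target` as a linear combination of vocabulary
    embeddings and return the largest-coefficient token indices.
    (Least squares replaces the paper's learned MLP in this sketch.)
    """
    coeffs, *_ = np.linalg.lstsq(vocab_embeddings.T, target, rcond=None)
    top = np.argsort(-np.abs(coeffs))[:top_k]
    return top, coeffs[top]

rng = np.random.default_rng(0)
vocab = rng.standard_normal((5, 8))        # 5 toy token embeddings, dim 8
target = 0.7 * vocab[1] + 0.3 * vocab[4]   # concept built from two tokens
idx, c = decompose_concept(target, vocab)
print(idx.tolist(), np.round(c, 2))        # [1, 4] [0.7 0.3]
```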
Controllability and Interpretability
Research connects model interpretability with controllability
Using internal representations (attention maps) as loss functions helps address issues like models neglecting input prompt subjects
Unsupervised Learning Perspectives
Discussion on mathematical foundations for unsupervised learning
Proposed framework: Distribution matching as finding function F where F(X) distribution matches Y distribution
Compression as framework for unsupervised learning:
- Compression fundamentally equivalent to prediction
- Joint compression of datasets (X and Y) extracts shared structure
- "Algorithmic mutual information" represents shared patterns between datasets
Formalization through compression/prediction framework with minimal "regret"
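The compression-equals-prediction claim has a concrete Shannon-coding form: the total code length a predictive model assigns to a sequence is the sum of -log2 of the probabilities it gave each symbol, so a better predictor is literally a better compressor. A toy illustration:

```python
import math

def code_length_bits(probs, symbols):
    """Code length (in bits) a predictive model assigns to a sequence:
    sum of -log2 p(symbol). Better prediction => shorter code."""
    return sum(-math.log2(probs[s]) for s in symbols)

data = ["H"] * 9 + ["T"]                        # a biased coin sequence
good = {"H": 0.9, "T": 0.1}                     # model matching the bias
uniform = {"H": 0.5, "T": 0.5}                  # model with no knowledge
print(round(code_length_bits(good, data), 2))   # 4.69 bits
print(code_length_bits(uniform, data))          # 10.0 bits
```

Jointly compressing two datasets with one model shortens the total code exactly when the datasets share structure, which is the "algorithmic mutual information" mentioned above.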
Interesting findings:
- Language models trained only on text can compress images decently
- Randomizing transformer embedding tables still maintains good next-token prediction
Adversarial Machine Learning Research
ICLR Test of Time Award for paper on adversarial examples in neural networks
First paper to highlight robustness problems in deep neural networks
Key insights:
- Adversarial examples require intentional manipulation, not random misclassification
- Goal is finding minimal perturbations that change classification output
- Adversarial examples work across different model types and transfer between models
- Deeper neural networks are more susceptible to adversarial attacks
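On a linear classifier the perturbation idea reduces to a few lines. The sketch below uses a fast-gradient-sign step; note the award paper itself found minimal perturbations with L-BFGS, and the sign-based step is the simpler later variant, used here only to show the mechanism:

```python
import numpy as np

def sign_attack(w, x, y, epsilon):
    """Perturb x against the true label y for a linear score w @ x.
    Each coordinate moves by epsilon in the direction that most
    decreases the correct class's margin."""
    grad = y * w                       # gradient of the margin w.r.t. x
    return x - epsilon * np.sign(grad)

w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, -0.2, 0.8])
y = 1                                  # true label
print(np.sign(w @ x))                  # 1.0: correctly classified
x_adv = sign_attack(w, x, y, epsilon=0.5)
print(np.sign(w @ x_adv))              # -1.0: a small perturbation flips the output
```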
Vision Transformer Research
Investigation into vision transformer attention map artifacts:
- CLS token often attends strongly to few specific image patches
- These patches appear in seemingly random background areas
- Some tokens have extremely high norm values (around 500)
Proposed fix: appending dedicated "register" tokens that:
- Carry no initial image information
- Are not used in the loss calculation
- Can interact with other tokens through self-attention
- Absorb the high-norm artifacts, yielding improved attention maps and performance on various tasks
Language Model Innovations
Pause Tokens Research
Adding "pause tokens" to language models to improve performance
Integrated during pre-training stage at random positions (10% frequency)
Showed performance gains in reasoning tasks, reading comprehension, and natural language understanding benchmarks
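The insertion step is simple bookkeeping; a rough sketch, where the `<pause>` token name and the exact sampling scheme are illustrative rather than the paper's precise recipe:

```python
import random

PAUSE = "<pause>"

def insert_pause_tokens(tokens, frequency=0.1, seed=0):
    """Insert a learnable <pause> token after roughly `frequency` of
    positions, giving the model extra computation steps before it
    must commit to output tokens."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        out.append(tok)
        if rng.random() < frequency:
            out.append(PAUSE)
    return out

tokens = ["The", "capital", "of", "Austria", "is", "Vienna"]
padded = insert_pause_tokens(tokens, frequency=0.1)
assert [t for t in padded if t != PAUSE] == tokens  # original text preserved
```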
Long Context Extension Methods
LongLoRA: Extends context window using shifted sparse attention
- Splits context into groups and conducts attention individually
- Compatible with existing attention mechanisms
- Can fine-tune 7B parameter model to 100,000 tokens on 8 GPUs
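The grouping above can be sketched as index bookkeeping: standard heads attend within fixed groups, while "shifted" heads roll positions by half a group so information crosses group boundaries. A minimal sketch with an illustrative group size:

```python
import numpy as np

def attention_groups(seq_len, group_size, shift=False):
    """Partition positions into attention groups; shifted heads roll
    positions by half a group so neighbouring groups overlap."""
    pos = np.arange(seq_len)
    if shift:
        pos = np.roll(pos, -group_size // 2)
    return pos.reshape(-1, group_size)

print(attention_groups(8, 4))
# [[0 1 2 3]
#  [4 5 6 7]]
print(attention_groups(8, 4, shift=True))
# [[2 3 4 5]
#  [6 7 0 1]]   <- tokens 0-1 and 6-7 now share a group
```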
YaRN: Context window extension method
- Builds upon Positional Interpolation
- Recognizes different rotation speeds of model dimensions
- Selectively extends/interpolates dimensions
- Requires minimal fine-tuning (approximately 1% of pre-training dataset)
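The per-dimension decision can be sketched as a ramp between extrapolation and interpolation, based on how many full rotations each RoPE dimension completes within the original context; the threshold values below follow common defaults but are assumptions in this sketch:

```python
import numpy as np

def interpolation_ramp(dim, orig_ctx, base=10000.0, low=1.0, high=32.0):
    """Per-dimension interpolation weight in [0, 1].

    Fast-rotating dimensions (many rotations inside the original
    context) get 0 (left alone); slow dimensions get 1 (fully
    interpolated); dimensions in between are blended linearly.
    """
    freqs = base ** (-np.arange(0, dim, 2) / dim)     # RoPE frequencies
    rotations = orig_ctx * freqs / (2 * np.pi)        # turns per context window
    ramp = (rotations - low) / (high - low)
    return 1.0 - np.clip(ramp, 0.0, 1.0)

w = interpolation_ramp(dim=128, orig_ctx=4096)
print(w[0], w[-1])   # 0.0 1.0: fastest dim untouched, slowest fully interpolated
```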
LongRoPE (by Microsoft): Currently most promising rotary position encoding extension method
KV Cache Compression (FastGen)
Addresses memory consumption challenge in LLM inference
On-the-fly KV cache eviction algorithm
Model agnostic (applicable to any autoregressive LLM)
Achieves 40% memory reduction for the Llama 65B model
Different attention heads focus on special tokens, local context, or broad context
In most layers, special tokens can recover ~99% of attention map
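For a head profiled as "special tokens + local context", the eviction policy amounts to a keep-mask over cache positions. A simplified, illustrative version (FastGen profiles each head and chooses among several policies; only this one is shown):

```python
def kv_keep_mask(seq_len, special_positions, local_window):
    """Keep only designated special tokens plus the most recent
    `local_window` positions; every other KV entry is evicted."""
    keep = [False] * seq_len
    for p in special_positions:
        keep[p] = True
    for p in range(max(0, seq_len - local_window), seq_len):
        keep[p] = True
    return keep

mask = kv_keep_mask(10, special_positions=[0], local_window=3)
print([i for i, k in enumerate(mask) if k])            # [0, 7, 8, 9]
print(f"cache retained: {sum(mask) / len(mask):.0%}")  # cache retained: 40%
```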
State Space Models and Mamba Architecture
Alternative to transformer architecture for handling long contexts
Key characteristics:
- Fixed-sized memory in hidden state
- Sub-quadratic training complexity
- Matrix-valued hidden state with three dimensions
Mamba innovations:
- Introduces data-dependent variance by adding subscript k to model parameters
- Allows dynamic parameter adjustment at each position
- Enables filtering and state resetting
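The data-dependent recurrence can be sketched with a scalar state: the decay and input terms are computed from the current input itself, which is what lets the model filter some inputs and reset its state on others. The weights and gating choices below are illustrative, not Mamba's exact parameterization:

```python
import numpy as np

def selective_scan(x, log_a=-0.1, wb=2.0, wc=1.0):
    """Toy selective SSM: h_k = a_k * h_{k-1} + b_k * x_k, y_k = c * h_k,
    where a_k and b_k depend on x_k -- the data-dependent parameters
    that distinguish Mamba from earlier, time-invariant SSMs."""
    h, ys = 0.0, []
    for xk in x:
        gate = 1.0 / (1.0 + np.exp(-wb * xk))  # input-dependent gate
        a_k = np.exp(log_a * gate)             # decay strengthens when gate is high
        b_k = gate                             # gate near 0 => input is filtered out
        h = a_k * h + b_k * xk
        ys.append(wc * h)
    return np.array(ys)

# A strongly negative input drives its gate to ~0, so the state
# carries through unchanged and that input is effectively ignored:
y = selective_scan(np.array([1.0, -100.0, 1.0]))
```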
Applications:
- Byte-level language modeling without tokenization
- Diffusion models (DiffuSSM) using state space models instead of self-attention
- Outperforms transformers in both parameter-matched and FLOP-matched settings
Training Optimization Research
ZeRO++ optimization techniques:
- Forward path optimization using block-based quantization
- Backward path optimization with heterogeneous partitioning
- Novel all-to-all collective design for gradient handling
- Achieved over 2x speedup on InfiniBand
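Block-based quantization splits each tensor into blocks with independent scales, so small-valued blocks keep precision that a single global scale would wash out. A minimal int8 sketch of that idea (block size is illustrative, not ZeRO++'s actual configuration):

```python
import numpy as np

def block_quantize(x, block_size=4, bits=8):
    """Quantize `x` in independent blocks, each with its own scale."""
    qmax = 2 ** (bits - 1) - 1
    blocks = x.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / qmax
    scales = np.where(scales == 0, 1.0, scales)   # avoid divide-by-zero
    q = np.round(blocks / scales).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return (q * scales).reshape(-1)

# One small-valued block and one large-valued block.
x = np.array([1.0, -0.5, 0.25, 0.125, 100.0, -50.0, 25.0, 12.5])
q, s = block_quantize(x)
err = np.abs(dequantize(q, s) - x)
print(err.max() < 0.5)   # True: each block's error is bounded by its own scale
```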
DeepSpeed improvements:
- Focused on large-scale training optimization
- Recently developed ZeRO++ with communication overhead reduction
- Next focus on synchronized communication and computation overlapping
Model Evaluation Methodology
Self-Pre-Training (SPT) method for fairer evaluation of inductive bias
Pre-trains on downstream task data before fine-tuning
Improved transformer performance on Long-Range Arena benchmarks by over 30%
Challenges previous assumptions about transformer limitations in long-sequence modeling