Latent Space: The AI Engineer Podcast

ICLR 2024 — Best Papers & Talks (ImageGen, Vision, Transformers, State Space Models) ft. Durk Kingma, Christian Szegedy, Ilya Sutskever

Overview

Content: ICLR Conference Coverage on Variational Autoencoders and Advanced ML Research

Podcast Context and Introduction

Variational Autoencoders (VAEs) and Their Evolution

Historical Context and Development

- Basic autoencoders: mapping high-dimensional signals to a low-dimensional code space (Mark Kramer, 1991)
- Denoising autoencoders (2008): adding noise to the input and reconstructing the original
- Neural inpainting: reconstructing missing image parts
- Helmholtz machine: an early predecessor with separate recognition and generative networks

VAE Technical Details

- Encoder outputs the mean and standard deviation of the latent distribution
- Loss combines a reconstruction term with a KL divergence that pushes the latent distribution toward a standard Gaussian
- Reparameterization solves the gradient backpropagation problem at sampling nodes
- Splits sampling into learnable parameters (mu and sigma) and a stochastic epsilon term
- Enables end-to-end training through gradient computation (see the sketch after this list)
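To make the trick concrete, here is a minimal sketch in PyTorch, assuming a hypothetical encoder that emits a per-example mean and log-variance; it illustrates the idea rather than reproducing the original paper's code.

```python
# Minimal sketch of the reparameterization trick and the VAE loss
# (reconstruction + KL against a standard Gaussian). The encoder that
# produces mu and logvar is assumed to exist elsewhere.
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # z = mu + sigma * epsilon, with epsilon ~ N(0, I);
    # gradients flow through mu and logvar, not through epsilon.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def vae_loss(x, x_recon, mu, logvar):
    # Reconstruction term plus the analytic KL(q(z|x) || N(0, I)).
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```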

VAE Significance and Innovation

- Amortized inference: an inference model q(z|x) approximates the true posterior p(z|x)
- Reparameterization trick
- Lower-bound (ELBO) optimization jointly for encoder and decoder (written out below)
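For reference, the lower bound being maximized can be written as:

```latex
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] \;-\; \mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)
```

The first term corresponds to the reconstruction loss and the second to the KL regularizer listed above.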

VAE Demonstrations and Challenges

- Even distribution of data points in the latent space
- Class-based clustering
- Among the first color generative models
- Disentanglement of class labels from image style
- Reverse scale optimization
- Unstable targets when changing the inference model
- Potential for posterior collapse

VAE Applications Across Domains

Advanced Generative Models Research

Würstchen / Stable Cascade

- Stage A: VQGAN providing low-level compression
- Stage B: autoencoder with an efficient encoder and a powerful diffusion decoder
- Stage C: text-conditional generation of the highly compressed latents
- Reduces sequence length from 16,384 to 576 through roughly 42x spatial compression (a rough arithmetic check follows this list)
- Enables faster training and inference
- Provides significant compute savings compared to previous models
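As a sanity check on those numbers (my own back-of-the-envelope arithmetic, not from the episode), assuming 1024x1024 images and the common 8x per-side compression of earlier latent diffusion models:

```python
# Back-of-the-envelope check of the quoted sequence lengths (assumptions:
# 1024x1024 images; 8x per-side compression for a conventional latent
# diffusion model vs. ~42x per side for the highly compressed latents).
image_side = 1024
conventional = (image_side // 8) ** 2         # 128 * 128 = 16384 latent positions
highly_compressed = (image_side // 42) ** 2   # 24 * 24 = 576 latent positions
print(conventional, highly_compressed)        # 16384 576
```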

Diffusion Model Interpretability

- Uses Stable Diffusion's CLIP vocabulary as the set of feature prototypes
- Trains a lean MLP to map vocabulary tokens to coefficients (a toy version is sketched after this list)
- Learns a decomposition as a linear combination of tokens
- Sweet peppers generated as finger-shaped
- Camel connected to cashmere via texture and color
- Snake decomposed into host and gecko
- Models can interpolate between dual meanings (crane as bird/machine, bass as fish/guitar)
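A toy sketch of that decomposition idea, assuming a frozen token-embedding table and a small MLP that scores each token; the names, shapes, placeholder data, and loss here are illustrative assumptions rather than the paper's exact training setup.

```python
# Toy decomposition: score every vocabulary token with a small MLP, then
# approximate a concept embedding as a weighted linear combination of token
# embeddings. All shapes and placeholder tensors are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_embeddings = torch.randn(1000, 768)  # placeholder for the frozen CLIP token embeddings
concept = torch.randn(768)                 # placeholder for the concept being explained

coeff_mlp = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

def decompose(vocab_embeddings):
    coeffs = coeff_mlp(vocab_embeddings).squeeze(-1)   # one coefficient per token
    weights = torch.softmax(coeffs, dim=0)             # normalized combination weights
    return weights, weights @ vocab_embeddings         # weighted sum approximating the concept

weights, reconstruction = decompose(vocab_embeddings)
loss = F.mse_loss(reconstruction, concept)             # train the MLP to match the concept
```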

Controllability and Interpretability

Unsupervised Learning Perspectives

- Compression is fundamentally equivalent to prediction (a toy illustration follows this list)
- Jointly compressing two datasets (X and Y) extracts their shared structure
- "Algorithmic mutual information" represents the patterns shared between datasets
- Language models trained only on text can compress images reasonably well
- Randomizing a transformer's embedding tables still maintains good next-token prediction
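The equivalence rests on the fact that a predictor's negative log-likelihood is, up to a small constant, the number of bits an arithmetic coder needs. Here is a toy illustration of that accounting (my own example, not from the talk), with a hypothetical predict_next function:

```python
# Toy illustration of "compression = prediction": the bits needed to encode a
# sequence under a next-token predictor equal its negative log2-likelihood,
# which an arithmetic coder can approach to within about 2 bits.
import math

def code_length_bits(sequence, predict_next):
    # predict_next(prefix) -> dict mapping each possible next symbol to its probability.
    total = 0.0
    for i, symbol in enumerate(sequence):
        p = predict_next(sequence[:i])[symbol]
        total += -math.log2(p)
    return total

# A uniform "model" over 256 byte values needs 8 bits per symbol;
# better predictions mean shorter codes.
uniform = lambda prefix: {b: 1 / 256 for b in range(256)}
print(code_length_bits(b"hello", uniform))   # 40.0 bits
```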

Adversarial Machine Learning Research

- Adversarial examples require intentional manipulation, not random misclassification
- The goal is finding minimal perturbations that change the classification output (a gradient-based sketch follows this list)
- Adversarial examples work across different model types and transfer between models
- Deeper neural networks are more susceptible to adversarial attacks
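One standard way to search for such a perturbation is to step in the direction of the loss gradient's sign (FGSM). The original work used a different optimizer, so this sketch stands in for the general idea rather than the exact procedure discussed; model and label are assumed placeholders.

```python
# Sketch of crafting an adversarial example by taking a small step in the
# direction of the loss gradient's sign. The aim is a tiny perturbation
# that flips the model's prediction.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=0.01):
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Move each pixel by +/- epsilon in the direction that increases the loss.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```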

Vision Transformer Research

- The CLS token often attends strongly to a few specific image patches
- These patches appear in seemingly random background areas
- Some tokens have extremely high norm values (around 500)
- The proposed register tokens carry no initial image information, are not used in the loss calculation, and can interact with other tokens through self-attention (sketched after this list)
- Result: improved attention maps and performance on various tasks
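A hedged sketch of that fix: append a few learnable tokens to the patch sequence so high-norm "scratch" activations have somewhere to live, then drop them before any prediction head. The wrapper below is my own simplification; the paper's integration details may differ.

```python
# Sketch of adding learnable register tokens to a ViT's token sequence.
# They carry no image information, participate in self-attention, and are
# discarded before the output head.
import torch
import torch.nn as nn

class WithRegisters(nn.Module):
    def __init__(self, encoder, dim, num_registers=4):
        super().__init__()
        self.encoder = encoder                           # assumed: any transformer over (B, N, dim) tokens
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))

    def forward(self, patch_tokens):                     # patch_tokens: (B, N, dim), CLS included
        b = patch_tokens.size(0)
        tokens = torch.cat([patch_tokens, self.registers.expand(b, -1, -1)], dim=1)
        out = self.encoder(tokens)
        return out[:, : patch_tokens.size(1)]            # drop register outputs before the head
```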

Language Model Innovations

Pause Tokens Research

Long Context Extension Methods

- Splits the context into groups and conducts attention within each group individually
- Compatible with existing attention mechanisms
- Can fine-tune a 7B-parameter model to 100,000 tokens of context on 8 GPUs
- Builds upon positional interpolation (sketched after this list)
- Recognizes that different model dimensions rotate at different speeds
- Selectively extends or interpolates dimensions accordingly
- Requires minimal fine-tuning (approximately 1% of the pre-training dataset)
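For reference, the positional-interpolation step these methods build on simply rescales rotary position indices so a longer context maps back into the trained range; a minimal sketch, assuming standard RoPE frequencies and a hypothetical trained length:

```python
# Sketch of positional interpolation for RoPE: positions beyond the trained
# context are linearly rescaled to fit the original range, so attention sees
# familiar rotation angles. Base and trained length are assumptions.
import torch

def rope_angles(seq_len, dim, trained_len=4096, base=10000.0):
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(seq_len).float()
    if seq_len > trained_len:
        positions = positions * (trained_len / seq_len)   # interpolate instead of extrapolating
    return torch.outer(positions, inv_freq)               # (seq_len, dim/2) rotation angles
```

The per-dimension selective scaling described in the list is a refinement of this uniform rescaling.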

KV Cache Compression (FastGen)

State Space Models and Mamba Architecture

- Fixed-size memory in the hidden state
- Sub-quadratic training complexity
- Matrix-valued hidden state with three dimensions
- Introduces data dependence by adding a subscript k to the model parameters
- Allows dynamic parameter adjustment at each position (a simplified recurrence follows this list)
- Enables filtering and state resetting
- Byte-level language modeling without tokenization
- Diffusion models (DiffuSSM) using state space models instead of self-attention
- Outperforms transformers in both parameter-matched and FLOP-matched settings
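A simplified sketch of that selective recurrence: the update and readout parameters are computed from the current input, so the fixed-size state can be kept, overwritten, or effectively reset position by position. The projections and shapes below are my own assumptions and are much simpler than Mamba's actual parameterization.

```python
# Simplified selective state-space recurrence: the update and readout
# parameters depend on the input at each step, so the model can choose to
# keep, overwrite, or ignore information in its fixed-size hidden state.
import torch

def selective_ssm(x, proj_a, proj_b, proj_c, state_dim):
    # x: (seq_len, d_in); proj_*: assumed linear layers producing per-step parameters.
    h = torch.zeros(state_dim)
    outputs = []
    for x_k in x:                              # O(seq_len) scan with constant memory
        a_k = torch.sigmoid(proj_a(x_k))       # per-position decay in (0, 1); near 0 resets the state
        b_k = proj_b(x_k)                      # how the current input enters the state
        c_k = proj_c(x_k)                      # how the state is read out
        h = a_k * h + b_k
        outputs.append((c_k * h).sum())
    return torch.stack(outputs)
```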

Training Optimization Research

- Forward path optimization using block-based quantization (a sketch follows this list)
- Backward path optimization with heterogeneous partitioning
- Novel all-to-all collective design for gradient handling
- Achieved over 2x speedup on InfiniBand
- Focused on large-scale training optimization
- Recently developed ZeRO++ with reduced communication overhead
- Next focus: overlapping synchronized communication with computation
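A minimal sketch of the block-based quantization idea: each small block of values keeps its own scale, so communication volume drops while per-block dynamic range is preserved. The int8 target and block size here are my illustrative choices, not necessarily what ZeRO++ uses.

```python
# Sketch of block-wise quantization: split a tensor into fixed-size blocks,
# store one scale per block, and send int8 payloads instead of fp16/fp32.
import torch

def blockwise_quantize(tensor, block_size=256):
    flat = tensor.flatten()
    pad = (-flat.numel()) % block_size
    flat = torch.nn.functional.pad(flat, (0, pad))
    blocks = flat.view(-1, block_size)
    scales = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    quantized = torch.clamp((blocks / scales).round(), -127, 127).to(torch.int8)
    return quantized, scales            # dequantize with quantized * scales
```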

Model Evaluation Methodology
