Key Takeaways
- World Labs, founded by Fei-Fei Li and Justin Johnson, focuses on spatial intelligence and world models.
- Marble, their first product, generates editable 3D environments from various inputs using Gaussian splats.
- Spatial intelligence is positioned as the next frontier beyond language models, crucial for 3D world understanding.
- The discussion probes whether AI models genuinely understand physics or merely fit patterns in data.
- Academia is severely under-resourced compared to industry, hindering foundational AI research.
- Transformers are clarified as set models, not sequence models, impacting future architectural designs.
- Marble enables use cases from game environments and VFX to architectural design and robotics training.
- Future AI systems aim to integrate spatial and language intelligence for multimodal capabilities.
- A millionfold increase in compute since Li's PhD enables processing of large spatial datasets.
Deep Dive
- The ingredients behind AlexNet (ImageNet data, GPUs, and neural networks) are drawn as a parallel to the ingredients world models will need.
- Data availability and a millionfold increase in compute power since Fei-Fei Li's PhD enable processing larger spatial and world datasets.
- Fei-Fei Li expresses concern about academia's under-resourcing compared to industry, advocating for public sector AI work and a national AI compute cloud.
- Academia's role should focus on exploring new algorithms, architectures, and theoretical underpinnings, rather than training the largest models.
- Limits of current GPU scaling, such as diminishing performance-per-watt gains between the Hopper and Blackwell generations, suggest a need for long-term academic research into new approaches.
- Fei-Fei Li and Andrej Karpathy's early work combined CNNs for image representation with LSTMs for language generation.
- Their CVPR 2015 paper used a CNN-LSTM to generate a single descriptive sentence for an image.
- The team advanced to 'dense captioning' in a 2016 CVPR paper, using a fully convolutional localization network to describe multiple objects within a scene.
- A real-time captioning demo streamed from California to Santiago, Chile, showcased the system's functionality at one frame per second.
- A Harvard paper showed an LLM could predict orbital patterns but not force vectors, highlighting a gap between pattern-fitting and causal physics.
- The debate centers on whether AI can achieve genuine causal reasoning or merely pattern fitting from data.
- For applications like virtual backdrops, plausible rendering suffices; for architectural design, however, models must understand physical properties, which remains a challenge for current systems.
- World Labs, co-founded by Fei-Fei Li and Justin Johnson, introduced Marble as their first product.
- Marble is a best-in-class model that generates high-fidelity 3D worlds from multimodal inputs, with interactive editing capabilities.
- The system uses Gaussian splats for real-time rendering on various devices and offers precise camera control, differentiating it from frame-by-frame generation models.
- While Marble currently uses individual Gaussian splats, future architectures could integrate physics by attaching physical properties or using spring-like couplings.
- Methods explored for dynamics include predicting physical properties or regenerating entire scenes.
- The discussion considers computational demands and the role of classical physics engines versus learned simulations in 3D scene generation.
- Spatial intelligence is defined as a distinct form of intelligence from linguistic intelligence, inspired by Howard Gardner's theory.
- It is crucial for interacting with the 3D world, exemplified by tasks like deducing DNA structure or grasping a mug.
- Language is presented as a low-bandwidth channel for describing the rich 3D/4D world, contrasted with high-bandwidth visual and spatial perception optimized over millions of years.
- LLMs excel at abstract language tasks but may bypass crucial embodied, spatial understanding, as shown by benchmarks requiring physical reasoning.
- Multimodal approaches, in which language models accept spatial inputs such as those Marble produces, are presented as a viable path forward.
- Future AI systems aim to converge spatial and language intelligence, with language interaction remaining pragmatically useful.
- The discussion clarifies that transformers are fundamentally set models, not sequence models.
- Their operations are token-wise or permutation-equivariant through the attention mechanism.
- Order is typically injected into transformers via positional embeddings.
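The set-model point about transformers can be verified numerically: self-attention without positional embeddings is permutation-equivariant (permuting the input tokens just permutes the output rows), and adding positional embeddings breaks that symmetry. A minimal NumPy sketch, not tied to any particular transformer implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over token matrix X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.normal(size=(n, d))                  # n tokens, d dims
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
P = np.eye(n)[rng.permutation(n)]            # random permutation matrix

# Without positions: permuting inputs permutes outputs identically,
# i.e. attention treats the tokens as a set.
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(P @ X, Wq, Wk, Wv)
assert np.allclose(P @ out, out_perm)

# With positional embeddings added, the symmetry is broken:
# order is now injected into the model.
pos = rng.normal(size=(n, d))
out_pos = self_attention(X + pos, Wq, Wk, Wv)
out_pos_perm = self_attention(P @ X + pos, Wq, Wk, Wv)
assert not np.allclose(P @ out_pos, out_pos_perm)
```

The first assertion is exactly the permutation-equivariance property; the second shows why positional embeddings are the standard way to reintroduce sequence order.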
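The idea floated in the Marble discussion of attaching physical properties to Gaussian splats and coupling them with spring-like constraints can be sketched as follows. This is a hypothetical illustration, not Marble's actual representation: the `mass`, `velocity`, and `spring_step` names are invented here, and only the first five fields correspond to standard 3D Gaussian splat parameters.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Splat:
    # Standard 3D Gaussian splat parameters
    position: np.ndarray   # (3,) center of the Gaussian
    scale: np.ndarray      # (3,) per-axis extent
    rotation: np.ndarray   # (4,) orientation quaternion
    color: np.ndarray      # (3,) RGB
    opacity: float
    # Hypothetical physical properties (assumed, not part of splat formats)
    mass: float = 1.0
    velocity: np.ndarray = field(default_factory=lambda: np.zeros(3))

def spring_step(a: Splat, b: Splat, rest_len: float, k: float, dt: float):
    """One explicit-Euler step of a spring-like coupling between two splats."""
    delta = b.position - a.position
    dist = np.linalg.norm(delta)
    force = k * (dist - rest_len) * delta / dist  # Hooke's law along the link
    a.velocity = a.velocity + (force / a.mass) * dt
    b.velocity = b.velocity - (force / b.mass) * dt
    a.position = a.position + a.velocity * dt
    b.position = b.position + b.velocity * dt

# Two splats stretched past the rest length relax toward each other.
a = Splat(np.zeros(3), np.ones(3), np.array([1.0, 0, 0, 0]), np.ones(3), 1.0)
b = Splat(np.array([2.0, 0, 0]), np.ones(3), np.array([1.0, 0, 0, 0]),
          np.ones(3), 1.0)
spring_step(a, b, rest_len=1.0, k=10.0, dt=0.01)
assert np.linalg.norm(b.position - a.position) < 2.0
```

The alternative mentioned in the discussion, regenerating entire scenes per timestep with a learned model, avoids hand-written dynamics like this but at much higher computational cost, which is the trade-off against classical physics engines noted above.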