Key Takeaways
- Marble is a new generative model that creates interactive 3D worlds from text or images.
- Spatial intelligence, distinct from language, is crucial for AI's real-world understanding and interaction.
- Integrating physics into world models is a key challenge for AI beyond simple pattern recognition.
- Academia faces resource constraints in pursuing foundational AI research and hardware innovation.
- Transformers are fundamentally set models, not sequence models, due to their attention mechanism.
Deep Dive
- Fei-Fei Li, creator of ImageNet, co-founded World Labs with her former PhD student Justin Johnson.
- They launched Marble, a generative model designed to create 3D worlds from text or images for practical use.
- Li and Johnson reunited to build world models after Johnson's stints as a professor and as a Meta researcher.
- The discussion questions whether AI models truly 'understand' physics or merely replicate patterns from data.
- A thought experiment proposed feeding astrophysical data to an AI to see whether it could derive Newtonian laws; the discussion suggested it might predict motion accurately yet struggle to arrive at abstract principles like F = ma.
- This highlights a divergence between deep learning's pattern fitting and human intelligence's grasp of causal laws.
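The divergence above can be sketched numerically. In this toy example (synthetic data, not from the discussion), ordinary least squares fits noisy (mass, acceleration, force) observations almost perfectly, yet the "model" is just a weight vector over a hand-chosen feature basis; the symbolic law F = ma is never represented, only a weight near 1 on a feature that happens to encode it.

```python
import numpy as np

# Synthetic "observations" obeying F = m*a with small measurement noise.
rng = np.random.default_rng(42)
m = rng.uniform(1, 10, 1000)              # masses
a = rng.uniform(-5, 5, 1000)              # accelerations
F = m * a + rng.normal(0, 0.01, 1000)     # noisy forces

# A fixed feature basis stands in for "pattern fitting". The product
# term m*a is included, so prediction succeeds -- but the fitter only
# finds weights, not the causal law itself.
X = np.column_stack([m, a, m * a, np.ones_like(m)])
w, *_ = np.linalg.lstsq(X, F, rcond=None)

# The weight on the m*a feature lands near 1; the others near 0.
print(np.round(w, 3))
```

If the m*a feature were absent from the basis, prediction error would stay large no matter how much data was supplied, which is one way to frame the gap between curve fitting and grasping a law.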
- There is concern that academia mimics industry by concentrating on training the largest AI models.
- Under-resourcing hinders exploration of innovative, foundational research and theoretical underpinnings.
- Discussions include the potential for new computational primitives beyond matrix multiplication for future hardware scaling.
- Fei-Fei Li's team extended image captioning to 'dense captioning' in a 2016 paper.
- This system drew bounding boxes around objects and generated captions for each part of a scene.
- A live web demo streamed real-time predictions from a server in California to a conference in Santiago, Chile.
- Marble is introduced as the first generative model for 3D worlds that allows user interaction and export in various formats.
- It aims to be a practical product for industries such as gaming and visual effects (VFX).
- Marble features precise camera control, recording capabilities, and enables real-time client-side rendering.
- Current Marble generations primarily utilize Gaussian splats as their fundamental data structure.
- Future research explores integrating physical properties and simulation directly into Gaussian splats.
- Potential methods include predicting physical properties for each splat or dynamically regenerating scenes.
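One of the two directions above, predicting physical properties per splat, can be sketched as a data structure. This is a hypothetical illustration, not World Labs' actual representation: the standard splat fields (position, covariance, color, opacity) are extended with assumed physical attributes (mass, velocity, friction) and stepped with a naive explicit-Euler integrator.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class PhysicalSplat:
    # Standard Gaussian-splat fields:
    position: np.ndarray    # 3D center of the Gaussian
    covariance: np.ndarray  # 3x3 covariance (shape and orientation)
    color: np.ndarray       # RGB
    opacity: float
    # Hypothetical per-splat physical properties a model might predict:
    mass: float = 1.0
    velocity: np.ndarray = field(default_factory=lambda: np.zeros(3))
    friction: float = 0.5

def integrate(splats, gravity=np.array([0.0, -9.81, 0.0]), dt=1 / 60):
    """Advance each splat one explicit-Euler step under gravity."""
    for s in splats:
        s.velocity = s.velocity + gravity * dt
        s.position = s.position + s.velocity * dt

# Example: a single splat falls under gravity.
s = PhysicalSplat(position=np.zeros(3), covariance=np.eye(3),
                  color=np.ones(3), opacity=1.0)
integrate([s])
```

The alternative mentioned, regenerating the scene each step, would instead treat the generative model itself as the simulator, re-predicting splats conditioned on the previous state.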
- Spatial intelligence is defined as the ability to reason, understand, move, and interact in space.
- It is considered complementary to linguistic intelligence, aligning with Howard Gardner's theory of multiple intelligences.
- Human spatial intelligence evolved over 540 million years, compared to language's estimated 500,000 years.
- Language models often fail benchmark questions involving physical impossibilities, as they lack an internal 3D world representation.
- Multimodal models, like Marble, are considered essential for achieving social intelligence and beneficial for user interaction.
- Practical AI applications benefit from multimodal inputs, even if vision-only or spatial-only models might remain academic exercises.
- The discussion clarifies that Transformers are fundamentally considered set models, not sequence models.
- This is attributed to their permutation-equivariant attention mechanism and per-token operations.
- Order within Transformers is primarily introduced through positional embeddings, rather than inherent sequential processing.
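The set-model claim can be verified directly: a minimal single-head self-attention with random weights and no positional embeddings is permutation-equivariant, meaning permuting the input tokens permutes the outputs identically. The implementation below is a bare-bones sketch, not any production Transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5  # token dimension, number of tokens
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X):
    # Single-head self-attention with NO positional embeddings.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(d))
    return A @ V

X = rng.standard_normal((n, d))
perm = rng.permutation(n)

# Permuting the input rows permutes the output rows the same way:
assert np.allclose(attention(X[perm]), attention(X)[perm])
```

Adding positional embeddings (e.g. `attention(X + pos)`) breaks this equivariance, which is exactly how order is injected into an otherwise order-blind set operation.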