World Models & General Intuition: Khosla's largest bet since LLMs & OpenAI

Key Takeaways

General Intuition (GI) utilizes 3.8 billion game clips from Medal to train advanced world models.
GI's vision-based agents demonstrate human-like and superhuman gameplay, transferring learning to real-world video.
Medal's privacy-preserving, action-labeled retroactive clips provide a unique dataset for world model development.
GI secured $134 million from Khosla Ventures, opting for independence over a reported $500 million OpenAI acquisition offer.
World models are defined by their understanding of physics and prediction of full outcomes, distinct from video generation.
LLMs are considered orchestrators, while spatial-temporal foundation models are posited as core to physical interactions.
GI's "frames in, actions out" API can replace brittle scripted game behaviors and control robots that are operated with game controllers.
The company emphasizes "negative events" in data for reinforcement learning, accelerating AI development.
By 2030, GI envisions its spatial-temporal models driving 80% of all AI-powered 'atoms-to-atoms' interactions.

General Intuition (GI) is a spin-off from Medal, a game clipping platform with 12 million users and 3.8 billion game clips.
GI aims to build world models using this extensive data, a frontier positioned beyond Large Language Models (LLMs).
The company raised a $134 million seed round from Khosla Ventures, declining a reported $500 million offer from OpenAI.
Early vision-based agents, trained via imitation learning, predict actions from video frames, progressing to human-like and superhuman performance.
These agents, trained on peak human gameplay highlights, scale significantly with more data and compute.
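
The imitation-learning setup described above (predict the action a human took from what was on screen) can be illustrated with a deliberately tiny sketch. This is not GI's model: the string observations stand in for encoded video frames, the function name `fit_behavior_cloning` is hypothetical, and a frequency table replaces the neural network a real system would use.

```python
from collections import Counter, defaultdict


def fit_behavior_cloning(demos):
    """Return a policy mapping each observation to its most frequent demonstrated action.

    demos: iterable of (observation, action) pairs; observations must be hashable.
    """
    counts = defaultdict(Counter)
    for obs, action in demos:
        counts[obs][action] += 1
    # For each observation, keep the action demonstrators chose most often.
    return {obs: c.most_common(1)[0][0] for obs, c in counts.items()}


# Hypothetical toy demonstrations; labels stand in for encoded frames.
demos = [
    ("enemy_left", "turn_left"),
    ("enemy_left", "turn_left"),
    ("enemy_left", "fire"),
    ("wall_ahead", "jump"),
]
policy = fit_behavior_cloning(demos)
print(policy["enemy_left"])  # turn_left
```

The point of the sketch is only the shape of the problem: supervised pairs of observation and action go in, and a policy that imitates the demonstrators comes out.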

GI's models can label actions in any internet video using a 'frames in, actions out' approach, predicting keyboard and mouse actions.
The system has successfully transferred its learning from less realistic games to more realistic ones and to real-world video.
Transferring AI agent capabilities to real-world video highlights challenges in obtaining ground truth data.
The company is focused on distilling large policies into tiny real-time models that still navigate, hide, and peek corners like real players.
Real-world applications, even with YouTube data, require solving for pose estimation and potentially inverse dynamics, contrasting with simulated environments.
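
Inverse dynamics, mentioned above, means inferring the action that explains the change between two consecutive frames. The sketch below shows the idea on a toy 1-D "frame" with a single agent marker. It is an illustration under stated assumptions: a real inverse dynamics model would be learned from video, and the helper names here are hypothetical.

```python
def agent_position(frame):
    """Index of the single agent marker (1) in a 1-D toy frame."""
    return frame.index(1)


def infer_action(prev_frame, next_frame):
    """Infer the action that explains the change between two consecutive frames."""
    delta = agent_position(next_frame) - agent_position(prev_frame)
    if delta > 0:
        return "right"
    if delta < 0:
        return "left"
    return "noop"


def label_clip(frames):
    """Label every transition in an unlabeled clip with an inferred action."""
    return [infer_action(a, b) for a, b in zip(frames, frames[1:])]


clip = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 0],
]
labels = label_clip(clip)
print(labels)  # ['right', 'noop', 'left']
```

A clip of N frames yields N-1 action labels, which is why unlabeled video can become training data once such a labeler exists.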

Medal's success stems from its native video recording and overlay features, accumulating extensive action-labeled footage from millions of creators.
The platform prioritizes privacy by logging in-game player actions rather than raw keystrokes, which simplifies the training data and avoids privacy concerns.
This privacy-preserving data collection, combined with features like event capture, made Medal's dataset valuable for world models.
Medal's key innovation was its retroactive video recording, capturing gameplay in memory and exporting only desired sequences, similar to Tesla's bug reporting.
The platform's popularity accelerated during COVID-19, boosted by games like Fortnite and integration with Discord.
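
The retroactive recording described above (keep recent gameplay in memory, export only the sequence the player wants) is commonly built on a fixed-size ring buffer. The following is a minimal sketch of that general technique, not Medal's implementation; the class name `RetroactiveRecorder` and its methods are hypothetical.

```python
from collections import deque


class RetroactiveRecorder:
    """Keep only the most recent frames in memory; export a clip on demand."""

    def __init__(self, max_frames):
        # deque with maxlen silently discards the oldest entry when full.
        self._buffer = deque(maxlen=max_frames)

    def push(self, frame, action):
        """Store one frame together with the action taken at that moment."""
        self._buffer.append((frame, action))

    def export_clip(self, last_n):
        """Return the last `last_n` buffered (frame, action) pairs, oldest first."""
        if last_n <= 0:
            return []
        return list(self._buffer)[-last_n:]


recorder = RetroactiveRecorder(max_frames=5)
for t in range(8):
    recorder.push(frame=f"frame{t}", action=f"action{t}")

clip = recorder.export_clip(last_n=3)
print(clip)  # [('frame5', 'action5'), ('frame6', 'action6'), ('frame7', 'action7')]
```

Because nothing is written out until the player asks for a clip, memory use is bounded by `max_frames` and only the chosen moments ever leave the buffer.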

General Intuition's team draws on research projects such as Genie, SIMA, and DIAMOND in LMs and transformer-based models.
SIMA research showed that an agent trained on nine games could navigate a held-out tenth game about as well as an agent trained solely on that game (a 9-1 holdout split).
The team aims to predict action embeddings from vision input for broader input transferability, moving beyond traditional game controller inputs.
The DIAMOND paper's ability to run a world model on a consumer GPU with limited data garnered significant interest from AI labs.
GI aims to replicate the 'internet-scale' data advantage of LLMs by leveraging its own vast interaction data for spatial-temporal agent development.
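
The 9-1 holdout evaluation mentioned above can be sketched as a leave-one-game-out split: train on nine games, test on the one left out. The game names and the function `leave_one_out_splits` below are placeholders for illustration, not the actual games or code used in the research.

```python
def leave_one_out_splits(games):
    """Yield (train_games, held_out_game) for every game in the list."""
    for i, held_out in enumerate(games):
        train = games[:i] + games[i + 1:]
        yield train, held_out


# Placeholder names; the real evaluation used actual game titles.
games = [f"game_{i}" for i in range(1, 11)]
splits = list(leave_one_out_splits(games))

train, held_out = splits[-1]
print(len(train), held_out)  # 9 game_10
```

Comparing the held-out score against an agent trained only on that game is what measures transfer, as opposed to memorization of a single title.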

Vinod Khosla made the $134 million seed investment in General Intuition, described as his largest bet since OpenAI, after rigorously questioning the founders' long-term vision.
GI declined a reported $500 million offer from OpenAI for its proprietary video game data, opting for independence.
Founders with proprietary datasets are advised to model the data themselves to understand its capabilities or secure significant equity when licensing.
As model capabilities increase, the need for ground truth data may diminish, potentially making equity a more stable form of compensation.
GI decided to build its models internally to ensure alignment with its core business and avoid conflicts with industry partners.

World models are defined as systems that understand and predict a full range of outcomes based on current states and actions, incorporating physics and interactions.
This approach contrasts with video models that primarily focus on predicting likely or entertaining sequences.
The guest explains that the compute complexity of simulations increases rapidly with the number of agents and their degrees of freedom.
In stochastic environments, world models and video transfer are favored over traditional simulation engines because they can be steered via text.
The guest contrasts their frame-based world model approach with Yann LeCun's splat-based method, preferring frames for alignment with training data and discrete action prediction.
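
The distinction drawn above, a world model predicting the full range of outcomes for a state and action versus a video model picking one likely continuation, can be made concrete with a toy example. Everything here is an assumption for illustration: a 1-D track, a made-up slip probability `SLIP`, and hypothetical function names.

```python
SLIP = 0.2  # assumed probability that an intended move fails and the agent stays put


def predict_outcomes(position, action, size=5):
    """World-model view: return {next_position: probability} for a state and action."""
    step = {"left": -1, "right": 1, "stay": 0}[action]
    target = min(max(position + step, 0), size - 1)  # walls clamp movement
    outcomes = {}
    outcomes[target] = outcomes.get(target, 0.0) + (1 - SLIP)
    outcomes[position] = outcomes.get(position, 0.0) + SLIP
    return outcomes


def most_likely_outcome(position, action):
    """Single-continuation view: keep only the most probable next state."""
    outcomes = predict_outcomes(position, action)
    return max(outcomes, key=outcomes.get)


full = predict_outcomes(2, "right")   # both the successful move and the slip
single = most_likely_outcome(2, "right")
print(full, single)
```

The second function discards the low-probability slip, which is exactly the kind of rare outcome an agent needs to see if it is to learn from it.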

The guest argues that Large Language Models (LLMs) are limited as a primary backbone for dynamic real-world environments because they are autoregressive and not spatial-temporal.
LLMs are viewed as valuable orchestrators, but spatial-temporal foundation models are argued as essential for handling complex, continuous interactions.
Speech and text generation are expected to become integrated actions within broader AI systems.
The world modeling approach is pursued partly because text generation is becoming commoditized, enabling focus on other AI development areas.
The long-term vision involves spatial-temporal agents using text reasoning from LLMs alongside spatial perception to solve complex scientific challenges in 3D.

General Intuition is working with major game developers and engines to replace deterministic player controllers with a 'frames in, actions out' API.
This enables AI agents to perceive and act within game environments, mimicking expert gamer intuition for human-like behaviors.
The technology extends beyond bots to general agents capable of playing any game in real time, with behaviors demonstrated in GTA V and Truck Simulator.
Game developers are interested in using sophisticated AI bots to improve player retention during off-peak hours and provide engaging experiences.
The technology also aims to reduce the data needed to train robots operated with game controllers by moving that training from pre-training to post-training.
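
A 'frames in, actions out' interface of the kind described above might look roughly like the sketch below. GI's actual API is not described in the source, so every name here (`Frame`, `Action`, `FramePolicy`, `BrightestColumnPolicy`, `run_episode`) is hypothetical, and the hand-written policy is only a stand-in for a learned model.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence

Frame = Sequence[Sequence[int]]  # toy grayscale frame: rows of pixel intensities


@dataclass(frozen=True)
class Action:
    move_x: float  # -1.0 (full left) to 1.0 (full right)
    move_y: float
    fire: bool = False


class FramePolicy(Protocol):
    """Anything that maps one frame to one action."""

    def act(self, frame: Frame) -> Action: ...


class BrightestColumnPolicy:
    """Stand-in policy: steer toward the brightest column of the frame."""

    def act(self, frame: Frame) -> Action:
        width = len(frame[0])
        column_sums = [sum(row[c] for row in frame) for c in range(width)]
        target = column_sums.index(max(column_sums))
        center = (width - 1) / 2
        move_x = 0.0 if center == 0 else (target - center) / center
        return Action(move_x=move_x, move_y=0.0)


def run_episode(policy: FramePolicy, frames):
    """Game-loop sketch: the engine supplies frames, the policy returns actions."""
    return [policy.act(frame) for frame in frames]


frames = [
    [[0, 0, 9], [0, 0, 9]],  # bright on the right
    [[9, 0, 0], [9, 0, 0]],  # bright on the left
]
actions = run_episode(BrightestColumnPolicy(), frames)
print([a.move_x for a in actions])  # [1.0, -1.0]
```

Because the game loop depends only on the `act` method, a scripted controller and a learned model are interchangeable behind the same interface.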

Medal's role as 'episodic memory for simulation' is central, aiming to make every clip playable within a world model.
This capability is key to progressing from imitation learning to reinforcement learning by analyzing billions of clips, especially those containing "negative events" (e.g., truck accidents).
General Intuition is pursuing open research, announcing a partnership with the open science lab Kyutai in Paris and seeking university collaborations.
The company is making an ambitious long-term bet on world models, emphasizing its data advantage for rapid learning and development.
By 2030, GI aims to be the standard for intelligence, with its spatial-temporal models driving 80% of AI-powered 'atoms-to-atoms' interactions, particularly in simulation due to fewer constraints.
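
One way to picture how "negative events" could feed reinforcement learning is to turn event tags on clips into reward labels. This is a sketch under loud assumptions: the tag vocabulary, the clip dictionaries, and the reward values are invented for illustration and do not reflect Medal's schema or GI's training setup.

```python
NEGATIVE_EVENTS = {"crash", "death"}  # hypothetical tag names, not Medal's schema


def label_reward(clip):
    """Assign -1.0 to clips containing a negative event, 0.0 otherwise."""
    return -1.0 if NEGATIVE_EVENTS.intersection(clip["events"]) else 0.0


def negative_clips(clips):
    """Return the ids of clips that contain at least one negative event."""
    return [clip["id"] for clip in clips if label_reward(clip) < 0]


# Invented metadata for three clips.
clips = [
    {"id": "clip_a", "events": ["overtake"]},
    {"id": "clip_b", "events": ["crash"]},
    {"id": "clip_c", "events": ["win", "death"]},
]
print(negative_clips(clips))  # ['clip_b', 'clip_c']
```

Highlight-only data shows what success looks like; clips tagged this way supply the failure cases that a reward signal needs.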