Key Takeaways
- General Intuition (GI) utilizes 3.8 billion game clips from Medal to train advanced world models.
- GI's vision-based agents demonstrate human-like and superhuman gameplay, transferring learning to real-world video.
- Medal's privacy-preserving, action-labeled retroactive clips provide a unique dataset for world model development.
- GI secured $134 million from Khosla Ventures, opting for independence over a reported $500 million OpenAI acquisition offer.
- World models are defined by their understanding of physics and prediction of full outcomes, distinct from video generation.
- LLMs are viewed as orchestrators, while spatial-temporal foundation models are positioned as the core of physical interaction.
- GI's "frames in, actions out" API can replace brittle scripted game behaviors and control robots that are operated with game controllers.
- The company emphasizes clips containing "negative events" (such as crashes) as a data source for moving from imitation learning to reinforcement learning.
- By 2030, GI envisions its spatial-temporal models driving 80% of all AI-powered 'atoms-to-atoms' interactions.
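
The "frames in, actions out" idea in the takeaways above can be pictured as a very small interface: recent video frames go in, a keyboard-and-mouse action comes out. The sketch below is only an illustration under stated assumptions. The names `Action`, `Policy`, `BrightnessPolicy`, and `run_episode` are hypothetical and are not GI's actual API, and the toy policy stands in for a learned model.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence

# Hypothetical frame type: one grayscale frame as rows of pixel intensities.
Frame = Sequence[Sequence[float]]


@dataclass(frozen=True)
class Action:
    """Hypothetical keyboard-and-mouse action; field names are illustrative."""
    keys: frozenset
    mouse_dx: float
    mouse_dy: float


class Policy(Protocol):
    """Anything that maps a window of recent frames to one action."""
    def act(self, frames: Sequence[Frame]) -> Action: ...


class BrightnessPolicy:
    """Toy stand-in for a learned model: turn toward the brighter half of the latest frame."""

    def act(self, frames):
        latest = frames[-1]
        width = len(latest[0])
        left = sum(sum(row[: width // 2]) for row in latest)
        right = sum(sum(row[width // 2:]) for row in latest)
        dx = 1.0 if right > left else -1.0 if left > right else 0.0
        return Action(keys=frozenset({"w"}), mouse_dx=dx, mouse_dy=0.0)


def run_episode(policy, frame_stream, context=4):
    """Feed a sliding window of the most recent frames to the policy and collect its actions."""
    window, actions = [], []
    for frame in frame_stream:
        window.append(frame)
        window = window[-context:]  # keep only the last `context` frames
        actions.append(policy.act(window))
    return actions
```

Because the caller depends only on `act`, a scripted controller could in principle be swapped for a learned policy without changing the surrounding game loop, which is the substitution the takeaway describes.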
Deep Dive
- General Intuition (GI) is a spin-off from Medal, a game clipping platform with 12 million users and 3.8 billion game clips.
- GI aims to use this data to build world models, which it positions as the next frontier beyond Large Language Models (LLMs).
- The company raised a $134 million seed round from Khosla Ventures, declining a reported $500 million offer from OpenAI.
- Early vision-based agents, trained via imitation learning, predict actions from video frames, progressing to human-like and superhuman performance.
- These agents, trained on peak human gameplay highlights, scale significantly with more data and compute.
- GI's models can label actions in any internet video using a 'frames in, actions out' approach, predicting keyboard and mouse actions.
- The system has successfully transferred its learning from less realistic games to more realistic ones and to real-world video.
- Transferring AI agent capabilities to real-world video highlights challenges in obtaining ground truth data.
- The company is focused on distilling large policies into tiny real-time models that still navigate, hide, and peek corners like real players.
- Real-world applications, even with YouTube data, require solving for pose estimation and potentially inverse dynamics, contrasting with simulated environments.
- Medal's success stems from its native video recording and overlay features, accumulating extensive action-labeled footage from millions of creators.
- The platform prioritizes privacy by logging player actions directly, rather than keystrokes, simplifying training data and avoiding privacy concerns.
- This privacy-preserving data collection, combined with features like event capture, made Medal's dataset valuable for world models.
- Medal's key innovation was its retroactive video recording, capturing gameplay in memory and exporting only desired sequences, similar to Tesla's bug reporting.
- The platform's popularity accelerated during COVID-19, boosted by games like Fortnite and integration with Discord.
- General Intuition's team draws on research projects such as Genie, SIMA, and DIAMOND, along with experience in language models and transformer-based models.
- The SIMA research showed that an agent trained on nine games could navigate a tenth, held-out game about as well as an agent trained solely on that game (a 9-1 holdout setup).
- The team aims to predict action embeddings from vision input for broader input transferability, moving beyond traditional game controller inputs.
- The DIAMOND paper showed that a world model could run on a consumer GPU with limited data, which drew significant interest from AI labs.
- GI aims to replicate the 'internet-scale' data advantage of LLMs by using its own vast interaction data to develop spatial-temporal agents.
- Vinod Khosla made General Intuition's $134 million seed investment, his largest since OpenAI, after rigorously questioning the founders' long-term vision.
- GI declined a reported $500 million offer from OpenAI for its proprietary video game data, opting for independence.
- Founders with proprietary datasets are advised to model the data themselves to understand its capabilities or secure significant equity when licensing.
- As model capabilities increase, the need for ground truth data may diminish, potentially making equity a more stable form of compensation.
- GI decided to build its models internally to ensure alignment with its core business and avoid conflicts with industry partners.
- World models are defined as systems that understand and predict a full range of outcomes based on current states and actions, incorporating physics and interactions.
- This approach contrasts with video models that primarily focus on predicting likely or entertaining sequences.
- The guest explains that the compute complexity of simulations increases rapidly with the number of agents and their degrees of freedom.
- In stochastic environments, video transfer and world models are favored over traditional simulation engines because they can be steered via text.
- The guest contrasts their frame-based world model approach with Yann LeCun's splat-based method, preferring frames for alignment with training data and discrete action prediction.
- The guest argues that Large Language Models (LLMs) are limited as a primary backbone for dynamic real-world environments because they are autoregressive and not spatial-temporal.
- LLMs are viewed as valuable orchestrators, but spatial-temporal foundation models are argued to be essential for handling complex, continuous interactions.
- Speech and text generation are expected to become integrated actions within broader AI systems.
- The world modeling approach is pursued partly because text generation is becoming commoditized, enabling focus on other AI development areas.
- The long-term vision involves spatial-temporal agents using text reasoning from LLMs alongside spatial perception to solve complex scientific challenges in 3D.
- General Intuition is working with major game developers and engines to replace deterministic player controllers with a 'frames in, actions out' API.
- This enables AI agents to perceive and act within game environments, mimicking expert gamer intuition for human-like behaviors.
- The technology extends beyond bots to create general agents capable of playing any game in real-time, mirroring behaviors in GTA5 and Truck Simulator.
- Game developers are interested in using sophisticated AI bots to improve player retention during off-peak hours and provide engaging experiences.
- The technology also aims to reduce the data needed to train controller-based robots by moving that training from pre-training to post-training.
- Medal's role as 'episodic memory for simulation' is central, aiming to make every clip playable within a world model.
- This capability is key to progressing from imitation learning to reinforcement learning by analyzing billions of clips, especially those containing "negative events" (e.g., truck accidents).
- General Intuition is pursuing open research, announcing a partnership with the Paris-based open science lab Kyutai and seeking university collaborations.
- The company is making an ambitious long-term bet on world models, emphasizing its data advantage for rapid learning and development.
- By 2030, GI aims to be the standard for intelligence, with its spatial-temporal models driving 80% of AI-powered 'atoms-to-atoms' interactions, particularly in simulation due to fewer constraints.
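
The retroactive recording described above (keep recent gameplay in memory, export only the wanted sequence) is commonly built on a fixed-size ring buffer. The following is a minimal sketch of that general technique, not Medal's implementation. The class name `RetroactiveRecorder`, its parameters, and the pairing of each frame with a logged action are assumptions made for illustration.

```python
from collections import deque


class RetroactiveRecorder:
    """Hypothetical in-memory recorder: keeps only the most recent frames and actions."""

    def __init__(self, fps, buffer_seconds):
        self.fps = fps
        # A deque with maxlen silently drops the oldest entry when full.
        self._buffer = deque(maxlen=fps * buffer_seconds)

    def push(self, frame, action):
        """Store one frame together with the player action logged for it."""
        self._buffer.append((frame, action))

    def export_clip(self, last_seconds):
        """Return the last `last_seconds` of (frame, action) pairs, oldest first."""
        n = min(len(self._buffer), self.fps * last_seconds)
        return list(self._buffer)[len(self._buffer) - n:]


recorder = RetroactiveRecorder(fps=2, buffer_seconds=3)
for i in range(10):
    recorder.push(frame=i, action=f"a{i}")
clip = recorder.export_clip(last_seconds=2)  # the four most recent pairs
```

Nothing is written to disk until `export_clip` is called, so only the moments a player chooses to save ever leave memory.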
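
The 9-1 holdout evaluation mentioned above (train on nine games, test on the tenth) can be sketched as a leave-one-game-out split. The helper name `leave_one_out_splits` and the placeholder game names are hypothetical, and this shows only the data split, not the training or evaluation itself.

```python
def leave_one_out_splits(games):
    """Yield (training_games, held_out_game) pairs, holding out each game once."""
    for i, held_out in enumerate(games):
        yield games[:i] + games[i + 1:], held_out


# Placeholder names; the actual games are not listed in this summary.
games = [f"game_{i}" for i in range(10)]
for train_games, held_out in leave_one_out_splits(games):
    # An agent would be trained on `train_games` and evaluated on `held_out`,
    # then compared with a specialist trained only on `held_out`.
    assert len(train_games) == 9 and held_out not in train_games
```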