Key Takeaways
- ARC-AGI redefines AI intelligence measurement beyond memorization.
- It focuses on an AI's ability to learn new, unfamiliar tasks efficiently.
- ARC benchmarks, unlike others, are designed for human solvability.
- Major AI labs are now adopting ARC-AGI as a standard metric.
- Future ARC-AGI versions will test learning efficiency in interactive environments.
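The "learn new, unfamiliar tasks" framing above is concrete in the original ARC benchmark: each task is a few-shot grid puzzle, distributed as JSON with a handful of "train" input/output pairs that demonstrate an unknown rule, plus held-out "test" inputs. A minimal sketch follows; the specific task and the `rotate_180` rule are invented for illustration, but the data shape (2D lists of integers 0–9) matches the published format.

```python
# A made-up ARC-style task: a few demonstration pairs plus a test input.
# Grids are 2D lists of integers 0-9, each integer denoting a color.
task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 0], [0, 1]]},
        {"input": [[0, 2], [0, 0]], "output": [[0, 0], [2, 0]]},
    ],
    "test": [{"input": [[0, 0], [3, 0]]}],
}

def rotate_180(grid):
    """Candidate rule: rotate the grid 180 degrees."""
    return [row[::-1] for row in grid[::-1]]

# A solver must infer the rule from the train pairs alone; here we just
# check that our candidate rule reproduces every demonstration pair.
assert all(rotate_180(p["input"]) == p["output"] for p in task["train"])
print(rotate_180(task["test"][0]["input"]))  # → [[0, 3], [0, 0]]
```

The point of the format is that memorization cannot help: each task's rule is novel, so the solver must acquire the skill from two or three examples.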
Deep Dive
- The ARC Prize Foundation is a tech-forward nonprofit focused on advancing general intelligence in AI.
- It defines intelligence as the ability to learn new things efficiently, a concept proposed by François Chollet in 2019.
- This contrasts with traditional AI benchmarks that focus on task-specific performance or increasing problem difficulty.
- The ARC-AGI benchmark tests an AI's ability to learn new things, a capability that can be measured in both humans and machines.
- Early Large Language Models (LLMs) scored approximately 4-5% on ARC benchmarks in 2019.
- Performance increased dramatically to roughly 21% after newer models introduced a reasoning paradigm.
- Major AI labs, including OpenAI and xAI, are now using ARC-AGI in their model releases.
- The foundation aims to inspire broader research while remaining mindful that the benchmark could become a vanity metric.
- ARC-AGI-1 (2019) and ARC-AGI-2 (March 2025) are static benchmarks.
- ARC-AGI-3, planned for release next year, will introduce approximately 150 interactive, game-like environments.
- These environments will lack explicit instructions, requiring test-takers to deduce goals through action and feedback.
- ARC-AGI-3 will include only games that the general public can solve, in contrast to benchmarks that escalate difficulty to expose the limits of AI generalization.
- True intelligence evaluation should consider metrics beyond just accuracy, such as the time and data needed to acquire skills.
- Greg Kamradt, the foundation's president, posits that factors such as the amount of training data and energy consumption should also be included in evaluating intelligence.
- ARC-AGI-3 will measure efficiency by comparing the number of actions an AI takes in turn-based video-game environments against average human performance.
- Achieving 100% on ARC-AGI benchmarks would be strong evidence of generalization but would not be sufficient proof of AGI, emphasizing the need for continued research.
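The efficiency idea above can be sketched as a simple score that compares an agent's action count in each environment against the average human count. The function names and the exact scoring formula here are illustrative assumptions, not the foundation's published specification.

```python
def efficiency_score(agent_actions: int, human_avg_actions: float) -> float:
    """Score in (0, 1]: 1.0 if the agent matched or beat the average
    human action count, lower the more actions it needed."""
    if agent_actions <= 0 or human_avg_actions <= 0:
        raise ValueError("action counts must be positive")
    return min(1.0, human_avg_actions / agent_actions)

def aggregate(results: list[tuple[bool, int]], human_avgs: list[float]) -> float:
    """Average per-environment score; unsolved environments score 0,
    so raw accuracy still matters but efficiency discounts slow wins."""
    scores = [
        efficiency_score(actions, avg) if solved else 0.0
        for (solved, actions), avg in zip(results, human_avgs)
    ]
    return sum(scores) / len(scores)

# Example: three environments where humans average 12, 30, and 8 actions.
# The agent solves the first quickly, the second slowly, and fails the third.
print(aggregate([(True, 10), (True, 45), (False, 8)], [12.0, 30.0, 8.0]))
```

Folding action counts into the score captures the article's broader point: an agent that brute-forces a game in thousands of moves demonstrates less intelligence than one that matches human economy of action.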