Overview
- The Open LLM Leaderboard has evolved from an internal research project into a community-driven platform evaluating over 7,400 models, providing transparent, independent assessment in contrast to earlier non-reproducible, self-reported evaluations.
- Current benchmarks face significant limitations including dataset quality issues, contamination concerns, and rapid obsolescence as models improve, with many reaching "saturation" scores that suggest gaming rather than true capability improvements.
- Human evaluations exhibit inherent biases, with users preferring models that are assertive but potentially incorrect or that agree with their existing beliefs, highlighting the need for more diverse and multi-turn evaluation approaches.
- Evaluation methodology details matter significantly—prompt structure can cause 30-point performance variations, and computational constraints (up to $100/hour for H100 nodes) force strategic benchmark selection.
- Future leaderboard developments will likely focus on reasoning capabilities, long context understanding, coding abilities, and model calibration (correlation between confidence and correctness), with an emphasis on models being truthful rather than merely confident.
Content: Hugging Face Open LLM Leaderboard and Model Evaluation
Background and Professional Journey
- Clémentine Fourrier is a research scientist at Hugging Face and maintainer of the Open LLM Leaderboard
- Originally trained as a geologist (graduated in 2015) before transitioning to computer science
- Completed her PhD at Inria with funding from the organization
- Professional experience includes:
- Draws interesting parallels between geology and machine learning as experimental sciences
- Appreciates the vast time scales in geological research, seeing human existence as a "significant blink" in Earth's long history
Hugging Face Open LLM Leaderboard Evolution
- Started as an internal research project by the reinforcement learning team to compare published paper results
- Quickly gained community engagement and momentum
- Currently evaluates 7,400 community-submitted models
- Has hosted around 800 discussion threads and received several million visitors
- Some startups have credited leaderboard rankings with helping secure funding rounds
- Represents a shift from non-reproducible, self-reported model evaluations to transparent, independent assessment
Benchmark Selection Process
- V1 benchmarks (GSM8K, MMLU, ARC Challenge) were chosen based on the standard metrics reported in research papers at the time
- V1.5 involved community interaction to identify missing evaluation capabilities
- Ongoing collaboration with reinforcement learning teams to refine benchmark selection
- Iterative improvement based on community feedback
- Recent increased scrutiny of benchmark limitations as model performance improves
Dataset and Benchmark Quality Issues
- Many early AI datasets were created through "turking" - crowdsourced from underpaid workers who were often non-native English speakers
- When benchmarks reach "saturation" (human-level or above performance), it often indicates model contamination rather than true capability
- On MMLU, models now achieve scores in the high 80s, suggesting the benchmark is becoming less challenging
- Benchmarks quickly become outdated due to rapid AI progress
- Leaderboards drive performance improvements by motivating researchers to climb scores
Evaluation Methods and Challenges
- Three types of human evaluations identified:
- Performance benchmarking framework:
- Using large language models like GPT-4 for evaluation is not recommended due to:
- Recommended evaluation approaches:
Human Evaluation Limitations
- Wisdom of the crowd approaches work best for quantifiable tasks
- Human feedback has significant inherent biases:
- Most current evaluations (like chatbot arenas) are:
- Relying solely on human evaluation can lead to models that are sycophantic rather than factually accurate
Notable Benchmarks and Evaluation Resources
- MMLU-Pro: the headline benchmark, with 10 answer choices instead of 4 and expert review
- GPQA: PhD-level questions in scientific domains, written by experts
- IFEval: Unique benchmark focused on instruction following
- Other resources mentioned:
Evaluation Methodology Details
- Prompt structure significantly impacts model performance (up to 30-point variation)
- Most complex format (choices enumerated with letters in parentheses) performs best in MMLU
- Log likelihood method used for evaluation instead of generative approach due to:
- Computational constraints are significant:
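The two methodology points above, prompt formatting and log-likelihood scoring, can be sketched in a few lines of Python. This is an illustrative sketch, not leaderboard code: `score_fn` is a hypothetical callable (e.g. wrapping a model to return the total log probability of a continuation given a prompt), and the letters-in-parentheses format follows the MMLU variant described above.

```python
# Sketch of log-likelihood multiple-choice evaluation: instead of
# generating free text, score each candidate answer's log probability
# under the model and pick the highest. `score_fn(prompt, continuation)`
# is a hypothetical stand-in for a real model's scoring function.

def format_mmlu_prompt(question, choices):
    """Enumerate choices with letters in parentheses, the format
    reported above as performing best on MMLU."""
    lines = [question]
    for letter, choice in zip("ABCD", choices):
        lines.append(f"({letter}) {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

def pick_by_loglikelihood(prompt, letters, score_fn):
    """Return the answer letter whose continuation the model scores highest."""
    scores = {c: score_fn(prompt, f" {c}") for c in letters}
    return max(scores, key=scores.get)
```

With a toy scorer that assigns " B" the highest log probability, `pick_by_loglikelihood(prompt, list("ABCD"), toy)` returns `"B"`; swapping in a different prompt format changes the scores a real model assigns, which is how the 30-point variations mentioned above arise.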
Hardest Benchmarks
- MATH benchmark (level 5 questions) - the most challenging
- MuSR (Multistep Soft Reasoning) - difficult due to long context and complex reasoning
Leaderboard Implementation Details
- Hugging Face team spent about a month carefully evaluating models
- Focused on ensuring fair, accurate, and stable evaluations
- Thoroughly checked implementation details like tokenization, formatting, and token handling
- Removed evaluations with implementation errors (e.g., the v1 DROP evaluation)
- Uses Hugging Face's research cluster with lowest priority for leaderboard jobs
- Currently no direct community compute donation mechanism
Future Plans for the Leaderboard
- Considering adding an option for community to run evaluations on inference endpoints
- Want to integrate with the EleutherAI evaluation harness
- Have an existing evaluation library (LightEval) that could provide this functionality
- Engineering challenges currently prevent immediate implementation
- Committed to scientific objectivity and transparency
Long Context and Agent Benchmarks
- Long context benchmarks highlighted:
- Critique of existing agent benchmarks as too artificial
- GAIA benchmark developed as a more realistic approach to testing AI agent capabilities:
Benchmark Quality Assessment ("Vibe Check")
- Key evaluation criteria include:
- GSM8K is praised for its constrained output format
- DROP benchmark criticized for using a bag-of-words metric
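The contrast above can be made concrete with a small sketch of the two scoring styles: a strict last-number extraction in the spirit of GSM8K's constrained answer format, versus an order-insensitive bag-of-words F1 like the metric DROP was criticized for. Both functions are illustrative simplifications, not the official benchmark implementations.

```python
import re

def extract_last_number(text):
    """GSM8K-style scoring: take the final number in the model's
    output and compare it exactly to the gold answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def bag_of_words_f1(prediction, gold):
    """DROP-style bag-of-words F1: token overlap, ignoring order,
    which is why loosely related answers can still score well."""
    pred_t, gold_t = prediction.lower().split(), gold.lower().split()
    common = sum(min(pred_t.count(t), gold_t.count(t)) for t in set(pred_t))
    if common == 0:
        return 0.0
    precision = common / len(pred_t)
    recall = common / len(gold_t)
    return 2 * precision * recall / (precision + recall)
```

For example, `bag_of_words_f1("the cat sat", "sat the cat")` scores a perfect 1.0 despite the scrambled word order, illustrating why a constrained, exactly-checkable output format is the safer design.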
Model Calibration
- Defined as correlation between log probability and answer correctness
- Represents a model's "self-confidence"
- Would enable providing confidence intervals for model responses
- Base models are currently better calibrated than instruction-tuned models
- Identified as an important area for future benchmarks
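One common way to quantify the confidence-correctness relationship described above is expected calibration error (ECE): bucket predictions by confidence, then compare each bucket's average confidence to its actual accuracy. The sketch below is a minimal, assumption-laden illustration of that idea, not a benchmark implementation.

```python
# Minimal sketch of expected calibration error: a well-calibrated model
# is right about 80% of the time when it is 80% confident. Confidences
# here would come from, e.g., the probability a model assigns to its
# chosen answer; the inputs below are illustrative.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted average of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece
```

A model that is 90% confident but always wrong gets an ECE of 0.9 (badly miscalibrated), while one whose confidence tracks its accuracy scores near 0; a leaderboard metric like this would reward truthful uncertainty rather than mere assertiveness.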
Predictions for Next Leaderboard (v3)
- Focus areas likely to include:
- Models expected to quickly improve in areas like instruction following
- Acknowledgment of potential contamination of certain evaluation datasets (e.g., GPQA)
- Emphasis on models being assertive about factual truths and avoiding creating "thought bubbles" for users