Overview
- The Open LLM Leaderboard has evolved from an internal research project into a community-driven platform evaluating over 7,400 models, providing transparent, independent assessment in contrast to earlier non-reproducible, self-reported evaluations.
- Current benchmarks face significant limitations including dataset quality issues, contamination concerns, and rapid obsolescence as models improve, with many reaching "saturation" scores that suggest gaming rather than true capability improvements.
- Human evaluations exhibit inherent biases, with users preferring models that are assertive but potentially incorrect or that agree with their existing beliefs, highlighting the need for more diverse and multi-turn evaluation approaches.
- Evaluation methodology details matter significantly—prompt structure can cause 30-point performance variations, and computational constraints (up to $100/hour for H100 nodes) force strategic benchmark selection.
- Future leaderboard developments will likely focus on reasoning capabilities, long context understanding, coding abilities, and model calibration (correlation between confidence and correctness), with an emphasis on models being truthful rather than merely confident.
Content: Hugging Face Open LLM Leaderboard and Model Evaluation
Background and Professional Journey
- Clémentine Fourrier is a research scientist at Hugging Face and maintainer of the Open LLM Leaderboard
- Originally trained as a geologist (graduated in 2015) before transitioning to computer science
- Completed her PhD at Inria with funding from the organization
- Professional experience includes:
- Draws interesting parallels between geology and machine learning as experimental sciences
- Appreciates the vast time scales in geological research, seeing human existence as a "significant blink" in Earth's long history
Hugging Face Open LLM Leaderboard Evolution
- Started as an internal research project by the reinforcement learning team to compare published paper results
- Quickly gained community engagement and momentum
- Currently evaluates 7,400 community-submitted models
- Has hosted around 800 discussion threads and received several million visitors
- Some startups have credited leaderboard rankings with helping secure funding rounds
- Represents a shift from non-reproducible, self-reported model evaluations to transparent, independent assessment
Benchmark Selection Process
- V1 benchmarks (GSM8K, MMLU, ARC Challenge) were chosen based on the standard metrics reported in research papers at the time
- V1.5 involved community interaction to identify missing evaluation capabilities
- Ongoing collaboration with reinforcement learning teams to refine benchmark selection
- Iterative improvement based on community feedback
- Recent increased scrutiny of benchmark limitations as model performance improves
Dataset and Benchmark Quality Issues
- Many early AI datasets were created through "turking" - crowdsourced from underpaid workers who were often non-native English speakers
- When benchmarks reach "saturation" (human-level or above performance), it often indicates model contamination rather than true capability
- On MMLU, models now achieve scores in the high 80s, suggesting the benchmark is becoming less challenging
- Benchmarks quickly become outdated due to rapid AI progress
- Leaderboards drive performance improvements by motivating researchers to climb scores
Evaluation Methods and Challenges
- Three types of human evaluations identified:
- Performance benchmarking framework:
- Using large language models like GPT-4 for evaluation is not recommended due to:
- Recommended evaluation approaches:
Human Evaluation Limitations
- Wisdom of the crowd approaches work best for quantifiable tasks
- Human feedback has significant inherent biases:
- Most current evaluations (like chatbot arenas) are:
- Relying solely on human evaluation can lead to models that are sycophantic rather than factually accurate
Notable Benchmarks and Evaluation Resources
- MMLU-Pro: the headline benchmark, with 10 answer choices instead of 4 and expert review
- GPQA: PhD-level questions in scientific domains, written by experts
- IFEval: Unique benchmark focused on instruction following
- Other resources mentioned:
Evaluation Methodology Details
- Prompt structure significantly impacts model performance (up to 30-point variation)
- Most complex format (choices enumerated with letters in parentheses) performs best in MMLU
- Log likelihood method used for evaluation instead of generative approach due to:
- Computational constraints are significant:
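The two methodology points above, prompt formatting and log-likelihood scoring, can be sketched in a few lines of Python. This is an illustrative sketch, not leaderboard code: `score_fn` is a hypothetical callable (e.g. wrapping a model to return the total log probability of a continuation given a prompt), and the letters-in-parentheses format follows the MMLU variant described above.

```python
# Sketch of log-likelihood multiple-choice evaluation: instead of
# generating free text, score each candidate answer's log probability
# under the model and pick the highest. `score_fn(prompt, continuation)`
# is a hypothetical stand-in for a real model's scoring function.

def format_mmlu_prompt(question, choices):
    """Enumerate choices with letters in parentheses, the format
    reported above as performing best on MMLU."""
    lines = [question]
    for letter, choice in zip("ABCD", choices):
        lines.append(f"({letter}) {choice}")
    lines.append("Answer:")
    return "\n".join(lines)

def pick_by_loglikelihood(prompt, letters, score_fn):
    """Return the answer letter whose continuation the model scores highest."""
    scores = {c: score_fn(prompt, f" {c}") for c in letters}
    return max(scores, key=scores.get)
```

With a toy scorer that assigns " B" the highest log probability, `pick_by_loglikelihood(prompt, list("ABCD"), toy)` returns `"B"`; swapping in a different prompt format changes the scores a real model assigns, which is how the 30-point variations mentioned above arise.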
Hardest Benchmarks
- MATH benchmark (level 5 questions) - the most challenging
- MuSR (Multistep Soft Reasoning) - difficult due to long context and complex reasoning
Leaderboard Implementation Details
- Hugging Face team spent about a month carefully evaluating models
- Focused on ensuring fair, accurate, and stable evaluations
- Thoroughly checked implementation details like tokenization, formatting, and token handling
- Removed evaluations with implementation errors (e.g., the v1 DROP evaluation)
- Uses Hugging Face's research cluster with lowest priority for leaderboard jobs
- Currently no direct community compute donation mechanism
Future Plans for the Leaderboard
- Considering adding an option for community to run evaluations on inference endpoints
- Want to integrate with the EleutherAI evaluation harness
- Have an existing evaluation library (LightEval) that could provide this functionality
- Engineering challenges currently prevent immediate implementation
- Committed to scientific objectivity and transparency
Long Context and Agent Benchmarks
- Long context benchmarks highlighted:
- Critique of existing agent benchmarks as too artificial
- GAIA benchmark developed as a more realistic approach to testing AI agent capabilities:
Benchmark Quality Assessment ("Vibe Check")
- Key evaluation criteria include:
- GSM8K is praised for its constrained output format
- DROP benchmark criticized for using a bag-of-words metric
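The contrast above can be made concrete with a small sketch of the two scoring styles: a strict last-number extraction in the spirit of GSM8K's constrained answer format, versus an order-insensitive bag-of-words F1 like the metric DROP was criticized for. Both functions are illustrative simplifications, not the official benchmark implementations.

```python
import re

def extract_last_number(text):
    """GSM8K-style scoring: take the final number in the model's
    output and compare it exactly to the gold answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def bag_of_words_f1(prediction, gold):
    """DROP-style bag-of-words F1: token overlap, ignoring order,
    which is why loosely related answers can still score well."""
    pred_t, gold_t = prediction.lower().split(), gold.lower().split()
    common = sum(min(pred_t.count(t), gold_t.count(t)) for t in set(pred_t))
    if common == 0:
        return 0.0
    precision = common / len(pred_t)
    recall = common / len(gold_t)
    return 2 * precision * recall / (precision + recall)
```

For example, `bag_of_words_f1("the cat sat", "sat the cat")` scores a perfect 1.0 despite the scrambled word order, illustrating why a constrained, exactly-checkable output format is the safer design.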
Model Calibration
- Defined as correlation between log probability and answer correctness
- Represents a model's "self-confidence"
- Would enable providing confidence intervals for model responses
- Base models are currently better calibrated than instruction-tuned models
- Identified as an important area for future benchmarks
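One common way to quantify the confidence-correctness relationship described above is expected calibration error (ECE): bucket predictions by confidence, then compare each bucket's average confidence to its actual accuracy. The sketch below is a minimal, assumption-laden illustration of that idea, not a benchmark implementation.

```python
# Minimal sketch of expected calibration error: a well-calibrated model
# is right about 80% of the time when it is 80% confident. Confidences
# here would come from, e.g., the probability a model assigns to its
# chosen answer; the inputs below are illustrative.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Size-weighted average of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece
```

A model that is 90% confident but always wrong gets an ECE of 0.9 (badly miscalibrated), while one whose confidence tracks its accuracy scores near 0; a leaderboard metric like this would reward truthful uncertainty rather than mere assertiveness.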
Predictions for Next Leaderboard (v3)
- Focus areas likely to include:
- Models expected to quickly improve in areas like instruction following
- Acknowledgment of potential contamination of certain evaluation datasets (e.g., GPQA)
- Emphasis on models being assertive about factual truths and avoiding creating "thought bubbles" for users