
Latent Space: The AI Engineer Podcast

Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge

Overview

Content: Hugging Face OpenLLM Leaderboard and Model Evaluation

Background and Professional Journey

- Working on illness prediction in brain research before moving into NLP research
- Joining Hugging Face after being contacted by Meta
- Initially working on pre-trained graph transformer models before shifting to model evaluation

Hugging Face OpenLLM Leaderboard Evolution

Benchmark Selection Process

Dataset and Benchmark Quality Issues

Evaluation Methods and Challenges

1. "Vibe check" evaluations 2. Human-based ratings (like LMSys chatbot arena) 3. Paid human expert evaluations - 25% - Random chance - 50% - Average human performance - 75% - Expert human performance - 90% - Potential cheating/gaming the system - Closed source nature - Lack of reproducibility - Tendency towards verbose and biased responses - Use smaller models like Prometheus or Judge LM for rankings - Avoid asking models for precise numerical scores - Conduct personal "Vibe checks" for specific use cases

Human Evaluation Limitations

- Preference for models that agree with the user (Anthropic's sycophancy paper)
- Favoring assertive (but potentially false) answers (Cohere/University of Edinburgh research)
- Limited demographic diversity of annotators
- Single-turn interactions that lack diversity and do not adequately test multi-turn capabilities

Notable Benchmarks and Evaluation Resources

- A benchmark described as "unit tests, but for language", evaluating precise instruction understanding with strict instruction formatting (see the sketch below)
- OpenRouter for spotting trending models
- Wolfram Ravenwolf on Hugging Face, known for detailed LLM evaluation threads
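A minimal illustration of the "unit tests, but for language" idea: instructions whose satisfaction can be verified programmatically, with no judge model involved. The specific checks below are invented examples in that style, not items from the benchmark itself.

```python
# Strict, verifiable instruction checks: each function tests one formatting
# instruction and returns pass/fail, exactly like a unit test assertion.

def check_exact_bullets(response: str, n: int = 3) -> bool:
    """Instruction: 'Answer with exactly n bullet points.'"""
    bullets = [line for line in response.splitlines() if line.strip().startswith("- ")]
    return len(bullets) == n

def check_no_commas(response: str) -> bool:
    """Instruction: 'Do not use any commas in your answer.'"""
    return "," not in response

def check_ends_with(response: str, suffix: str = "Thank you.") -> bool:
    """Instruction: 'End your answer with the exact phrase "Thank you."'"""
    return response.rstrip().endswith(suffix)

response = "- First point\n- Second point\n- Third point\nThank you."
checks = [check_exact_bullets(response), check_no_commas(response), check_ends_with(response)]
print(f"Passed {sum(checks)}/{len(checks)} strict-format checks")
```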

Evaluation Methodology Details

- Lower computational cost
- Easier parallelization
- H100 nodes cost around $100/hour
- Evaluating a 7B model can take up to 20 hours (see the rough cost estimate below)
- Compute limitations force strategic benchmark selection
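Putting the quoted figures together gives a back-of-the-envelope upper bound on per-model cost; whether the ~$100/hour refers to a single H100 or a full node is an assumption here.

```python
# Rough worst-case cost of evaluating one model, using the figures quoted above.
node_cost_per_hour = 100   # USD, approximate H100 node price
eval_hours_7b = 20         # worst-case wall-clock time quoted for a 7B model

cost_per_model = node_cost_per_hour * eval_hours_7b
print(f"~${cost_per_model:,} per 7B model in the worst case")  # ~$2,000
```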

Hardest Benchmarks

- Murder mystery reasoning scenarios
- Few models perform better than random chance

Leaderboard Implementation Details

Future Plans for the Leaderboard

Long Context and Agent Benchmarks

- A linguistics benchmark using a grammar book for a low-resource language (Kalamang)
- An Allen AI benchmark using recent novels, with questions requiring comprehension of the full book
- Tests models on real-world tasks
- Uses multiple tools and reasoning across different modalities
- Provides a replicable methodology for creating similar benchmarks

Benchmark Quality Assessment ("Vibe Check")

- Dataset origin (human vs. model generated)
- Annotator quality and compensation
- Underlying assumptions of the dataset
- Prompt quality and consistency (see the inspection sketch below)
- Testing across different model sizes
- Examining generation quality
- Assessing evaluation metrics
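A hands-on "vibe check" of a benchmark can start with simply reading samples and running a few sanity statistics. The sketch below uses the Hugging Face `datasets` library; the dataset name and column names are placeholders for whatever benchmark is being audited.

```python
from collections import Counter

from datasets import load_dataset

# Placeholder dataset/columns: substitute the benchmark you actually want to audit.
ds = load_dataset("some_org/some_benchmark", split="test")

# 1. Read a handful of raw examples: are the prompts well-formed and consistent?
for example in ds.select(range(5)):
    print(example)

# 2. Quick sanity statistics: duplicated questions or wildly skewed labels
#    often point to annotation or generation problems.
questions = ds["question"]
print("duplicates:", len(questions) - len(set(questions)))
print("label balance:", Counter(ds["answer"]).most_common(5))
```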

Model Calibration

Predictions for Next Leaderboard (v3)

- Reasoning capabilities
- Math evaluations
- Long context understanding
- Coding abilities
- Potential sycophancy evaluation
