
Understanding AI Benchmarks: What Those Scores Actually Mean
AI companies love to boast about benchmark scores, but what do MMLU, HumanEval, and GSM8K actually measure? A researcher's guide to interpreting AI performance claims.
Why this matters: Companies use benchmarks to market their models. Understanding what these numbers mean helps you make better decisions about which AI to use.
The Problem with AI Benchmarks
Every few months, a new AI model launches with claims of "state-of-the-art performance" on various benchmarks. GPT-4 beats GPT-3.5. Claude 3.5 Sonnet outperforms Claude 3 Opus. Gemini Ultra tops the charts.
But here's what the press releases don't emphasise: benchmarks measure specific, narrow tasks under controlled conditions. Real-world usefulness is far more complex.
This isn't to say benchmarks are useless—they're not. But understanding their limitations helps you interpret claims more critically and choose the right model for your actual needs.
Common AI Benchmarks Explained
Let's break down the benchmarks you'll see most often in AI announcements:
MMLU (Massive Multitask Language Understanding)
What it measures: Knowledge across 57 academic subjects, from elementary maths to professional law and medicine.
Format: Multiple-choice questions with four options.
What it reveals: Breadth of factual knowledge and basic reasoning within academic domains.
Limitations
Multiple-choice format is easier than open-ended questions. Memorisation can inflate scores. Doesn't test practical application of knowledge.
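If you're curious what that looks like under the hood, here's a minimal sketch of how a multiple-choice benchmark like MMLU is typically scored. The question, choices, and model_answer function are made-up placeholders, not real MMLU items.

```python
# Minimal sketch of multiple-choice scoring, MMLU-style.
# The question and model_answer() are hypothetical placeholders.

questions = [
    {
        "subject": "high_school_physics",
        "question": "What is the SI unit of force?",
        "choices": ["A. Joule", "B. Newton", "C. Watt", "D. Pascal"],
        "answer": "B",
    },
]

def model_answer(question: dict) -> str:
    """Stand-in for a real model call; returns one of A/B/C/D."""
    return "B"

correct = sum(model_answer(q) == q["answer"] for q in questions)
accuracy = correct / len(questions)
print(f"Accuracy: {accuracy:.1%}")  # the headline MMLU number is this, averaged over 57 subjects
```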
HumanEval
What it measures: Code generation ability—specifically, whether the model can write Python functions that pass unit tests.
Format: 164 programming problems with test cases.
What it reveals: Basic programming competence and ability to understand problem specifications.
Limitations
Problems are self-contained and relatively simple. Doesn't test understanding of large codebases, debugging, or architectural decisions. Python-only.
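Evaluation here is refreshingly mechanical: the generated code either passes the hidden unit tests or it doesn't. Here's a rough sketch of that check; the problem and solution shown are illustrative rather than actual HumanEval items.

```python
# Sketch of HumanEval-style checking: a generated solution either passes
# the unit tests or it doesn't. The problem and solution here are
# illustrative, not actual HumanEval items.

generated_code = """
def has_close_elements(numbers, threshold):
    return any(abs(a - b) < threshold
               for i, a in enumerate(numbers)
               for b in numbers[i + 1:])
"""

test_code = """
assert has_close_elements([1.0, 2.0, 3.9], 0.3) is False
assert has_close_elements([1.0, 2.8, 3.0], 0.3) is True
"""

namespace = {}
try:
    exec(generated_code, namespace)   # define the candidate function
    exec(test_code, namespace)        # run the unit tests against it
    passed = True
except Exception:
    passed = False

print("pass" if passed else "fail")   # per-problem result; the score is the fraction that pass
```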
GSM8K (Grade School Math)
What it measures: Mathematical reasoning through word problems at a grade school level.
Format: 8,500 word problems requiring multi-step reasoning.
What it reveals: Basic arithmetic reasoning and ability to translate language into mathematical operations.
Limitations
Problems follow predictable patterns. High scores don't indicate advanced mathematical reasoning. Models can pattern-match without true understanding.
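Grading is usually just as blunt: harnesses typically extract the final number from the model's worked answer and compare it to the reference. A small sketch, with a made-up problem and model output:

```python
# Sketch of GSM8K-style grading: only the final number matters.
# The model output below is an invented multi-step answer.
import re

model_output = (
    "The baker sells 12 loaves at 3 pounds each, so 12 * 3 = 36. "
    "She spends 10 pounds on flour, so her profit is 36 - 10 = 26."
)
gold_answer = "26"

# Take the last number in the response as the model's final answer.
numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output.replace(",", ""))
predicted = numbers[-1] if numbers else None

print("correct" if predicted == gold_answer else "incorrect")
```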
HellaSwag
What it measures: Common-sense reasoning about everyday situations and activities.
Format: Sentence completion where the model chooses the most plausible continuation.
What it reveals: Understanding of typical sequences of events in human activities.
Limitations
The adversarial wrong answers were generated and filtered using older models, so newer models find them increasingly easy to rule out. Near-ceiling performance makes it less useful for comparing frontier models.
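For completeness, here's roughly how a HellaSwag-style item is scored: the model picks the ending it finds most plausible. The item below is invented, and score_continuation is a placeholder for a real log-probability call in an evaluation harness.

```python
# Sketch of HellaSwag-style evaluation: pick the ending the model finds
# most plausible. score_continuation() is a placeholder for a real
# per-token log-likelihood call; the item itself is invented.

item = {
    "context": "A man pours batter into a waffle iron and closes the lid. He",
    "endings": [
        "waits, then lifts out a golden waffle.",
        "throws the waffle iron into the sink.",
        "begins to iron his shirt with it.",
        "swims across the kitchen floor.",
    ],
    "label": 0,
}

def score_continuation(context: str, ending: str) -> float:
    """Placeholder: a real harness would return the model's log-probability
    of `ending` given `context`."""
    return 0.0  # dummy value so the sketch runs

scores = [score_continuation(item["context"], e) for e in item["endings"]]
prediction = scores.index(max(scores))
print("correct" if prediction == item["label"] else "incorrect")
```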
GPQA (Graduate-Level Google-Proof Q&A)
What it measures: Expert-level knowledge in biology, physics, and chemistry.
Format: Multiple-choice questions designed to be difficult even for experts outside the question's specific subfield, and resistant to quick web searches ("Google-proof").
What it reveals: Deep domain expertise beyond what's easily searchable.
Limitations
Small dataset (448 questions). Narrow domain coverage. Expert-level doesn't mean practically useful.
Why Benchmark Scores Can Be Misleading
Beyond individual benchmark limitations, there are systemic issues with how benchmarks are used:
1. Training Data Contamination
Models trained on internet-scale data may have seen benchmark questions during training. This isn't cheating per se, but it means the benchmark may be testing memorisation rather than reasoning ability.
Some labs actively test for contamination; others don't disclose their methodology. When a model performs suspiciously well on a specific benchmark, contamination is worth considering.
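One common, if imperfect, check is looking for long n-gram overlaps between benchmark questions and training text. A rough sketch of the idea, with illustrative thresholds and data:

```python
# Rough sketch of an n-gram overlap contamination check, in the spirit of
# what some labs report. The n-gram length, question, and training text
# are illustrative.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_contaminated(benchmark_question: str, training_chunk: str,
                       n: int = 8) -> bool:
    """Flag the question if it shares any long n-gram with the training text."""
    return bool(ngrams(benchmark_question, n) & ngrams(training_chunk, n))

question = "A train travels 60 miles in 1.5 hours. What is its average speed?"
training_chunk = ("forum post: a train travels 60 miles in 1.5 hours. "
                  "what is its average speed? easy, 40 mph")
print(looks_contaminated(question, training_chunk))  # True: the question appears verbatim
```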
2. Benchmark Gaming
When benchmarks become targets, they stop being good measures. Labs can optimise specifically for benchmark performance through:
- Fine-tuning on similar question formats
- Prompt engineering specifically for benchmark tasks
- Selecting which benchmarks to report based on favourable results
- Choosing evaluation parameters (temperature, number of attempts) that maximise scores
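The last point is worth seeing in numbers. The HumanEval paper's pass@k estimator measures the chance that at least one of k sampled attempts passes the tests, so the same model looks far stronger at pass@10 than at pass@1. The sample counts below are made up.

```python
# Why "number of attempts" matters: the unbiased pass@k estimator from the
# HumanEval paper. Same model, same samples: reporting pass@10 instead of
# pass@1 can look dramatically better. Numbers below are hypothetical.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (drawn from n generations,
    of which c are correct) passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

n_samples, n_correct = 100, 20       # hypothetical: 20 of 100 generations pass
print(round(pass_at_k(n_samples, n_correct, 1), 2))   # 0.2
print(round(pass_at_k(n_samples, n_correct, 10), 2))  # ~0.9
```

Two announcements can quote very different headline numbers on the same benchmark simply by choosing k differently, which is why the evaluation conditions matter as much as the score.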
3. The Gap Between Tests and Real Tasks
Consider what benchmarks don't measure:
- Consistency: Does the model give the same answer to the same question?
- Instruction following: Does it actually do what you ask, or what it thinks you should ask?
- Graceful failure: Does it admit uncertainty, or confidently hallucinate?
- Context handling: How well does it use information from long documents?
- Real-world complexity: How does it handle ambiguous, messy, incomplete information?
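Some of these gaps are easy to probe yourself. Consistency, for instance, takes a few lines: ask the same question repeatedly and measure agreement. ask_model here is a placeholder for whichever API or local model you use.

```python
# Quick consistency probe: ask the same question several times and see how
# often the answers agree. ask_model() is a placeholder for a real model call.
from collections import Counter

def ask_model(prompt: str) -> str:
    """Placeholder for an actual API request or local inference call."""
    return "42"

prompt = "What is 6 times 7? Answer with just the number."
answers = [ask_model(prompt) for _ in range(10)]

most_common, count = Counter(answers).most_common(1)[0]
print(f"Most common answer: {most_common!r} ({count}/10 runs)")
```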
A model can score highly on MMLU while being frustrating to use in practice. Conversely, a model with lower benchmark scores might be more helpful for your specific use case.
Better Ways to Evaluate AI Models
If benchmarks are imperfect, how should you evaluate which AI to use?
1. Test on Your Actual Tasks
The best evaluation is running models on the work you actually need done. Create a set of representative tasks from your real workflow:
- Sample emails you need to write
- Code problems you've solved before
- Documents you need to summarise
- Questions your users actually ask
Compare how different models handle these specific cases. Your personal benchmark will be more relevant than any public leaderboard.
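A personal benchmark doesn't need to be fancy. Below is a sketch of one possible harness; run_model, the tasks, and the pass checks are placeholders you'd swap for your own models and criteria.

```python
# Sketch of a personal benchmark: your own tasks, your own pass criteria.
# run_model() and the task list are placeholders for your setup.

def run_model(model_name: str, prompt: str) -> str:
    """Placeholder for calling whichever models you're comparing."""
    return "Dear team, the meeting is moved to 3pm on Friday. Thanks!"

tasks = [
    {
        "name": "reschedule email",
        "prompt": "Write a two-sentence email moving Friday's meeting to 3pm.",
        "check": lambda out: "3pm" in out and len(out.split()) < 60,
    },
    {
        "name": "summarise report",
        "prompt": "Summarise this report in one sentence: ...",
        "check": lambda out: out.count(".") <= 2,
    },
]

for model in ["model-a", "model-b"]:
    passed = sum(t["check"](run_model(model, t["prompt"])) for t in tasks)
    print(f"{model}: {passed}/{len(tasks)} tasks passed")
```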
2. Look for Human Preference Evaluations
Some evaluations use human judges to compare model outputs. These are messier than automated benchmarks but often more predictive of real-world usefulness.
The LMSYS Chatbot Arena, for instance, has humans compare responses from different models without knowing which is which. These "blind taste tests" can reveal preference patterns that benchmarks miss.
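Under the hood, arena-style leaderboards turn those pairwise votes into ratings. Here's a simplified Elo-style update to show the idea; real leaderboards use more careful statistical models, and the model names and votes below are invented.

```python
# Simplified Elo-style update: how pairwise "which answer is better?" votes
# can be turned into a leaderboard. Real arena leaderboards use more careful
# statistical models; this is just the intuition.

def elo_update(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    expected_win = 1.0 / (1.0 + 10 ** ((loser - winner) / 400.0))
    winner += k * (1.0 - expected_win)
    loser -= k * (1.0 - expected_win)
    return winner, loser

ratings = {"model-a": 1000.0, "model-b": 1000.0}
votes = [("model-a", "model-b"), ("model-a", "model-b"), ("model-b", "model-a")]

for win, lose in votes:
    ratings[win], ratings[lose] = elo_update(ratings[win], ratings[lose])

print(ratings)  # model-a edges ahead after winning 2 of 3 votes
```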
3. Read Qualitative Reviews
Benchmark scores tell you a model got 87% on HumanEval. They don't tell you that it refuses to help with certain tasks, generates verbose boilerplate, or handles edge cases poorly.
Detailed reviews from people using models for real work often provide insights that numbers can't capture.
4. Consider the Full Picture
Beyond raw capability, practical factors matter:
- Speed: A slightly less capable model that responds in 2 seconds might be more useful than a better one that takes 30 seconds
- Cost: The best model at 10x the price isn't always the right choice
- Availability: Rate limits, uptime, and API reliability affect real-world usefulness
- Privacy: Where your data goes matters for many use cases
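A quick back-of-envelope cost comparison is often more decision-relevant than a small benchmark delta. The prices, token counts, and request volume below are hypothetical; substitute your provider's actual rates and your real workload.

```python
# Back-of-envelope cost comparison. Prices and token counts here are
# hypothetical; plug in your provider's actual per-token rates.

models = {
    "frontier-model": {"input_per_m": 10.00, "output_per_m": 30.00},
    "smaller-model":  {"input_per_m": 0.50,  "output_per_m": 1.50},
}

monthly_requests = 50_000
input_tokens, output_tokens = 1_500, 400   # per request

for name, price in models.items():
    cost = monthly_requests * (
        input_tokens / 1e6 * price["input_per_m"]
        + output_tokens / 1e6 * price["output_per_m"]
    )
    print(f"{name}: ~${cost:,.0f}/month")
```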
A Framework for Interpreting AI Announcements
When you see claims about a new model's benchmark performance, ask these questions:
What exactly was measured?
Look past the headline number to understand the specific benchmark and what it tests.
What were the evaluation conditions?
Temperature settings, number of attempts, prompt format—these all affect scores.
Who ran the evaluation?
Self-reported scores from the model creator deserve more scrutiny than independent evaluations.
What benchmarks weren't reported?
Selective reporting is common. If a model excels at one benchmark but others aren't mentioned, there may be a reason.
How does this translate to my use case?
A 5% improvement on a coding benchmark might not matter if you primarily use AI for writing.
The Bottom Line
Benchmarks serve a purpose: they provide standardised comparisons and track progress over time. But they're not the full picture, and treating them as definitive measures of "which AI is best" leads to poor decisions.
The most useful AI for you isn't necessarily the one with the highest benchmark scores. It's the one that best handles your specific tasks, fits your budget, and integrates into your workflow.
Use benchmarks as one input among many. Combine them with hands-on testing, qualitative reviews, and practical considerations. And when a company leads with benchmark scores in their marketing, remember: those numbers were chosen specifically because they make the product look good.
Key Takeaways
- Benchmarks measure narrow tasks—real-world usefulness is more complex
- Scores can be gamed—through contamination, optimisation, or selective reporting
- Your tasks matter most—test models on your actual work, not abstract problems
- Consider the full picture—speed, cost, and reliability often matter more than benchmark deltas