Resolution Diagnostics for Paired LLM Evaluation
Why AI leaderboard rankings often lack statistical proof
Many model comparisons published on major AI leaderboards rest on too little test data to declare one model better than another with confidence. The paper shows that 11 of 40 pairwise rankings on the Open LLM Leaderboard, and 4 to 6 of 9 top-tier comparisons on MMLU-Pro, fail to meet standard statistical certainty thresholds. It also shows that a widely used method for estimating the required test size can be off by a factor of two in close races.
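To make the idea of a "resolved" comparison concrete, here is a minimal Python sketch of a paired check on two models scored on the same test items. It is a textbook normal-approximation illustration, not the paper's diagnostic and not the calculation method the paper criticizes. The function names (`paired_z`, `is_resolved`, `required_items`) and the defaults of a 5% significance level and 80% power are assumptions chosen for the example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def paired_z(correct_a, correct_b):
    """Return (mean difference, z statistic) for two models on the same items.

    Each argument is a list of 0/1 scores, one per test item, in the same order.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same items")
    n = len(correct_a)
    if n < 2:
        raise ValueError("need at least two items")
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    mean = sum(diffs) / n
    # Sample variance of the per-item differences (paired, not pooled).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if var == 0:
        raise ValueError("no variation in per-item differences")
    return mean, mean / sqrt(var / n)


def is_resolved(z, alpha=0.05):
    """True if the two-sided normal test rejects 'no difference' at level alpha."""
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)


def required_items(delta, discordance, alpha=0.05, power=0.8):
    """Approximate item count needed to detect an accuracy gap of `delta`.

    `discordance` is the fraction of items on which exactly one model is
    correct; for 0/1 scores the variance of the paired difference is
    discordance - delta**2.
    """
    if not 0 < abs(delta) <= discordance <= 1:
        raise ValueError("need 0 < |delta| <= discordance <= 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    var_d = discordance - delta ** 2
    return ceil((z_alpha + z_power) ** 2 * var_d / delta ** 2)


if __name__ == "__main__":
    # Hypothetical scores: model A is right on 60 of 100 items, model B on 50.
    model_a = [1] * 60 + [0] * 40
    model_b = [1] * 50 + [0] * 50
    gap, z = paired_z(model_a, model_b)
    print(f"gap={gap:.3f}  z={z:.2f}  resolved={is_resolved(z)}")
    print("items needed for a 2-point gap at 10% discordance:",
          required_items(delta=0.02, discordance=0.10))
```

The design point in this sketch is that the required item count depends on how often the two models disagree on individual items, not only on the size of the accuracy gap, so two leaderboard entries with the same gap can need very different amounts of test data.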
When researchers or companies choose which AI model to deploy, they often treat these published leaderboards as evidence that one model outperforms another. An unresolved comparison means the ranking may reflect noise rather than a genuine performance difference, which can lead to costly or misguided adoption decisions. The calculation error identified here affects the estimate of how many test cases are needed to show that a difference is real, so correcting it could keep unsupported claims off leaderboards in the first place.