A ranking system from competitive chess applied to AI: models battle head-to-head on the same task and humans pick the winner, producing a relative skill score.
An Elo rating expresses relative skill from pairwise battles. Two models receive the same prompt, humans (or judges) pick the better output, and the winner takes points from the loser. The size of the transfer depends on how unexpected the result was, so an underdog's win moves more points than a favorite's. Over thousands of battles, ratings converge to a stable skill ordering.
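The update described above can be sketched in a few lines. This is a minimal illustration, assuming the standard logistic Elo formula with a 400-point scale and a K-factor of 32. The function names and both constants are assumptions for the example, and a given leaderboard may use different constants or a different fitting procedure.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the logistic Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Return the new (winner, loser) ratings after one battle.

    k is an assumed K-factor. The points transferred are k * (1 - expected),
    so the less expected the win, the more points change hands.
    """
    delta = k * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta


# Favorite (1600) beats underdog (1400): about 7.7 points change hands.
favorite_win = update(1600.0, 1400.0)

# Underdog (1400) beats favorite (1600): about 24.3 points change hands.
upset = update(1400.0, 1600.0)
```

The transfer is zero-sum in this sketch: whatever the winner gains, the loser gives up, so the total of the two ratings stays the same after every battle.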
Elo-based leaderboards (like Design Arena for UI/frontend generation, or chatbot arenas for conversation) capture something static benchmarks miss: real human preference on open-ended work, where there is no single correct answer.
Read Elo gaps probabilistically. On the standard 400-point Elo scale, a 100-point gap means the higher-rated model is expected to win about 64% of battles; 200 points, about 76%. Small gaps (under ~30 points, or about a 54% expected win rate) are effectively ties.
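These percentages follow from the logistic Elo expectation. The sketch below assumes the conventional 400-point scale; the function name `win_probability` is illustrative and not taken from any particular leaderboard.

```python
def win_probability(gap: float) -> float:
    """Expected win rate of the higher-rated model for a given rating gap."""
    return 1.0 / (1.0 + 10 ** (-gap / 400.0))


for gap in (30, 100, 200):
    print(f"{gap:>3}-point gap: {win_probability(gap):.0%}")

# Prints:
#  30-point gap: 54%
# 100-point gap: 64%
# 200-point gap: 76%
```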