Researchers at Ai2 propose a method for evaluating language models by measuring the signal-to-noise ratio (SNR) of benchmarks. They show that benchmarks with higher SNR yield more reliable model evaluations, and they suggest interventions that improve benchmark quality, which in turn supports better decisions during language model training and more dependable scaling predictions. They also release a dataset of 900K evaluation results on 465 models to support further research on evaluation methodology.
signal-noise
language-models
evaluation
benchmarks
decision-making