Frontier language models can often recognize when they are being evaluated, exhibiting substantial though not superhuman evaluation awareness. This capability raises concerns about the reliability of assessments and benchmarks, since models may behave differently once they detect an evaluation. Using a benchmark of 1,000 prompts drawn from a variety of datasets, the study finds that models identify evaluation scenarios better than random chance but still fall short of human performance.
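The above-chance comparison rests on a standard binary-classification setup: each prompt is labeled as coming from an evaluation or a real deployment interaction, the model assigns a probability that it is being evaluated, and those scores are compared against a chance baseline. A minimal sketch, with hypothetical labels and scores (the specific numbers, and the use of AUC as the metric, are illustrative assumptions rather than the paper's exact protocol):

```python
# Sketch of scoring evaluation-awareness as binary classification.
# Ground truth: 1 = evaluation prompt, 0 = real deployment prompt.
# Model output: probability that the prompt is an evaluation.
# AUC of 0.5 corresponds to random chance.
from sklearn.metrics import roc_auc_score

# Hypothetical labels and model probability judgments; the real
# benchmark uses 1,000 labeled prompts from multiple datasets.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
model_probs = [0.9, 0.2, 0.7, 0.35, 0.4, 0.3, 0.8, 0.5]

auc = roc_auc_score(labels, model_probs)
print(f"Model AUC: {auc:.2f} (chance = 0.50)")  # above chance, below 1.0
```

A human baseline can be scored the same way on the same prompts, which is what allows the "better than chance, worse than humans" comparison.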
evaluation-awareness ✓
+ language-models
ai-evaluation ✓
benchmarks ✓
human-comparison ✓