Quit Emailing Yourself

# evaluation → ai-safety

1 link tagged with all of: evaluation + ai-safety


Links

How to game the METR plot

The article critiques the METR plot, which charts the length of tasks AI models can complete (measured by how long those tasks take humans), and highlights its reliance on only 14 samples in the 1-4 hour range. The author argues that drawing conclusions about AI progress and safety timelines from so small a dataset is misleading, and calls for more robust metrics.

Saved by tldr-importer · Last saved February 14, 2026 · 6 min read

Tags: metr, ai-safety, horizon-length, evaluation, metrics