How to game the METR plot

6 min read | Saved February 14, 2026

metr, ai-safety, horizon-length, evaluation, metrics

Do you care about this?

The article critiques the METR plot, which tracks the horizon length of AI models: the length, in estimated human working time, of the tasks a model can complete. It highlights that the plot rested on only 14 samples in the 1-4 hour range when it was first published. The author argues that drawing conclusions about AI progress and safety timelines from so few samples is misleading and calls for more robust metrics.

If you do, here's more

The METR plot tracks horizon length: the length, measured in estimated human hours, of the tasks an AI model can complete. It has become a focal point for discussions of AI progress. When the original METR paper was released in March 2025, only 14 samples fell into the 1-4 hour range, which leaves conclusions about the capabilities of models like Claude 3.7 Sonnet resting on thin evidence. That model had a horizon length estimate of 59 minutes but showed a 0% success rate on tasks between 2 and 4 hours. The author raises concerns about drawing significant inferences regarding AGI timelines and research priorities from such a small dataset.
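To get a feel for how little a bucket of 14 tasks can pin down, here is a minimal sketch using a standard exact binomial bound. It is not taken from the article or from METR's analysis. The function name is hypothetical, and treating the tasks as independent trials with a shared success rate is an assumption made only for illustration.

```python
def zero_success_upper_bound(n_tasks, alpha=0.05):
    """One-sided exact binomial upper bound on the true success rate
    after observing 0 successes in n_tasks independent attempts."""
    # P(0 successes) = (1 - p)^n >= alpha  <=>  p <= 1 - alpha^(1/n)
    return 1.0 - alpha ** (1.0 / n_tasks)


n = 14  # size of the 1-4 hour bucket reported in the text
bound = zero_success_upper_bound(n)
per_task_swing = 1.0 / n  # change in the bucket's observed rate from one task

print(f"0/{n} is still compatible with a true success rate up to {bound:.1%}")
print(f"each task moves the observed rate by {per_task_swing:.1%}")
```

Under these assumptions, even a clean 0% on 14 tasks cannot rule out a true success rate in the high teens, and a single task flipping changes the observed rate by about seven percentage points.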

The METR plot's methodology fits a logistic curve to estimate success probability as a function of task length, which the author finds problematic. The assumed logistic form may skew results, especially since a model's success rate varies with task length. The author points out that if a model improves on specific tasks, the fitted horizon length estimates can follow a log-linear trend, making the data easy to misinterpret. With only 14 samples in the critical range, even small improvements can drastically alter the estimated horizon length.
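The sensitivity described above can be sketched with a toy fit. This is an illustrative reconstruction, not METR's code: the task list is made up, the function name `fit_horizon` is hypothetical, and the fit is a plain unregularized logistic regression on log2 of task length, solved by gradient descent. METR's actual procedure (weighting, regularization, task set) may differ.

```python
import math


def fit_horizon(tasks):
    """tasks: list of (human_minutes, succeeded) pairs.
    Fits P(success) = sigmoid(a + b * z), where z is standardized
    log2(minutes), and returns the 50% horizon in minutes."""
    xs = [math.log2(m) for m, _ in tasks]
    ys = [1.0 if ok else 0.0 for _, ok in tasks]
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    zs = [(x - mean) / std for x in xs]

    a = b = 0.0
    step = 4.0 / n  # 1/L for the summed logistic loss on standardized inputs
    for _ in range(5000):
        grad_a = grad_b = 0.0
        for z, y in zip(zs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * z)))
            grad_a += p - y
            grad_b += (p - y) * z
        a -= step * grad_a
        b -= step * grad_b

    z_half = -a / b  # where a + b * z = 0, i.e. P(success) = 0.5
    return 2.0 ** (mean + std * z_half)


# Made-up results for a hypothetical model on short tasks.
short = ([(2, True)] * 6 + [(8, True)] * 6
         + [(15, True)] * 5 + [(15, False)]
         + [(30, True)] * 4 + [(30, False)] * 2)

# 14 made-up tasks in the 1-4 hour range.
long_minutes = [60, 70, 80, 90, 100, 110, 120, 135, 150, 165, 180, 200, 220, 240]

before = short + [(m, m <= 80) for m in long_minutes]   # 3 of 14 solved
after = short + [(m, m <= 120) for m in long_minutes]   # 4 more tasks solved

h_before = fit_horizon(before)
h_after = fit_horizon(after)
print(f"50% horizon before: {h_before:.0f} min")
print(f"50% horizon after:  {h_after:.0f} min")
```

Only four outcomes differ between the two runs, yet the fitted horizon shifts. The specific numbers carry no meaning beyond the illustration, since the data are invented.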

To complicate matters further, many of the tasks behind the METR plot come from HCAST, which makes targeted training to improve performance on them possible. This suggests that labs could inflate results without necessarily intending to game the system. The author advocates for better and more comprehensive measurements of horizon lengths, emphasizing that while the concept is valuable, the current interpretation of and reliance on the METR plot go beyond what the limited data can support. The overall analysis calls for a more nuanced understanding of what these horizon lengths truly represent in the context of AI development.
