Links
The article critiques the METR plot, which tracks how long a task AI models can complete, measured by the time the task takes a human, and points out that it relies on only 14 samples in the 1-4 hour range. The author argues that drawing conclusions about AI progress and safety timelines from such a small dataset is misleading and calls for more robust metrics.