
Benchmark Scores = General Capability + Claudiness

5 min read | Saved February 14, 2026

benchmarks 🤖 general-capability 🤖 claudiness 🤖 ai-models 🤖 performance 🤖

Do you care about this?

This article argues that AI benchmark scores largely track a single dimension of "general capability." It weighs two readings of that finding: that model performance rests on one deep underlying ability, or that it is contingent on specific skills developed one at a time. The author also introduces a second dimension, "Claudiness," which marks where model capabilities diverge.

If you do, here's more

Benchmark scores for AI models share a large underlying dimension, labeled “General Capability,” which primarily drives performance across tasks. The recent Gemini 3 release illustrated this with a table showing state-of-the-art results across nineteen benchmarks. When models improve on many benchmarks at once, the scores behave like measurements of a single dimension rather than of distinct, unrelated skills. The Epoch Capabilities Index (ECI) analysis indicates that this general capability component explains about half of the variance in benchmark scores.
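To make "about half of the variance" concrete, here is a minimal sketch of how such a figure could be computed. It assumes the analysis is a principal component analysis over a models-by-benchmarks score matrix, which this summary does not state outright. The data is synthetic and invented purely for illustration, and the function name is hypothetical.

```python
import numpy as np

def explained_variance_ratio(scores: np.ndarray) -> np.ndarray:
    """Fraction of total variance captured by each principal component.

    scores: array of shape (n_models, n_benchmarks).
    """
    # Standardize each benchmark column so no single benchmark dominates.
    centered = scores - scores.mean(axis=0)
    standardized = centered / centered.std(axis=0, ddof=1)
    # Squared singular values are proportional to the component variances.
    singular_values = np.linalg.svd(standardized, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

# Synthetic, made-up data: 30 hypothetical models, 6 hypothetical benchmarks.
rng = np.random.default_rng(0)
general = rng.normal(size=(30, 1))             # shared latent "capability"
loadings = rng.uniform(0.6, 1.0, size=(1, 6))  # how strongly each benchmark tracks it
noise = rng.normal(scale=0.7, size=(30, 6))    # benchmark-specific variation
scores = general @ loadings + noise

ratios = explained_variance_ratio(scores)
print(ratios.round(2))  # first entry is the share attributable to the leading component
```

The first entry of `ratios` plays the role of the "general capability" share. How large it comes out here depends entirely on the invented noise level, so it says nothing about the real figure.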

Beyond this primary dimension sits a second component, dubbed “Claudiness,” which captures where models diverge. Models that score high on it tend to excel at agentic tasks while doing comparatively poorly on advanced math or vision tasks. The author suggests this may point to a more contingent development landscape, in which high performance in each capability requires targeted effort rather than following from a single underlying ability. Claudiness therefore adds nuance to the debate over whether AI capabilities are fundamentally interconnected or require distinct, specialized development efforts.
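A second component of this kind can be read as a contrast: it loads positively on one group of benchmarks and negatively on another. The sketch below illustrates that pattern on synthetic data built to contain such a contrast. It assumes a PCA-style decomposition, and the benchmark names, loadings, and sample size are all invented for illustration rather than taken from the article.

```python
import numpy as np

# Synthetic, made-up data; benchmark names are hypothetical placeholders.
benchmarks = ["agentic_1", "agentic_2", "agentic_3", "math_1", "math_2", "vision_1"]
rng = np.random.default_rng(1)
n_models = 200
general = rng.normal(size=(n_models, 1))    # lifts every benchmark
contrast = rng.normal(size=(n_models, 1))   # lifts agentic, lowers math/vision
general_loadings = np.ones((1, 6))
contrast_loadings = np.array([[0.6, 0.6, 0.6, -0.6, -0.6, -0.6]])
noise = rng.normal(scale=0.3, size=(n_models, 6))
scores = general @ general_loadings + contrast @ contrast_loadings + noise

# PCA via SVD on standardized columns; rows of `components` are the principal axes.
standardized = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)
_, _, components = np.linalg.svd(standardized, full_matrices=False)
pc2 = components[1]
# The sign of a principal component is arbitrary; orient it so agentic loadings are positive.
if pc2[:3].sum() < 0:
    pc2 = -pc2

for name, weight in zip(benchmarks, pc2):
    print(f"{name:>10}: {weight:+.2f}")

# Per-model position along the second component.
second_scores = standardized @ pc2
```

In this toy setup, a model with a high `second_scores` value is relatively stronger on the agentic columns than its overall level would predict, which mirrors the reading of the second component given above.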

This raises a central question for future model development: can AI keep improving on all benchmarks at once? The current trend suggests that developers have the resources and architectures to raise general performance, but sustaining that improvement may depend on continued investment and strategic focus. Models may still come to excel across many domains, yet each improvement is likely to come at a cost, which challenges the assumption that generalization will always yield easy gains across diverse tasks.
