Saved February 14, 2026
Do you care about this?
The article critiques the pass@k metric used to measure AI agents' success, arguing that it can create a misleadingly positive view of performance. It highlights that while pass@k may show high success rates through multiple attempts, real user experiences are often less forgiving. The author calls for more careful consideration and justification when using this metric in evaluating AI.
If you do, here's more
Marc Brooker, an engineer at Amazon Web Services, critiques pass@k, a commonly used metric for evaluating AI agents. He explains that pass@k measures the probability of at least one success in k attempts. For example, a die roll that wins only 5% of the time (one face of a twenty-sided die) passes about 14% of the time over three rolls and an impressive 99.4% over one hundred. At first glance the larger number looks promising, but Brooker argues that it is misleading: a model can post a high pass@k score while its single-attempt success rate stays low, just as the die still wins only 5% of the time on any one roll.
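The arithmetic behind these figures can be checked directly. The sketch below assumes each attempt is independent and succeeds with the same probability p, so that pass@k = 1 - (1 - p)^k. The function name `pass_at_k` and the choice of k values are illustrative, not taken from the article.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability of at least one success in k independent attempts.

    Assumes every attempt succeeds with the same probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if k < 0:
        raise ValueError("k must be non-negative")
    # All k attempts fail with probability (1 - p)^k; the complement
    # is the chance that at least one attempt succeeds.
    return 1.0 - (1.0 - p) ** k


# A 5% single-attempt success rate, scored at increasing k.
for k in (1, 3, 10, 100):
    print(f"pass@{k} = {pass_at_k(0.05, k):.1%}")
```

Under these assumptions the single-attempt rate never changes; only the number of tries does, which is why the score climbs toward 100% as k grows.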
Brooker highlights a disconnect between how AI agents are evaluated and how users experience them. pass@k looks better the more attempts it is given, but users expect consistent success in real interactions. A system that succeeds only once in ten tries will not satisfy them; they will view it as unreliable. He suggests that humans are exponentially unforgiving, where pass@k is exponentially forgiving: the metric needs only one success among k attempts, while users judge the system on every attempt. The gap raises particular concerns about the validity of pass@k in multi-step settings, where each step has to succeed for the task to succeed.
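One way to make the forgiving/unforgiving contrast concrete is to compare "at least one of k attempts succeeds" with "all k attempts succeed". This is an illustrative reading, not necessarily Brooker's exact formulation. It assumes independent attempts with the same success probability p, and the function names are hypothetical.

```python
def at_least_one(p: float, k: int) -> float:
    """pass@k: chance that at least one of k independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** k


def every_one(p: float, k: int) -> float:
    """Chance that all k independent attempts (or steps) succeed."""
    return p ** k


# Even a fairly reliable agent (90% per attempt) looks very different
# depending on which question is asked.
p = 0.9
for k in (1, 5, 10):
    print(f"k={k:2d}  at least one: {at_least_one(p, k):.4f}  "
          f"every one: {every_one(p, k):.4f}")
```

Under these assumptions the two quantities move in opposite directions as k grows: the first approaches 1 and the second approaches 0, both exponentially fast.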
While Brooker acknowledges that pass@k might work in specific scenarios—simple tasks with reliable evaluators and no human input—he argues it should be used sparingly and with justification. He emphasizes the need for honesty and rigor in evaluating AI agents, urging the field to adopt more accurate metrics that reflect true performance rather than inflated success rates.