2 min read | Saved February 14, 2026
Do you care about this?
This article presents Agentic Rubrics, a method for verifying software engineering agents without executing code. An expert agent grounded in the repository's context writes a checklist, and candidate patches are scored against it efficiently, giving a more interpretable alternative to traditional verification methods. The results show significant improvements over existing baselines.
If you do, here's more
Verification plays a key role in enhancing software engineering (SWE) agents: it supplies the reward signal for Reinforcement Learning and improves performance through Test-Time Scaling (TTS). Traditional verification methods depend on executing code, which becomes cumbersome at scale because each task requires its own environment setup. Alternatives such as patch classifiers and heuristic approaches exist, but they lack contextual grounding in the codebase, which makes their judgments harder to interpret.
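In the TTS setting described above, a verifier is typically used for best-of-n selection: generate several candidate patches, score each, and keep the highest-scoring one. A minimal sketch of that loop, with all names hypothetical and not taken from the paper:

```python
# Hypothetical sketch of verifier-guided test-time scaling (best-of-n).
# `verifier_score` stands in for any execution-free scorer, such as a
# rubric-based one; the names and data here are illustrative only.

def best_of_n(candidate_patches, verifier_score):
    """Return the candidate patch with the highest verifier score."""
    return max(candidate_patches, key=verifier_score)

# Example with made-up patches and precomputed scores.
patches = ["patch_a", "patch_b", "patch_c"]
scores = {"patch_a": 2.0, "patch_b": 5.0, "patch_c": 3.0}
best = best_of_n(patches, scores.get)
print(best)  # patch_b
```

The key property is that the verifier never runs the code: selection quality depends entirely on how well the score correlates with ground-truth test outcomes.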
The authors introduce Agentic Rubrics, a novel approach in which an expert agent explores the code repository to generate a context-specific checklist of criteria. Candidate patches are then scored against this checklist without executing any tests. In their experiments on SWE-Bench, the method achieved scores of 54.2% with Qwen3-Coder-30B-A3B and 40.6% with Qwen3-32B, at least a 3.5 percentage-point improvement over the best baseline in their comparison. The analysis confirms that rubric scores align with ground-truth test results, while also surfacing issues that standard tests can miss.
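The scoring side of this approach can be pictured as checking a patch against a weighted checklist and summing the satisfied weight. The sketch below is a loose illustration under stated assumptions: the criteria, weights, and string-matching "judge" are invented stand-ins, whereas in the paper the checklist is produced by an expert agent and judged by a model.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative rubric scoring. Criterion texts, weights, and the
# string-based checks are hypothetical placeholders, not the paper's API.

@dataclass
class Criterion:
    description: str
    weight: float
    check: Callable[[str], bool]  # judge: does the patch satisfy this?

def score_patch(patch: str, rubric: list[Criterion]) -> float:
    """Fraction of total rubric weight the patch satisfies (0.0 to 1.0)."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(patch))
    return earned / total

rubric = [
    Criterion("Guards against None input", 1.0, lambda p: "is not None" in p),
    Criterion("Updates the docstring", 0.5, lambda p: '"""' in p),
]
patch = 'def f(x):\n    """Handle x."""\n    if x is not None:\n        return x'
print(score_patch(patch, rubric))  # 1.0
```

Because each criterion is stated in natural language and tied to the codebase, a per-criterion breakdown of the score is what makes the verdict more interpretable than a single classifier probability.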
Further analysis of the rubric's behavior shows that gathering contextual information from the repository is vital for producing clear, codebase-specific criteria. Overall, the results indicate that Agentic Rubrics offer a more efficient, scalable, and fine-grained verification method for SWE agents, addressing key limitations of existing techniques.