2 min read
|
Saved February 14, 2026
Do you care about this?
This article introduces a method for improving Deep Research Agents (DRAs) by using a feedback system during inference. The authors present DeepVerifier, a tool that assesses the agents' outputs against detailed rubrics to enhance their performance without additional training. They also offer a dataset to aid in the development of verification capabilities for open-source models.
If you do, here's more
Recent developments in Deep Research Agents (DRAs) are changing how automated systems discover knowledge and solve problems. Rather than improving these agents only through further training, the authors propose letting them improve at inference time: each output is verified against detailed rubrics, and the agent revises its response based on the resulting feedback, with no additional training required.
The authors introduce DeepVerifier, a verification tool that uses a rubric-based system to assess agent outputs. This tool outperforms existing methods, such as agent-as-judge and LLM judge baselines, by 12% to 48% in F1 score during meta-evaluation. DeepVerifier operates as a plug-and-play module during test-time inference, providing specific feedback that agents can use to improve their answers iteratively. The method has led to accuracy gains of 8% to 11% on difficult subsets of two datasets: GAIA and XBench-DeepResearch.
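The plug-and-play verify-and-refine loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `verify_with_rubric` here uses a toy keyword check as a stand-in for DeepVerifier's LLM-based rubric judgments, and `agent`, `reviser`, and `refine_loop` are hypothetical names.

```python
def verify_with_rubric(answer: str, rubric: list[str]) -> tuple[bool, list[str]]:
    """Check an answer against rubric criteria; return (passed, unmet criteria).

    Toy stand-in: a criterion 'passes' if its keyword appears in the answer.
    The real verifier would score each rubric item with a model.
    """
    unmet = [c for c in rubric if c.lower() not in answer.lower()]
    return (not unmet, unmet)

def refine_loop(question: str, rubric: list[str], agent, reviser,
                max_rounds: int = 3) -> str:
    """Run the agent, then iteratively revise its answer using verifier feedback."""
    answer = agent(question)
    for _ in range(max_rounds):
        passed, unmet = verify_with_rubric(answer, rubric)
        if passed:
            break
        # Feed the specific unmet criteria back as targeted critique.
        answer = reviser(question, answer, unmet)
    return answer
```

The key design point reflected here is that the verifier returns *specific* unmet criteria rather than a single pass/fail bit, which is what lets the agent improve iteratively at test time.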
To support further research and development, the authors have released DeepVerifier-4K, a dataset containing 4,646 carefully selected examples focused on DRA verification. This dataset emphasizes the importance of self-assessment and critique, enabling open-source models to enhance their verification capabilities. By classifying agent failures into five main categories and thirteen sub-categories, the authors provide a structured approach to understanding and improving DRA performance.
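For context on the meta-evaluation numbers quoted above, F1 can be computed by treating the verifier's pass/fail judgments as binary predictions against gold correctness labels. A minimal sketch (standard F1 definition; the exact meta-evaluation protocol is the authors'):

```python
def f1_score(preds: list[bool], gold: list[bool]) -> float:
    """F1 of binary verifier judgments against gold labels."""
    tp = sum(p and g for p, g in zip(preds, gold))      # true positives
    fp = sum(p and not g for p, g in zip(preds, gold))  # false positives
    fn = sum(g and not p for p, g in zip(preds, gold))  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```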