GitHub - THUDM/CaRR: This repository contains the code and data for the paper "Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards".

2 min read | Saved February 14, 2026

reinforcement-learning, deep-search, rubric-rewards, evidence-grounding, policy-optimization

Do you care about this?

The paper behind this repository presents Citation-aware Rubric Rewards (CaRR), a framework for improving reinforcement learning for deep search agents. It targets shortcut exploitation and hallucinations by rewarding comprehensive reasoning and evidence-grounded answers rather than final-answer correctness alone. The method outperforms standard outcome-based training on several deep search benchmarks.

If you do, here's more

The paper introduces Citation-aware Rubric Rewards (CaRR), a framework aimed at improving reinforcement learning (RL) for deep search agents. Traditional RL methods mainly use a binary reward based on whether the final answer is correct, which can lead to shortcut exploitation and hallucinations: an agent can reach the right answer without thorough reasoning, or by relying on incomplete information, and still receive full reward. CaRR addresses this with a fine-grained reward that emphasizes reasoning quality, factual grounding, and evidence connectivity.

CaRR decomposes each complex question into smaller, verifiable rubrics that an agent must satisfy to earn reward. The rubrics check three things: that all relevant entities are identified, that statements are backed by cited content, and that the evidence forms a clear chain leading to the final answer. The paper also presents Citation-aware Group Relative Policy Optimization (C-GRPO), which augments Group Relative Policy Optimization (GRPO) with weighted rubric rewards. Together, the two are intended to promote more comprehensive reasoning and higher accuracy during training.
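The idea of adding a weighted rubric reward to a group-relative advantage can be sketched roughly as follows. This is a minimal illustration, not the repository's implementation: the `Rollout` class, the `mixed_reward` function, the mixing weight `alpha`, and the linear mixing formula are all assumptions made for this example, and the rubric judgments are taken as given rather than produced by a judge model.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """Hypothetical container for one sampled trajectory's judgments."""
    outcome_correct: bool          # final answer matches the reference
    rubrics_satisfied: list[bool]  # one flag per rubric, judged elsewhere


def rubric_score(rollout: Rollout) -> float:
    """Fraction of rubrics the rollout satisfies (0.0 if there are none)."""
    if not rollout.rubrics_satisfied:
        return 0.0
    return sum(rollout.rubrics_satisfied) / len(rollout.rubrics_satisfied)


def mixed_reward(rollout: Rollout, alpha: float = 0.3) -> float:
    """Assumed mixing rule: (1 - alpha) * outcome + alpha * rubric fraction."""
    outcome = 1.0 if rollout.outcome_correct else 0.0
    return (1.0 - alpha) * outcome + alpha * rubric_score(rollout)


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        Rollout(True, [True, True, True]),     # correct, fully grounded
        Rollout(True, [True, False, False]),   # correct, but via a shortcut
        Rollout(False, [True, True, False]),   # wrong, partially grounded
    ]
    rewards = [mixed_reward(r) for r in group]
    for r, a in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={r:.2f} advantage={a:+.2f}")
```

With `alpha=0.0` the first two rollouts receive the same reward, which is the outcome-only situation described above; with a positive `alpha`, the fully grounded rollout is ranked above the shortcut one within the group.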

Experiments with the Qwen3-4B-Thinking-2507 and Qwen3-30B-A3B-Thinking-2507 models, trained on the DeepDive dataset, show that C-GRPO consistently outperforms standard outcome-based GRPO across multiple deep search benchmarks. C-GRPO agents also make effective use of longer context budgets, which improves their performance on open-ended research tasks. The agents access the web through APIs such as Serper and Jina.
