Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text

2 min read | Saved February 14, 2026

reinforcement-learning, dataset, language-models, task-synthesis, cybersecurity

Do you care about this?

The article presents Golden Goose, a method for creating an unlimited supply of Reinforcement Learning with Verifiable Rewards (RLVR) tasks from unverifiable internet text. The authors use it to build GooseReason-0.7M, a dataset of over 700,000 tasks spanning mathematics, programming, and general science. Training on this data improves model performance, including in cybersecurity, a domain where no RLVR data previously existed.

If you do, here's more

The paper introduces Golden Goose, a method for synthesizing an unlimited number of Reinforcement Learning with Verifiable Rewards (RLVR) tasks from unverifiable text found on the internet. RLVR has been constrained by the limited supply of verifiable data, so performance gains level off as training continues. Golden Goose addresses this by identifying key reasoning steps in a source text and turning them into multiple-choice questions with plausible distractors. Because the correct option is known, each question can be checked automatically. This lets researchers use rich, reasoning-heavy sources, such as science textbooks, that were previously considered unusable as RLVR data.
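The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the key reasoning step and the distractors are passed in by hand here rather than produced by a model, and the `[MASK]` placeholder, the `MCQTask` type, the `build_task` and `reward` functions, and the `Answer: X` reply format are all hypothetical names chosen for this example.

```python
import random
import re
from dataclasses import dataclass

MASK = "[MASK]"       # hypothetical placeholder for the hidden reasoning step
LETTERS = "ABCDEFGH"  # this sketch supports at most eight options


@dataclass
class MCQTask:
    prompt: str         # passage with the key step hidden, plus lettered options
    options: list[str]  # shuffled options, in the order shown in the prompt
    answer: str         # letter of the correct option


def build_task(passage: str, key_step: str, distractors: list[str], seed: int = 0) -> MCQTask:
    """Hide one reasoning step and offer it among distractors (illustrative only)."""
    if key_step not in passage:
        raise ValueError("key_step must occur verbatim in passage")
    options = [key_step, *distractors]
    if len(options) > len(LETTERS):
        raise ValueError("too many options for this sketch")
    random.Random(seed).shuffle(options)  # seeded so the task is reproducible
    letters = LETTERS[: len(options)]
    answer = letters[options.index(key_step)]
    lines = [passage.replace(key_step, MASK, 1), "", "Which option fills the blank?"]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    return MCQTask(prompt="\n".join(lines), options=options, answer=answer)


def reward(task: MCQTask, model_output: str) -> float:
    """Verifiable reward: 1.0 if the last 'Answer: X' names the correct letter, else 0.0."""
    found = re.findall(r"ANSWER:\s*([A-H])\b", model_output.upper())
    return 1.0 if found and found[-1] == task.answer else 0.0


task = build_task(
    passage="Water boils at 100 C at sea level. At altitude the pressure is lower, "
            "so the boiling point drops. Food therefore cooks more slowly.",
    key_step="so the boiling point drops",
    distractors=["so the boiling point rises", "so water freezes sooner"],
)
print(task.prompt)
print(reward(task, f"Answer: {task.answer}"))
```

The point of the multiple-choice form is that the reward needs no judge model: checking a letter against the stored answer is enough, even though the source passage itself was never verifiable.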

From this process, the authors created GooseReason-0.7M, a dataset containing over 700,000 tasks across mathematics, programming, and general science. They show that training on this dataset revives progress in models that had saturated on existing RLVR data. The paper reports state-of-the-art results for 1.5B and 4B-Instruct models across 15 benchmarks.

The authors also applied Golden Goose in a practical setting, generating RLVR tasks from FineWeb scrapes related to cybersecurity, a field where no prior RLVR data existed. Training the Qwen3-4B-Instruct model on the resulting dataset, named GooseReason-Cyber, outperformed a 7B domain-specific model. The authors take this as evidence that RLVR data can be scaled using readily available internet text.
