Reinforcement Learning on Pre-Training Data (RLPT) introduces a new paradigm for scaling large language models (LLMs) by allowing the policy to autonomously explore meaningful trajectories from pre-training data without relying on human annotations for rewards. By adopting a next-segment reasoning objective, RLPT improves LLM capabilities, as demonstrated by significant performance gains on various reasoning benchmarks and encouraging broader context exploration for enhanced generalization.
reinforcement-learning ✓
pre-training ✓
language-models ✓
scaling ✓
+ reasoning