7 min read | Saved February 14, 2026
Do you care about this?
The article compares the learning efficiency of reinforcement learning (RL) and supervised learning, highlighting that RL needs far more compute to extract meaningful feedback. The information per sample is generally lower in RL, especially early in training, which leads to noisy gradient estimates and less efficient learning. The author emphasizes keeping tasks near a 50% pass rate, where each reward carries the most information.
If you do, here's more
Reinforcement learning (RL) uses data far less efficiently than supervised learning. In supervised learning, every token provides a direct learning signal, giving clear feedback on each mistake. In RL, a complete trajectory of decisions yields only a single reward signal, so each sample teaches the model much less. The article introduces the decomposition Bits/FLOP = (Samples/FLOP) × (Bits/Sample) to compare the two methods, and argues that the information per sample in RL is much lower, especially early in training.
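To make the Bits/Sample factor concrete, here is a small sketch with illustrative numbers of my own (not figures from the article): a supervised sample yields roughly its token count times the per-token loss in bits, while a binary pass/fail reward on a whole trajectory yields at most the entropy of the outcome.

```python
import math

def supervised_bits_per_sample(avg_loss_nats: float, tokens: int) -> float:
    """Cross-entropy loss is the surprisal of the correct token, so each
    token yields about loss/ln(2) bits of feedback."""
    return tokens * avg_loss_nats / math.log(2)

def rl_bits_per_sample(pass_rate: float) -> float:
    """A binary pass/fail reward carries at most the entropy of the
    outcome, H(p) bits, which is maximized at p = 0.5."""
    p = pass_rate
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Illustrative (assumed) numbers: a 1,000-token supervised sample at
# 0.5 nats/token vs. one RL rollout with a 10% pass rate.
print(supervised_bits_per_sample(0.5, 1000))  # ~721 bits
print(rl_bits_per_sample(0.10))               # ~0.47 bits
```

Folding in the Samples/FLOP factor widens the gap further, since an RL "sample" is an entire rollout rather than a single token prediction.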
A randomly initialized model in supervised learning learns quickly from its errors, shifting probability toward the correct answers. In RL, an untrained model rarely produces a correct answer at all, so learning is inefficient. The author explains how the pass rate governs the information gained per sample. In supervised learning, the more uncertain the model is about the answer, the more information the correct label carries. In RL, learning is maximized at a pass rate around 50%, but reaching that rate early in training is difficult.
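The contrast can be sketched numerically (the function names here are my own, not the article's): the information in a correct label is its surprisal under the model, which grows as the model becomes less certain, while the information in a pass/fail reward is the binary entropy of the pass rate, which peaks at 50% and collapses toward zero at either extreme.

```python
import math

def label_bits(model_prob: float) -> float:
    # Supervised: the correct label is worth its surprisal under the model,
    # so a less-expected answer teaches more.
    return -math.log2(model_prob)

def reward_bits(pass_rate: float) -> float:
    # RL: one pass/fail reward carries at most H(pass_rate) bits.
    p = pass_rate
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.01, 0.10, 0.50, 0.90, 0.99):
    print(f"p={p:.2f}  label: {label_bits(p):5.2f} bits  "
          f"reward: {reward_bits(p):.3f} bits")
```

The table this prints shows the asymmetry: at a 1% pass rate the supervised label is worth over 6 bits, while the RL reward is worth under 0.1.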
The article also discusses variance in training. In RL, early training produces noisy gradient estimates, making learning unpredictable. In supervised learning, by contrast, variance grows as training progresses, and returns diminish as the model exhausts the learnable information. The author suggests strategies for improving RL efficiency, such as curriculum learning that matches task difficulty to model capability, and proxy signals that provide denser feedback.
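One way to see the noise and curriculum points together, as a rough sketch rather than the article's actual method: with n pass/fail rollouts per task, a batch that is all failures (or all passes) has zero advantage everywhere and produces no gradient, so a simple curriculum heuristic is to favor the tasks least likely to yield such uninformative batches.

```python
def zero_signal_prob(p: float, n: int) -> float:
    """Probability that a batch of n pass/fail rollouts is all-fail or
    all-pass, in which case every advantage is zero and the policy
    gradient vanishes."""
    return (1 - p) ** n + p ** n

# Hypothetical per-task pass-rate estimates (assumed, not from the article).
tasks = {"easy": 0.95, "medium": 0.55, "hard": 0.10, "impossible": 0.0}

# Curriculum heuristic: train first on tasks least likely to waste a batch.
ranked = sorted(tasks, key=lambda t: zero_signal_prob(tasks[t], n=8))
print(ranked)  # ['medium', 'hard', 'easy', 'impossible']
```

Tasks near a 50% pass rate rank first, matching the point above: that is where each batch is most likely to contain both successes and failures and hence a usable gradient.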