Power sampling from the base model achieves performance comparable to, or surpassing, RL post-training across a range of reasoning tasks, including MATH500, HumanEval, and GPQA Diamond. Notably, in-domain results on MATH500 are nearly equal to GRPO's, while out-of-domain results, particularly on HumanEval and AlpacaEval 2.0, show power sampling outperforming GRPO without altering the base model's weights.
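To make the idea concrete, below is a minimal sketch of power sampling at the token level. It assumes the simplest interpretation: raising the model's next-token distribution to a power alpha and renormalizing, which is algebraically equivalent to sampling at temperature 1/alpha (sampling from the power distribution over whole sequences generally requires more involved approximate inference, which this sketch does not attempt). The function name `power_sample` and the parameter `alpha` are illustrative, not from the original source.

```python
import numpy as np

def power_sample(logits: np.ndarray, alpha: float, rng: np.random.Generator) -> int:
    """Sample one token from the next-token distribution raised to the
    power alpha and renormalized.

    Since p_i = exp(l_i) / Z, we have p_i**alpha / sum_j p_j**alpha
    == softmax(alpha * logits)_i, so token-level power sampling is
    equivalent to sampling at temperature 1/alpha.
    (Illustrative sketch; not the original authors' implementation.)
    """
    scaled = alpha * logits
    scaled -= scaled.max()          # subtract max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Example: alpha > 1 sharpens the distribution toward high-probability
# tokens, concentrating samples on the model's most likely continuations.
rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.5, -1.0])
samples = [power_sample(logits, alpha=4.0, rng=rng) for _ in range(10)]
print(samples)  # mostly index 0, the highest-logit token
```

The appeal of this family of methods, as the summary above notes, is that the sharpening happens entirely at inference time: the base model's weights are never updated.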