1 link tagged with all of: reinforcement-learning + reward-modeling
Links
The paper studies how to improve reward modeling in reinforcement learning for large language models, with a focus on inference-time scalability. It introduces Self-Principled Critique Tuning (SPCT) to improve generative reward modeling and proposes a meta reward model to optimize performance during inference. Empirical results show that SPCT significantly improves the quality and scalability of reward models compared with existing methods.
Tags: reinforcement-learning, reward-modeling, large-language-models, inference-scaling, generative-models