Reinforcement Learning (RL) techniques, particularly the Group Relative Policy Optimization (GRPO) algorithm, have been used to substantially improve the mathematical reasoning capabilities of language models. The study highlights how sound infrastructure, diverse training data, and effective training practices improve performance, while also addressing challenges such as model collapse and bias in advantage estimation.
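Conceptually, GRPO avoids a learned value function by sampling a group of completions per prompt and normalizing each completion's reward against the rest of its group. The sketch below illustrates that advantage computation only; the shapes, the epsilon term, and dividing by the group standard deviation are illustrative choices, not details taken from the study.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantages in the GRPO style.

    rewards: tensor of shape (num_prompts, group_size), one scalar reward per
    sampled completion. Each advantage is the reward normalized against the
    other completions sampled for the same prompt.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.5, 0.5, 0.5, 0.5]])
print(group_relative_advantages(rewards))
# The second prompt's identical rewards yield near-zero advantages, one reason
# reward shaping and data diversity matter in practice.
```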
Liger enhances TRL's GRPO trainer by cutting memory consumption during training by 40% without sacrificing model quality. The integration also adds support for Fully Sharded Data Parallel (FSDP) and Parameter-Efficient Fine-Tuning (PEFT), enabling scalable training across multiple GPUs. Additionally, the Liger loss can be paired with vLLM for accelerated text generation during training.
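As a rough illustration of how these pieces fit together, the sketch below enables the Liger loss and vLLM generation through TRL's `GRPOConfig` and passes a PEFT config to `GRPOTrainer`. It assumes a TRL release exposing the `use_liger_loss` and `use_vllm` flags; the model name, dataset, and reward function are placeholders, and FSDP sharding would be configured separately via `accelerate launch` rather than in code.

```python
# Minimal sketch: Liger GRPO loss + vLLM generation + LoRA (PEFT) in TRL.
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 100 characters.
    return [-abs(100 - len(c)) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

args = GRPOConfig(
    output_dir="qwen2-grpo-liger",
    use_liger_loss=True,   # chunked Liger GRPO loss to reduce memory use
    use_vllm=True,         # generate completions with vLLM during training
    num_generations=8,     # group size for group-relative advantages
    per_device_train_batch_size=8,
    bf16=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=args,
    train_dataset=dataset,
    peft_config=LoraConfig(task_type="CAUSAL_LM"),  # LoRA keeps trainable params small
)
trainer.train()
```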