TRL has introduced co-located vLLM to improve the efficiency of training large language models by allowing both training and inference to run on the same GPUs, eliminating idle time and reducing hardware costs. This integration enhances throughput, simplifies deployment, and makes the system more robust for online learning setups like GRPO. The new approach is supported by a series of performance experiments demonstrating significant speedups compared to traditional server setups.
The repository provides an implementation of the method "Learning Compact Vision Tokens for Efficient Large Multimodal Models," which enhances inference efficiency by fusing spatial-adjacent vision tokens and introducing a Multi-Block Token Fusion module. Experimental results show that this approach achieves competitive performance on various vision-language benchmarks while using only 25% of the baseline vision tokens.