Vision-Zero is a framework that improves vision-language models (VLMs) through competitive visual games, without requiring human-labeled data. It achieves state-of-the-art performance on a range of reasoning tasks, demonstrating that self-play can improve model capabilities while significantly reducing training costs. The framework supports diverse datasets, including synthetic, chart-based, and real-world images, which shows its versatility in fine-grained visual reasoning tasks.
vision-language
self-play
reinforcement-learning
model-training
gamification