2 links tagged with all of: reinforcement-learning + vision-language
Click any tag below to further narrow down your results
Links
This article presents a new approach for predicting image locations on Earth by integrating map-based reasoning into large vision-language models. It develops a two-stage optimization method that combines reinforcement learning with test-time scaling to enhance prediction accuracy. The authors introduce MAPBench, a benchmark for evaluating geolocalization performance on real-world images.
Vision-Zero is a novel framework that enhances vision-language models (VLMs) through competitive visual games without requiring human-labeled data. It achieves state-of-the-art performance in various reasoning tasks, demonstrating that self-play can effectively improve model capabilities while significantly reducing training costs. The framework supports diverse datasets, including synthetic, chart-based, and real-world images, showcasing its versatility and effectiveness in fine-grained visual reasoning tasks.