Google has launched two new models in the Gemini family, Gemini 2.5 Pro and Gemini 2.5 Flash, which significantly enhance video understanding capabilities. The Pro model achieves state-of-the-art performance in various benchmarks and enables innovative applications like interactive learning tools and dynamic animations from video content. Both models facilitate advanced video processing and offer cost-effective solutions for diverse use cases in education and content creation.
VistaDPO is a new framework for optimizing video understanding in Large Video Models (LVMs) by aligning text-video preferences at three hierarchical levels: instance, temporal, and perceptive. The authors introduce a dataset, VistaDPO-7k, consisting of 7.2K annotated QA pairs to address the challenges of video-language misalignment and hallucinations, showing significant performance improvements in various benchmarks.