TimeScope is an open-source benchmark that evaluates the understanding of long videos by vision-language models through localized retrieval, information synthesis, and fine-grained temporal perception. By inserting short video clips into longer videos, it challenges models to demonstrate true temporal comprehension rather than surface-level recognition, revealing that many state-of-the-art models struggle with these tasks. The benchmark aims to drive improvements in how multimodal systems are trained and evaluated.
+ video-benchmark
multimodal-ai ✓
temporal-comprehension ✓
vision-language ✓
open-source ✓