SpatialScore introduces a comprehensive benchmark for evaluating multimodal large language models (MLLMs) in spatial understanding, consisting of the VGBench dataset and an extensive collection of 28K samples. It features the SpatialAgent, a multi-agent system designed for enhanced spatial reasoning, and reveals persistent challenges and improvements in spatial tasks through quantitative and qualitative evaluations.
Daily-Omni is introduced as a new benchmark for audio-visual reasoning, featuring 684 videos and 1197 QA pairs across various tasks. The study highlights the challenges faced by current multimodal large language models in integrating audio and visual information, while demonstrating that combining visual and audio models with temporal alignment techniques can enhance performance. The paper also presents a QA generation pipeline to improve efficiency and scalability in evaluation.