4 links tagged with all of: reasoning + multimodal
Links
Daily-Omni is introduced as a new benchmark for audio-visual reasoning, comprising 684 videos and 1,197 QA pairs across diverse task types. The study highlights how much current multimodal large language models struggle to integrate audio and visual information, while showing that combining visual and audio models with temporal alignment techniques improves performance. The paper also presents a QA generation pipeline that makes building and scaling the evaluation more efficient.
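The temporal alignment idea can be pictured with a small sketch: pair audio and visual events whose time spans overlap, so a downstream model can reason over a single, time-ordered event stream. The event structure, function name, and overlap threshold below are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Event:
    start: float       # seconds
    end: float          # seconds
    description: str

def align_events(visual, audio, min_overlap=0.5):
    """Pair visual and audio events whose time spans overlap by at least
    `min_overlap` seconds, returning a merged, time-ordered stream."""
    merged = []
    for v in visual:
        for a in audio:
            overlap = min(v.end, a.end) - max(v.start, a.start)
            if overlap >= min_overlap:
                merged.append((max(v.start, a.start),
                               f"[visual] {v.description} | [audio] {a.description}"))
    return sorted(merged)

# Toy usage: captions from a visual captioner and an audio-tagging model.
visual = [Event(0.0, 4.0, "a person opens a door"), Event(5.0, 9.0, "a dog runs across the room")]
audio  = [Event(0.5, 3.5, "door creak"), Event(6.0, 8.0, "barking")]
for t, desc in align_events(visual, audio):
    print(f"{t:5.1f}s  {desc}")
```

The merged stream could then be fed to a language model as text, which is one simple way to give it jointly grounded audio-visual context.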
Kimi-VL is an open-source Mixture-of-Experts vision-language model that excels in multimodal reasoning and long-context understanding with only 2.8B activated parameters. It demonstrates superior performance in various tasks such as multi-turn interactions, video comprehension, and mathematical reasoning, competing effectively with larger models while maintaining efficiency. The latest variant, Kimi-VL-A3B-Thinking-2506, enhances reasoning and visual perception capabilities, achieving state-of-the-art results in several benchmarks.
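As a rough sketch of how an open-weight VLM like this might be queried, the snippet below uses the Hugging Face transformers API. The repository ID, chat message format, image path, and generation settings are assumptions for illustration; the model card should be consulted for the actual usage.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Hypothetical repository ID; check the model card for the real name and usage.
MODEL_ID = "moonshotai/Kimi-VL-A3B-Thinking"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

image = Image.open("chart.png")  # placeholder input image
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "What trend does this chart show?"}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(output[0], skip_special_tokens=True))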
InternVL3.5 introduces a new family of open-source multimodal models that improve versatility, reasoning capability, and inference efficiency. A key innovation is the Cascade Reinforcement Learning framework, which yields large gains on reasoning tasks, while a Visual Resolution Router adapts visual token resolution to reduce inference cost. The models achieve notable performance gains and support advanced capabilities such as GUI interaction and embodied agency, positioning them competitively against leading commercial models.
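The general idea of routing images to different visual token budgets can be illustrated with a toy heuristic that spends more tokens on detail-heavy images. The edge-density proxy, thresholds, and budget sizes below are invented for illustration and are not the paper's Visual Resolution Router.

```python
import numpy as np
from PIL import Image

def route_visual_resolution(image: Image.Image, budgets=(64, 256, 1024)) -> int:
    """Toy router: pick a visual-token budget from a coarse estimate of image
    detail (edge density of a downscaled grayscale copy). Illustrative only."""
    gray = np.asarray(image.convert("L").resize((128, 128)), dtype=np.float32)
    # Mean absolute gradient as a cheap proxy for fine detail in the image.
    edges = np.abs(np.diff(gray, axis=0)).mean() + np.abs(np.diff(gray, axis=1)).mean()
    if edges < 5.0:
        return budgets[0]   # near-uniform image: few tokens suffice
    if edges < 15.0:
        return budgets[1]
    return budgets[2]       # dense detail (text, charts): spend more tokens

print(route_visual_resolution(Image.new("RGB", (512, 512), "white")))  # -> 64
```

The appeal of any such routing scheme is that easy images stop paying the token cost of hard ones, which is where the inference-efficiency gains come from.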
Vision Language Models (VLMs) have evolved significantly over the past year, with advances in any-to-any architectures and reasoning capabilities and the emergence of multimodal agents. Notable trends include smaller yet capable models, new alignment techniques, and Vision-Language-Action models that extend VLMs to robotic control. The article surveys these developments and offers model recommendations across the rapidly growing field.