Links
Gemini 3 is Google's latest AI model series, focused on advanced reasoning and multimodal tasks. It includes several variants, such as Pro, Flash, and Pro Image, each tailored to specific needs. The article covers key features, API usage, pricing, and the new parameters for controlling model behavior.
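As a rough illustration of the API usage the article discusses, here is a minimal sketch using the google-genai Python SDK. The model ID and the thinking_level setting reflect the article's description of Gemini 3's new behavior-control parameters, but both are assumptions that should be checked against the current documentation.

```python
# Minimal sketch of calling a Gemini 3 model via the google-genai Python SDK.
# The model ID and the thinking_level value are assumptions based on the
# article's description; verify both against the current docs.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # preview-era model ID; may have changed
    contents="Summarize the trade-offs between the Pro and Flash variants.",
    config=types.GenerateContentConfig(
        # Reasoning-depth control described in the article (replaces the
        # older thinking_budget-style setting on earlier model generations).
        thinking_config=types.ThinkingConfig(thinking_level="high"),
    ),
)
print(response.text)
```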
This article presents the codebase for a study of how unified multimodal models (UMMs) enhance reasoning by integrating visual generation. The research introduces VisWorld-Eval, a new evaluation suite that assesses multimodal reasoning capabilities across a range of tasks. Experiments show that interleaved visual-verbal reasoning outperforms purely verbal reasoning in specific contexts.
Daily-Omni is introduced as a new benchmark for audio-visual reasoning, featuring 684 videos and 1197 QA pairs across various tasks. The study highlights the challenges faced by current multimodal large language models in integrating audio and visual information, while demonstrating that combining visual and audio models with temporal alignment techniques can enhance performance. The paper also presents a QA generation pipeline to improve efficiency and scalability in evaluation.
Kimi-VL is an open-source Mixture-of-Experts vision-language model that excels in multimodal reasoning and long-context understanding with only 2.8B activated parameters. It demonstrates superior performance in various tasks such as multi-turn interactions, video comprehension, and mathematical reasoning, competing effectively with larger models while maintaining efficiency. The latest variant, Kimi-VL-A3B-Thinking-2506, enhances reasoning and visual perception capabilities, achieving state-of-the-art results in several benchmarks.
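Since Kimi-VL is released as open weights, a short loading sketch with Hugging Face transformers may be useful. The repository ID is inferred from the variant name in the summary and the exact processing API may differ; the model card is the canonical reference.

```python
# Sketch of loading the Kimi-VL-A3B-Thinking-2506 checkpoint with Hugging
# Face transformers. The repo ID is inferred from the model name above;
# consult the model card for the canonical usage.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "moonshotai/Kimi-VL-A3B-Thinking-2506"  # assumed repo ID
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # custom MoE vision-language architecture
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# One image plus a text prompt, formatted with the model's chat template.
image = Image.open("example.png")
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "example.png"},
        {"type": "text", "text": "Solve the problem shown in the image step by step."},
    ]},
]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```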
InternVL3.5 introduces a new family of open-source multimodal models with improved versatility, reasoning capability, and inference efficiency. A key innovation is the Cascade Reinforcement Learning framework, which significantly improves performance on reasoning tasks, while a Visual Resolution Router optimizes the resolution of visual tokens. The models achieve notable performance gains and support advanced capabilities such as GUI interaction and embodied agency, positioning the family competitively against leading commercial models.
Vision Language Models (VLMs) have evolved significantly over the past year, with advances in any-to-any architectures, stronger reasoning capabilities, and the emergence of multimodal agents. New trends include smaller yet powerful models, innovative alignment techniques, and the introduction of Vision-Language-Action models that enhance robotic interaction. The article highlights key developments and offers model recommendations in this rapidly growing field.