4 links tagged with all of: open-source + vision-language
Links
MolmoAct is an Action Reasoning Model (ARM) designed to bring spatial reasoning to robotics, letting machines understand and carry out tasks in three-dimensional space. Built on the open-source Molmo vision-language model, MolmoAct uses depth-aware perception tokens to improve action planning and execution, and reports strong performance and generalization in real-world scenarios. The model is fully open source, supporting transparency and accessibility for further research and development in the field.
TimeScope is an open-source benchmark that probes how well vision-language models understand long videos through three task types: localized retrieval, information synthesis, and fine-grained temporal perception. By splicing short video clips into much longer ones, it forces models to demonstrate genuine temporal comprehension rather than surface-level recognition, and its results show that many state-of-the-art models struggle with these tasks. The benchmark aims to drive improvements in how multimodal systems are trained and evaluated.
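The splicing setup described above is a needle-in-a-haystack protocol. Below is a minimal Python sketch of that idea, not TimeScope's actual code: it treats frames as simple list elements, and the function name insert_needle and the placeholder prediction are hypothetical, introduced only for illustration.

    # Sketch of a needle-in-a-haystack video probe: splice a short "needle"
    # clip into a long base video at a random offset, then check whether a
    # model's answer localizes it. Frames are stand-ins (string IDs); real
    # videos would be arrays of decoded frames.
    import random

    def insert_needle(base_frames, needle_frames, seed=0):
        """Return the spliced frame list plus the ground-truth needle span."""
        rng = random.Random(seed)
        start = rng.randint(0, len(base_frames))  # insertion offset
        spliced = base_frames[:start] + needle_frames + base_frames[start:]
        return spliced, (start, start + len(needle_frames))

    # Toy example: a 1,000-frame haystack and a 10-frame needle.
    haystack = [f"base_{i}" for i in range(1000)]
    needle = [f"needle_{i}" for i in range(10)]
    video, (lo, hi) = insert_needle(haystack, needle)

    # Localized retrieval credits the model only if its predicted span
    # overlaps the ground-truth span; recognizing the needle's content
    # without the right temporal position does not count.
    predicted = (lo - 2, hi + 2)  # placeholder for a model's output
    overlap = max(0, min(hi, predicted[1]) - max(lo, predicted[0]))
    print(f"needle at frames [{lo}, {hi}); overlap with prediction: {overlap}")

Randomizing the insertion point is what separates temporal comprehension from pattern matching: a model cannot succeed by memorizing where interesting content usually appears.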
Kimi-VL is an open-source Mixture-of-Experts vision-language model that delivers strong multimodal reasoning and long-context understanding while activating only 2.8B parameters. It performs well across tasks such as multi-turn interactions, video comprehension, and mathematical reasoning, competing with much larger models while staying efficient. The latest variant, Kimi-VL-A3B-Thinking-2506, further improves reasoning and visual perception, achieving state-of-the-art results on several benchmarks.
SmolVLA is a compact, open-source Vision-Language-Action model for robotics that runs on consumer hardware and is trained on community-shared datasets. It outperforms much larger models on both simulation and real-world tasks while delivering faster response times through asynchronous inference. Its lightweight architecture and efficient training recipe aim to democratize access to advanced robotics capabilities.