4 links tagged with all of: open-source + vision-language
Links
MolmoAct is an Action Reasoning Model (ARM) designed to bring spatial reasoning to robotics, letting machines understand and carry out tasks in three-dimensional space. Built on the open-source Molmo vision-language model, MolmoAct uses depth-aware perception tokens to improve action planning and execution, and reports strong performance and generalization in real-world scenarios. The model is fully open source, supporting transparency and accessibility for further research and development in the field.
TimeScope is an open-source benchmark that probes how well vision-language models understand long videos through three task types: localized retrieval, information synthesis, and fine-grained temporal perception. By splicing short video clips into much longer ones, it forces models to demonstrate genuine temporal comprehension rather than surface-level recognition, and its results show that many state-of-the-art models struggle with these tasks. The benchmark aims to drive improvements in how multimodal systems are trained and evaluated.
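The splicing setup described above is a needle-in-a-haystack protocol. Below is a minimal Python sketch of that idea, not TimeScope's actual code: it treats frames as simple list elements, and the function name insert_needle and the placeholder prediction are hypothetical, introduced only for illustration.

    # Sketch of a needle-in-a-haystack video probe: splice a short "needle"
    # clip into a long base video at a random offset, then check whether a
    # model's answer localizes it. Frames are stand-ins (string IDs); real
    # videos would be arrays of decoded frames.
    import random

    def insert_needle(base_frames, needle_frames, seed=0):
        """Return the spliced frame list plus the ground-truth needle span."""
        rng = random.Random(seed)
        start = rng.randint(0, len(base_frames))  # insertion offset
        spliced = base_frames[:start] + needle_frames + base_frames[start:]
        return spliced, (start, start + len(needle_frames))

    # Toy example: a 1,000-frame haystack and a 10-frame needle.
    haystack = [f"base_{i}" for i in range(1000)]
    needle = [f"needle_{i}" for i in range(10)]
    video, (lo, hi) = insert_needle(haystack, needle)

    # Localized retrieval credits the model only if its predicted span
    # overlaps the ground-truth span; recognizing the needle's content
    # without the right temporal position does not count.
    predicted = (lo - 2, hi + 2)  # placeholder for a model's output
    overlap = max(0, min(hi, predicted[1]) - max(lo, predicted[0]))
    print(f"needle at frames [{lo}, {hi}); overlap with prediction: {overlap}")

Randomizing the insertion point is what separates temporal comprehension from pattern matching: a model cannot succeed by memorizing where interesting content usually appears.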
Kimi-VL is an open-source Mixture-of-Experts vision-language model that delivers strong multimodal reasoning and long-context understanding while activating only 2.8B parameters. It performs well across tasks such as multi-turn interactions, video comprehension, and mathematical reasoning, competing with much larger models while staying efficient. The latest variant, Kimi-VL-A3B-Thinking-2506, further improves reasoning and visual perception, achieving state-of-the-art results on several benchmarks.
SmolVLA is a compact, open-source Vision-Language-Action model for robotics that runs on consumer hardware and is trained on community-shared datasets. It outperforms much larger models on both simulation and real-world tasks while delivering faster response times through asynchronous inference. Its lightweight architecture and efficient training recipe aim to democratize access to advanced robotics capabilities.