Quit Emailing Yourself

4 links tagged with all of: multimodal + reinforcement-learning

Click any tag below to further narrow down your results

Links

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Mini-o3 introduces an advanced system that enhances tool-based interactions for visual reasoning by supporting deep, multi-turn reasoning and achieving state-of-the-art performance on visual search tasks. The system utilizes a novel over-turn masking strategy to effectively manage response lengths during reinforcement learning, combined with a comprehensive dataset designed for exploratory reasoning. Open-source code and models are provided to facilitate reproducibility and further research.

Saved by tldr-importer · Last saved October 29, 2025 · 6 min read

+ visual-search multimodal ✓ reinforcement-learning ✓ + open-source + dataset

GitHub - OpenGVLab/VideoChat-R1: [NIPS2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning

The VideoChat-R1.5 model has been released on Huggingface, showcasing improved capabilities in spatio-temporal perception and reasoning through multi-task joint reinforcement learning. It has been accepted at NIPS2025 and builds on previous versions, enhancing video reasoning across various applications. The model utilizes hierarchical human attention during inference for better localization of regions of interest in videos.

Saved by tldr-importer · Last saved October 29, 2025 · 1 min read

+ video-chat reinforcement-learning ✓ + spatio-temporal multimodal ✓ + nips2025

GitHub - VARGPT-family/VARGPT-v1.1: VARGPT-v1.1: Improve Visual Autoregressive Large Unified Model via Iterative Instruction Tuning and Reinforcement Learning

VARGPT-v1.1 is a powerful multimodal model that enhances visual understanding and generation capabilities through iterative instruction tuning and reinforcement learning. It includes extensive code releases for training, inference, and evaluation, as well as a comprehensive structure for multimodal tasks such as image captioning and visual question answering. The model's checkpoints and datasets are available on Hugging Face, facilitating further research and application development.

Saved by tldr-importer · Last saved October 29, 2025 · 4 min read

+ vargpt multimodal ✓ reinforcement-learning ✓ + image-generation + visual-understanding

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

InternVL3.5 introduces a new family of open-source multimodal models that enhance versatility, reasoning capabilities, and inference efficiency. A key innovation is the Cascade Reinforcement Learning framework, which improves reasoning tasks significantly while a Visual Resolution Router optimizes visual token resolution. The model achieves notable performance gains and supports advanced capabilities like GUI interaction and embodied agency, positioning it competitively against leading commercial models.

Saved by tldr-importer · Last saved October 29, 2025 · 2 min read

multimodal ✓ + reasoning reinforcement-learning ✓ + open-source + computer-vision