9 links tagged with all of: multimodal + open-source
Links
R-4B is a multimodal large language model that performs general-purpose auto-thinking, dynamically switching between thinking and non-thinking modes based on task complexity. A two-stage training approach improves response efficiency and reduces computational cost, yielding state-of-the-art performance among comparable models. The model is open-source and gives users control over its thinking behavior.
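A minimal sketch of what the mode switch could look like in practice, assuming the model is served from the Hugging Face Hub and exposes the switch through its chat template; the repo ID and the `thinking_mode` argument are assumptions, so check the model card for the actual names and values.

```python
# Hypothetical usage sketch, not R-4B's documented API: the repo ID, the
# `thinking_mode` flag, and the decode call are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "YannQi/R-4B"  # assumed repo ID
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": [
    {"type": "image", "url": "https://example.com/chart.png"},
    {"type": "text", "text": "What trend does this chart show?"},
]}]

# "auto" would let the model decide per query whether to emit an explicit
# reasoning trace or answer directly; other values would force one mode.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt", thinking_mode="auto",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```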
The paper presents BLIP3-o, a family of fully open unified multimodal models that support both image understanding and generation. It introduces a diffusion transformer for generating CLIP image features, advocates a sequential pretraining strategy, and contributes a high-quality dataset, BLIP3o-60k, that improves performance across a range of benchmarks. The models, code, and datasets are open-sourced to foster further research.
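To make the core generation idea concrete, here is a toy sketch of training a drastically simplified diffusion-style model to produce CLIP image features from a text condition rather than denoising pixels; the shapes, noising path, and regression objective are illustrative assumptions, not the paper's configuration.

```python
# Toy illustration of diffusion over CLIP image features (not BLIP3-o's code).
import torch
import torch.nn as nn

class FeatureDiT(nn.Module):
    """Tiny stand-in for a diffusion transformer over CLIP image features."""
    def __init__(self, feat_dim=768, text_dim=768, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + text_dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, noisy_feat, text_emb, t):
        # Predict the clean CLIP feature from the noisy feature, text condition,
        # and noise level.
        return self.net(torch.cat([noisy_feat, text_emb, t], dim=-1))

model = FeatureDiT()
clip_feat = torch.randn(8, 768)   # target CLIP image features (from a frozen encoder)
text_emb = torch.randn(8, 768)    # text condition
t = torch.rand(8, 1)              # noise level in [0, 1]
noise = torch.randn_like(clip_feat)
noisy = (1 - t) * clip_feat + t * noise  # simple linear noising path

loss = nn.functional.mse_loss(model(noisy, text_emb, t), clip_feat)
loss.backward()  # at inference, predicted features would go to an image decoder
```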
LMEval, an open-source framework from Google, simplifies evaluating large language models across providers, offering multi-provider compatibility, incremental evaluation, and multimodal support. Features such as a self-encrypting database and an interactive visualization tool, LMEvalboard, streamline benchmarking so developers and researchers can assess model performance efficiently.
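The incremental-evaluation idea can be illustrated with a generic skeleton that only runs the (model, benchmark) pairs missing from a results store; this is a sketch of the concept, not LMEval's actual API, and `evaluate` is a hypothetical placeholder.

```python
# Generic incremental-evaluation skeleton (concept illustration, not LMEval's API).
import json
from pathlib import Path

RESULTS = Path("results.json")  # stand-in for a persistent results store

def load_results() -> dict:
    return json.loads(RESULTS.read_text()) if RESULTS.exists() else {}

def evaluate(model: str, benchmark: str) -> float:
    # Hypothetical placeholder: call the provider's API and score the benchmark.
    raise NotImplementedError

def run_incremental(models: list[str], benchmarks: list[str]) -> dict:
    results = load_results()
    for model in models:
        for bench in benchmarks:
            key = f"{model}::{bench}"
            if key in results:  # already evaluated -> skip, don't re-spend API calls
                continue
            results[key] = evaluate(model, bench)
            RESULTS.write_text(json.dumps(results, indent=2))  # checkpoint each run
    return results
```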
Mini-o3 is a system for tool-based visual reasoning that supports deep, multi-turn interaction and achieves state-of-the-art performance on visual search tasks. It pairs an over-turn masking strategy, which keeps response lengths manageable during reinforcement learning, with a dataset built for exploratory reasoning. Open-source code and models are provided for reproducibility and further research.
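A rough sketch of what an over-turn masking rule can look like in a policy-gradient loss: rollouts that hit the turn cap without producing a final answer are excluded from the loss rather than penalized, so long exploratory trajectories are not discouraged. The names, reward scheme, and loss form are illustrative assumptions, not Mini-o3's implementation.

```python
# Conceptual over-turn masking in a policy-gradient loss (illustrative only).
import torch

def policy_loss(logprobs, advantages, num_turns, finished, max_turns=16):
    """logprobs/advantages: per-trajectory values; num_turns: turns consumed;
    finished: whether a final answer was produced within the turn budget."""
    over_turn = (num_turns >= max_turns) & (~finished)  # ran out of turns, no answer
    mask = (~over_turn).float()                         # 1 = keep, 0 = drop from loss
    weighted = -(logprobs * advantages) * mask
    return weighted.sum() / mask.sum().clamp(min=1.0)   # normalize over kept rollouts

# Example: four rollouts; the last one exceeded the turn cap and is masked out.
logprobs = torch.tensor([-1.2, -0.8, -1.5, -2.0], requires_grad=True)
advantages = torch.tensor([0.5, -0.3, 0.9, -1.0])
num_turns = torch.tensor([3, 5, 7, 16])
finished = torch.tensor([True, True, True, False])
policy_loss(logprobs, advantages, num_turns, finished).backward()
```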
HunyuanImage-3.0 has been released as an open-source image generation model built on a unified multimodal architecture that integrates text and image understanding. Billed as the largest open-source image-generation Mixture-of-Experts model at 80 billion parameters, it delivers strong generation quality and supports customization through multiple checkpoints and performance optimizations.
Kimi-VL is an open-source Mixture-of-Experts vision-language model that excels at multimodal reasoning and long-context understanding with only 2.8B activated parameters. It performs strongly on tasks such as multi-turn interaction, video comprehension, and mathematical reasoning, competing with much larger models while remaining efficient. The latest variant, Kimi-VL-A3B-Thinking-2506, further improves reasoning and visual perception, achieving state-of-the-art results on several benchmarks.
InternVL3.5 introduces a new family of open-source multimodal models with improved versatility, reasoning ability, and inference efficiency. A key innovation is the Cascade Reinforcement Learning framework, which substantially improves performance on reasoning tasks, while a Visual Resolution Router dynamically adjusts visual token resolution. The models achieve notable gains, support advanced capabilities such as GUI interaction and embodied agency, and are positioned competitively against leading commercial models.
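As a rough illustration of the routing idea, the sketch below has a small scorer decide, per image region, whether to keep full-resolution visual tokens or pool them into fewer tokens before they reach the language model; this is a generic toy, not InternVL3.5's actual Visual Resolution Router.

```python
# Toy resolution router: keep detail only where a learned scorer says it matters.
import torch
import torch.nn as nn

class ResolutionRouter(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # predicts how much detail a region needs

    def forward(self, tokens, threshold=0.5):
        # tokens: (regions, tokens_per_region, dim)
        need_detail = torch.sigmoid(self.scorer(tokens.mean(dim=1))).squeeze(-1) > threshold
        kept = []
        for region, detailed in zip(tokens, need_detail):
            # Detailed regions keep all tokens; the rest are 4x average-pooled.
            kept.append(region if detailed else region.view(-1, 4, region.shape[-1]).mean(dim=1))
        return torch.cat(kept, dim=0)  # shorter visual sequence for "easy" regions

router = ResolutionRouter()
visual_tokens = torch.randn(6, 16, 1024)  # 6 regions, 16 tokens each
print(router(visual_tokens).shape)        # somewhere between 24 and 96 tokens total
```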
Command A Vision is a state-of-the-art vision-language model designed for business applications, excelling in multimodal tasks such as document OCR and image analysis. With a 112B parameter architecture, it outperforms competitors like GPT-4.1 and Llama 4 Maverick on various benchmarks, making it a powerful tool for enterprises seeking to automate processes and enhance decision-making. The model is available with open weights for community use.
ScreenCoder is an advanced UI-to-code generation system that converts screenshots or design mockups into production-ready HTML/CSS code using a modular multi-agent architecture. It facilitates easy customization and rapid prototyping, bridging the gap between design and development. The project includes a demo, benchmark dataset, and detailed instructions for setup and usage.
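A rough sketch of how a modular multi-agent UI-to-code pipeline like the one described could be organized: one agent grounds UI components in the screenshot, one plans the layout, and one emits HTML/CSS. The agent split, the prompts, and the `call_llm` helper are hypothetical, not ScreenCoder's actual code.

```python
# Hypothetical multi-agent UI-to-code pipeline (illustration, not ScreenCoder's code).
from typing import Optional

def call_llm(prompt: str, image_path: Optional[str] = None) -> str:
    """Placeholder for whatever vision-language model backs each agent."""
    raise NotImplementedError("wire up a model provider here")

class UIToCodePipeline:
    def ground(self, screenshot: str) -> str:
        # Agent 1: detect UI components (buttons, nav bars, inputs) and their boxes.
        return call_llm("List the UI components and their bounding boxes.", screenshot)

    def plan(self, components: str) -> str:
        # Agent 2: turn the detected components into a hierarchical layout plan.
        return call_llm(f"Produce a layout tree for these components:\n{components}")

    def generate(self, layout: str) -> str:
        # Agent 3: render the layout plan as HTML/CSS.
        return call_llm(f"Write HTML/CSS implementing this layout:\n{layout}")

    def run(self, screenshot: str) -> str:
        return self.generate(self.plan(self.ground(screenshot)))

# Usage: UIToCodePipeline().run("design_mockup.png") -> HTML/CSS string
```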