9 links tagged with all of: multimodal + open-source
Links
R-4B is a multimodal large language model that performs general-purpose auto-thinking, dynamically switching between thinking and non-thinking modes based on task complexity. A two-stage training approach improves response efficiency and reduces computational cost, yielding state-of-the-art performance among comparable models. The model is open-source and gives users control over its thinking behavior.
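A minimal sketch of what the mode switch could look like in practice, assuming the model is served from the Hugging Face Hub and exposes the switch through its chat template; the repo ID and the `thinking_mode` argument are assumptions, so check the model card for the actual names and values.

```python
# Hypothetical usage sketch, not R-4B's documented API: the repo ID, the
# `thinking_mode` flag, and the decode call are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "YannQi/R-4B"  # assumed repo ID
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": [
    {"type": "image", "url": "https://example.com/chart.png"},
    {"type": "text", "text": "What trend does this chart show?"},
]}]

# "auto" would let the model decide per query whether to emit an explicit
# reasoning trace or answer directly; other values would force one mode.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt", thinking_mode="auto",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```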
The paper presents BLIP3-o, a family of fully open unified multimodal models that support both image understanding and generation. It introduces a diffusion transformer for generating CLIP image features, advocates a sequential pretraining strategy, and contributes a high-quality dataset, BLIP3o-60k, that improves performance across a range of benchmarks. The models, code, and datasets are open-sourced to foster further research.
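To make the core generation idea concrete, here is a toy sketch of training a drastically simplified diffusion-style model to produce CLIP image features from a text condition rather than denoising pixels; the shapes, noising path, and regression objective are illustrative assumptions, not the paper's configuration.

```python
# Toy illustration of diffusion over CLIP image features (not BLIP3-o's code).
import torch
import torch.nn as nn

class FeatureDiT(nn.Module):
    """Tiny stand-in for a diffusion transformer over CLIP image features."""
    def __init__(self, feat_dim=768, text_dim=768, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + text_dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, noisy_feat, text_emb, t):
        # Predict the clean CLIP feature from the noisy feature, text condition,
        # and noise level.
        return self.net(torch.cat([noisy_feat, text_emb, t], dim=-1))

model = FeatureDiT()
clip_feat = torch.randn(8, 768)   # target CLIP image features (from a frozen encoder)
text_emb = torch.randn(8, 768)    # text condition
t = torch.rand(8, 1)              # noise level in [0, 1]
noise = torch.randn_like(clip_feat)
noisy = (1 - t) * clip_feat + t * noise  # simple linear noising path

loss = nn.functional.mse_loss(model(noisy, text_emb, t), clip_feat)
loss.backward()  # at inference, predicted features would go to an image decoder
```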
LMEval, an open-source framework from Google, simplifies evaluating large language models across providers, offering multi-provider compatibility, incremental evaluation, and multimodal support. Features such as a self-encrypting database and an interactive visualization tool, LMEvalboard, streamline benchmarking so developers and researchers can assess model performance efficiently.
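The incremental-evaluation idea can be illustrated with a generic skeleton that only runs the (model, benchmark) pairs missing from a results store; this is a sketch of the concept, not LMEval's actual API, and `evaluate` is a hypothetical placeholder.

```python
# Generic incremental-evaluation skeleton (concept illustration, not LMEval's API).
import json
from pathlib import Path

RESULTS = Path("results.json")  # stand-in for a persistent results store

def load_results() -> dict:
    return json.loads(RESULTS.read_text()) if RESULTS.exists() else {}

def evaluate(model: str, benchmark: str) -> float:
    # Hypothetical placeholder: call the provider's API and score the benchmark.
    raise NotImplementedError

def run_incremental(models: list[str], benchmarks: list[str]) -> dict:
    results = load_results()
    for model in models:
        for bench in benchmarks:
            key = f"{model}::{bench}"
            if key in results:  # already evaluated -> skip, don't re-spend API calls
                continue
            results[key] = evaluate(model, bench)
            RESULTS.write_text(json.dumps(results, indent=2))  # checkpoint each run
    return results
```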
Mini-o3 is a system for tool-based visual reasoning that supports deep, multi-turn interaction and achieves state-of-the-art performance on visual search tasks. It pairs an over-turn masking strategy, which keeps response lengths manageable during reinforcement learning, with a dataset built for exploratory reasoning. Open-source code and models are provided for reproducibility and further research.
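A rough sketch of what an over-turn masking rule can look like in a policy-gradient loss: rollouts that hit the turn cap without producing a final answer are excluded from the loss rather than penalized, so long exploratory trajectories are not discouraged. The names, reward scheme, and loss form are illustrative assumptions, not Mini-o3's implementation.

```python
# Conceptual over-turn masking in a policy-gradient loss (illustrative only).
import torch

def policy_loss(logprobs, advantages, num_turns, finished, max_turns=16):
    """logprobs/advantages: per-trajectory values; num_turns: turns consumed;
    finished: whether a final answer was produced within the turn budget."""
    over_turn = (num_turns >= max_turns) & (~finished)  # ran out of turns, no answer
    mask = (~over_turn).float()                         # 1 = keep, 0 = drop from loss
    weighted = -(logprobs * advantages) * mask
    return weighted.sum() / mask.sum().clamp(min=1.0)   # normalize over kept rollouts

# Example: four rollouts; the last one exceeded the turn cap and is masked out.
logprobs = torch.tensor([-1.2, -0.8, -1.5, -2.0], requires_grad=True)
advantages = torch.tensor([0.5, -0.3, 0.9, -1.0])
num_turns = torch.tensor([3, 5, 7, 16])
finished = torch.tensor([True, True, True, False])
policy_loss(logprobs, advantages, num_turns, finished).backward()
```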
HunyuanImage-3.0 has been released as an open-source image generation model built on a unified multimodal architecture that integrates text and image understanding. Billed as the largest open-source image-generation Mixture-of-Experts model at 80 billion parameters, it delivers strong generation quality and supports customization through multiple checkpoints and performance optimizations.
Kimi-VL is an open-source Mixture-of-Experts vision-language model that excels at multimodal reasoning and long-context understanding with only 2.8B activated parameters. It performs strongly on tasks such as multi-turn interaction, video comprehension, and mathematical reasoning, competing with much larger models while remaining efficient. The latest variant, Kimi-VL-A3B-Thinking-2506, further improves reasoning and visual perception, achieving state-of-the-art results on several benchmarks.
InternVL3.5 introduces a new family of open-source multimodal models with improved versatility, reasoning ability, and inference efficiency. A key innovation is the Cascade Reinforcement Learning framework, which substantially improves performance on reasoning tasks, while a Visual Resolution Router dynamically adjusts visual token resolution. The models achieve notable gains, support advanced capabilities such as GUI interaction and embodied agency, and are positioned competitively against leading commercial models.
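As a rough illustration of the routing idea, the sketch below has a small scorer decide, per image region, whether to keep full-resolution visual tokens or pool them into fewer tokens before they reach the language model; this is a generic toy, not InternVL3.5's actual Visual Resolution Router.

```python
# Toy resolution router: keep detail only where a learned scorer says it matters.
import torch
import torch.nn as nn

class ResolutionRouter(nn.Module):
    def __init__(self, dim=1024):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # predicts how much detail a region needs

    def forward(self, tokens, threshold=0.5):
        # tokens: (regions, tokens_per_region, dim)
        need_detail = torch.sigmoid(self.scorer(tokens.mean(dim=1))).squeeze(-1) > threshold
        kept = []
        for region, detailed in zip(tokens, need_detail):
            # Detailed regions keep all tokens; the rest are 4x average-pooled.
            kept.append(region if detailed else region.view(-1, 4, region.shape[-1]).mean(dim=1))
        return torch.cat(kept, dim=0)  # shorter visual sequence for "easy" regions

router = ResolutionRouter()
visual_tokens = torch.randn(6, 16, 1024)  # 6 regions, 16 tokens each
print(router(visual_tokens).shape)        # somewhere between 24 and 96 tokens total
```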
Command A Vision is a state-of-the-art vision-language model designed for business applications, excelling in multimodal tasks such as document OCR and image analysis. With a 112B parameter architecture, it outperforms competitors like GPT-4.1 and Llama 4 Maverick on various benchmarks, making it a powerful tool for enterprises seeking to automate processes and enhance decision-making. The model is available with open weights for community use.
ScreenCoder is an advanced UI-to-code generation system that converts screenshots or design mockups into production-ready HTML/CSS code using a modular multi-agent architecture. It facilitates easy customization and rapid prototyping, bridging the gap between design and development. The project includes a demo, benchmark dataset, and detailed instructions for setup and usage.
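A rough sketch of how a modular multi-agent UI-to-code pipeline like the one described could be organized: one agent grounds UI components in the screenshot, one plans the layout, and one emits HTML/CSS. The agent split, the prompts, and the `call_llm` helper are hypothetical, not ScreenCoder's actual code.

```python
# Hypothetical multi-agent UI-to-code pipeline (illustration, not ScreenCoder's code).
from typing import Optional

def call_llm(prompt: str, image_path: Optional[str] = None) -> str:
    """Placeholder for whatever vision-language model backs each agent."""
    raise NotImplementedError("wire up a model provider here")

class UIToCodePipeline:
    def ground(self, screenshot: str) -> str:
        # Agent 1: detect UI components (buttons, nav bars, inputs) and their boxes.
        return call_llm("List the UI components and their bounding boxes.", screenshot)

    def plan(self, components: str) -> str:
        # Agent 2: turn the detected components into a hierarchical layout plan.
        return call_llm(f"Produce a layout tree for these components:\n{components}")

    def generate(self, layout: str) -> str:
        # Agent 3: render the layout plan as HTML/CSS.
        return call_llm(f"Write HTML/CSS implementing this layout:\n{layout}")

    def run(self, screenshot: str) -> str:
        return self.generate(self.plan(self.ground(screenshot)))

# Usage: UIToCodePipeline().run("design_mockup.png") -> HTML/CSS string
```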