Ollama has introduced a new engine that supports multimodal models, with an emphasis on improved accuracy, model modularity, and memory management. The update enables tighter integration of vision and text models, strengthening local inference for applications such as image recognition and visual reasoning. Future development will focus on supporting longer context sizes and enabling more advanced functionality.
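To make the vision-plus-text workflow concrete, here is a minimal sketch of querying a multimodal model through Ollama's Python client. The model tag `llama3.2-vision` and the image path are placeholders (assumptions, not part of the announcement); any vision-capable model pulled into a local Ollama install should work the same way.

```python
# Minimal sketch: ask a locally served vision model about an image via the
# `ollama` Python client. Assumes the Ollama server is running and a
# vision-capable model has been pulled (the tag below is an example).
import ollama

response = ollama.chat(
    model="llama3.2-vision",  # assumed vision-capable model tag
    messages=[
        {
            "role": "user",
            "content": "Describe what is happening in this image.",
            "images": ["./example.jpg"],  # local image path sent alongside the text prompt
        }
    ],
)

print(response["message"]["content"])
```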
R-4B is a multimodal large language model built for general-purpose auto-thinking: it dynamically switches between thinking and non-thinking modes based on task complexity. A two-stage training approach improves response efficiency and reduces computational cost, and the model achieves state-of-the-art performance among comparable models. R-4B is open source and gives users direct control over its thinking behavior.
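As a rough illustration of that user-facing mode control, the sketch below loads the model with Hugging Face transformers and requests automatic mode selection. The repository id `YannQi/R-4B` and the `thinking_mode` chat-template argument are assumptions for illustration only; consult the model card for the exact interface.

```python
# Minimal sketch of toggling R-4B's thinking behavior from Python.
# Assumptions: the model is hosted on Hugging Face under "YannQi/R-4B" and its
# chat template accepts a `thinking_mode` argument ("auto" / "thinking" /
# "non-thinking"); both names are illustrative, not confirmed by the source.
import torch
from transformers import AutoModel, AutoProcessor

MODEL_ID = "YannQi/R-4B"  # assumed repository id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, trust_remote_code=True
).to("cuda").eval()

messages = [{"role": "user", "content": "How many prime numbers are below 30?"}]

# "auto" lets the model decide whether the query warrants an explicit
# reasoning trace; "non-thinking" would force a direct answer instead.
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    thinking_mode="auto",  # assumed template argument for mode control
)

inputs = processor(text=prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=512)

print(processor.decode(
    output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
))
```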
The repository implements the method from "Learning Compact Vision Tokens for Efficient Large Multimodal Models," which improves inference efficiency by fusing spatially adjacent vision tokens through a Multi-Block Token Fusion module. Experimental results show that the approach achieves competitive performance on a range of vision-language benchmarks while using only 25% of the baseline vision tokens.
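The PyTorch sketch below illustrates the underlying idea under simplifying assumptions (a 2x2 fusion block and a single linear projection): spatially adjacent vision tokens are concatenated and projected into one compact token, cutting the token count to a quarter of the baseline. The actual Multi-Block Token Fusion module is more involved; this is only a conceptual sketch.

```python
# Conceptual sketch: fuse each 2x2 block of spatially adjacent vision tokens
# into a single token, reducing the token count to 25% of the baseline.
# The block size and the linear fusion layer are simplifying assumptions.
import torch
import torch.nn as nn


class BlockTokenFusion(nn.Module):
    def __init__(self, dim: int, block: int = 2):
        super().__init__()
        self.block = block
        # Project the concatenated block of tokens back to the original width.
        self.fuse = nn.Linear(dim * block * block, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, H*W, dim) -- vision tokens laid out on an H x W grid
        b, n, d = tokens.shape
        h = w = int(n ** 0.5)  # assumes a square token grid
        x = tokens.view(b, h, w, d)
        # Group each block x block neighborhood into one long vector ...
        x = x.view(b, h // self.block, self.block, w // self.block, self.block, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(
            b, (h // self.block) * (w // self.block), -1
        )
        # ... and fuse it into a single compact token.
        return self.fuse(x)


if __name__ == "__main__":
    vision_tokens = torch.randn(1, 576, 1024)   # e.g. a 24x24 grid from a ViT encoder
    fused = BlockTokenFusion(dim=1024)(vision_tokens)
    print(fused.shape)                          # torch.Size([1, 144, 1024]) -> 25% of the tokens
```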