52 links tagged with multimodal
Links
Liquid is an innovative auto-regressive model that integrates visual comprehension and generation by tokenizing images into discrete codes and learning them alongside text tokens. This multimodal large language model operates within a shared feature space, allowing for seamless understanding and generation without relying on external visual embeddings. Liquid is available in multiple sizes and explores the scaling laws of multimodal models, revealing mutual benefits between understanding and generation tasks.
Daft is a distributed query engine designed for large-scale data processing using Python or SQL, built with Rust. It offers a familiar interactive API, powerful query optimization, and seamless integration with data catalogs and multimodal types, making it suitable for complex data operations in cloud environments. Daft supports interactive and distributed computing, allowing users to efficiently handle diverse data types and perform operations across large clusters.
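A minimal sketch of Daft's interactive Python API, assuming the `daft` package is installed; the column names and values are placeholders:

```python
# Minimal Daft sketch: build a DataFrame, filter it, and materialize the result.
# Assumes `pip install daft`; all column names and values are placeholders.
import daft

df = daft.from_pydict({
    "url": ["a.jpg", "b.png", "c.jpg"],
    "size_kb": [120, 340, 95],
})

result = (
    df.where(df["size_kb"] > 100)   # filters are pushed down by the query optimizer
      .select("url", "size_kb")
      .collect()                    # execution is lazy until collect()/show()
)
result.show()
```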
SpatialScore introduces a comprehensive benchmark for evaluating multimodal large language models (MLLMs) in spatial understanding, consisting of the VGBench dataset and an extensive collection of 28K samples. It features the SpatialAgent, a multi-agent system designed for enhanced spatial reasoning, and reveals persistent challenges and improvements in spatial tasks through quantitative and qualitative evaluations.
Salesforce discusses the development of real-time multimodal AI pipelines capable of processing up to 50 million file uploads daily. The article highlights the challenges and solutions involved in scaling file processing to meet the demands of modern data workflows. Key techniques and technologies that enable efficient processing are also emphasized.
Ollama has introduced a new engine that supports multimodal models, emphasizing improved accuracy, model modularity, and memory management. The update allows for better integration of vision and text models, enhancing the capabilities of local inference for various applications, including image recognition and reasoning. Future developments will focus on supporting longer context sizes and enabling advanced functionalities.
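A rough sketch of passing an image to a locally served vision model through the official `ollama` Python client; the model name and image path are placeholders, and any multimodal model pulled locally should work:

```python
# Sketch: send an image plus a text prompt to a local multimodal model via
# the `ollama` Python client (pip install ollama). Model name and image path
# are placeholders.
import ollama

response = ollama.chat(
    model="llama3.2-vision",
    messages=[{
        "role": "user",
        "content": "What is in this picture?",
        "images": ["./photo.jpg"],   # local file path to the image
    }],
)
print(response["message"]["content"])
```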
Google has launched two new models in the Gemini family, Gemini 2.5 Pro and Gemini 2.5 Flash, which significantly enhance video understanding capabilities. The Pro model achieves state-of-the-art performance in various benchmarks and enables innovative applications like interactive learning tools and dynamic animations from video content. Both models facilitate advanced video processing and offer cost-effective solutions for diverse use cases in education and content creation.
KGMEL is a novel framework for multimodal entity linking that enhances the alignment of textual mentions with knowledge base entities by incorporating knowledge graph (KG) triples. It operates in three stages: generating high-quality triples, learning joint representations through contrastive learning, and refining candidate entities using large language models. Experimental results show that KGMEL outperforms existing methods in accuracy and efficiency.
Google Search has introduced a significant update to its AI Mode, allowing users to conduct visual searches more naturally by asking questions conversationally or uploading images. This update enhances shopping experiences by providing relevant visual results based on user descriptions, supported by a robust Shopping Graph that refreshes product listings frequently. The new features leverage advanced visual understanding and multimodal capabilities to refine search results and improve user engagement.
Gemini Robotics 1.5 introduces advanced AI models that enable robots to perceive, plan, and execute complex tasks in the physical world. The models enhance a robot's ability to reason, learn across different embodiments, and interact naturally, marking a significant step towards achieving artificial general intelligence (AGI) in robotics. Developers can access these capabilities through the Gemini API in Google AI Studio.
OmniSVG is a unified framework for generating high-quality scalable vector graphics (SVG) using pre-trained Vision-Language Models (VLMs), which decouples structural logic from low-level geometry. It introduces the MMSVG-2M dataset with two million annotated SVG assets and supports multiple generation modalities, demonstrating superior performance over existing methods for diverse creative tasks. The model is designed to handle complexity ranging from simple icons to intricate illustrations, offering flexibility for professional design workflows.
User interfaces (UI) are not disappearing due to advancements in AI; instead, they are evolving and becoming more essential for effective interaction. AI is driving innovation in UI design, leading to multimodal experiences and hyper-personalization that enhance user engagement and accessibility. The future of UX will involve AI working in tandem with UI, providing users with intuitive controls and feedback rather than relying solely on text or voice interfaces.
OmDet-Turbo is a real-time open-vocabulary object detection model that integrates components from RT-DETR and features an Efficient Fusion Head for enhanced performance. It achieves impressive results with up to 100.2 FPS and 53.4 AP on COCO zero-shot, making it suitable for industrial applications that require rapid and accurate detection capabilities. The model's unique architecture allows for efficient text embedding caching, improving the decoding process for object detection tasks.
R-4B is a multimodal large language model that enhances general-purpose auto-thinking by dynamically switching between thinking and non-thinking modes based on task complexity. It employs a two-stage training approach to improve response efficiency and reduce computational costs, achieving state-of-the-art performance among similar models. The model is open-source and offers user control over its thinking capabilities.
OmniCaptioner is a versatile visual captioning framework designed to generate detailed textual descriptions across various visual domains, including natural images, visual text, and structured visuals. It enhances visual reasoning with large language models (LLMs), improves image generation tasks, and allows for efficient supervised fine-tuning by converting pixel data into rich semantic representations. The framework aims to bridge the gap between visual and textual modalities through a unified multimodal pretraining approach.
Daily-Omni is introduced as a new benchmark for audio-visual reasoning, featuring 684 videos and 1197 QA pairs across various tasks. The study highlights the challenges faced by current multimodal large language models in integrating audio and visual information, while demonstrating that combining visual and audio models with temporal alignment techniques can enhance performance. The paper also presents a QA generation pipeline to improve efficiency and scalability in evaluation.
The article discusses the integration of multimodal large language models (LLMs) into various applications, highlighting their ability to process and generate content across different modalities such as text, images, and audio. It emphasizes the advancements in model architectures and training techniques that enhance the performance and versatility of these models in real-world scenarios. Additionally, the piece explores potential use cases and the impact of multimodal capabilities on industries and user interactions.
DoorDash has launched robot deliveries in Los Angeles and Chicago through a partnership with Coco Robotics, allowing eligible customers to receive deliveries from over 600 merchants. This initiative is part of DoorDash's strategy to incorporate multimodal delivery options, which include human workers, drones, and autonomous robots, aiming to reduce costs and environmental impact.
Google has released updated versions of the Gemini 2.5 Flash and Flash-Lite models, enhancing quality and efficiency with significant reductions in output tokens and improved capabilities in instruction following, conciseness, and multimodal functions. The updates aim to facilitate better performance in complex applications while allowing users to easily access the latest models through new aliases.
The article focuses on multimodal data analytics, emphasizing its significance in extracting insights from various types of data sources, such as text, images, and audio. It provides practical guidance on methodologies and tools that can be employed to leverage multimodal data for enhanced decision-making and predictive analytics. The content underscores the importance of integrating different modalities to improve the accuracy and depth of data analysis.
Google has introduced AI Mode in Search, enhancing user experience with advanced reasoning and multimodal capabilities, allowing for deeper inquiries and personalized responses. The new features include Deep Search for thorough research, live interaction with visual search, agentic capabilities for task management, and tailored suggestions based on user context. These updates aim to transform Google Search from a mere information tool to a comprehensive intelligence platform.
MingTok introduces the first continuous unified tokenizer for vision, enabling seamless integration of image understanding and generation within a single framework. This innovation leads to 3.5x faster convergence by aligning semantic understanding and generative dynamics, allowing for efficient multi-turn interactions without the costly detours seen in previous models. Ming-UniVision, built on MingTok, effectively harmonizes these tasks, paving the way for more intuitive multimodal AI systems.
Multimodal learning faces challenges when modalities differ between development and deployment due to various factors, including perceived informativeness and missing data. The framework ICYM2I (In Case You Multimodal Missed It) is introduced to address biases in estimating information gain from modalities under missingness, using inverse probability weighting-based correction. The effectiveness of this approach is demonstrated through synthetic and real-world medical datasets.
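The paper's estimator is not reproduced here, but a generic inverse-probability-weighting correction of the kind described might look like the following sketch (synthetic data, hypothetical variable names):

```python
# Generic inverse probability weighting (IPW) sketch, not the paper's exact
# estimator: reweight samples where the second modality was observed by the
# inverse of their estimated probability of being observed, so that quantities
# computed on the modality-complete subset better reflect the full population.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
x_text = rng.normal(size=(n, 5))                       # always-available modality features
logit = 2.0 * x_text[:, 1]                             # missingness depends on the same
observed = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))   # feature that drives the outcome
outcome = (x_text[:, 1] + rng.normal(size=n) > 0).astype(int)

# 1) Model the probability that the extra modality is observed.
propensity = LogisticRegression().fit(x_text, observed).predict_proba(x_text)[:, 1]

# 2) Weight observed samples by 1 / p(observed | x) when estimating the outcome
#    rate on the modality-complete subset.
mask = observed == 1
naive_estimate = outcome[mask].mean()
ipw_estimate = np.average(outcome[mask], weights=1.0 / propensity[mask])
print(f"naive={naive_estimate:.3f}  ipw={ipw_estimate:.3f}")
```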
The paper presents BLIP3-o, a family of fully open unified multimodal models that enhance both image understanding and generation. It introduces a diffusion transformer for generating CLIP image features, advocates for a sequential pretraining strategy, and proposes a high-quality dataset, BLIP3o-60k, to improve performance across various benchmarks. The models, along with code and datasets, are open-sourced to foster further research.
AMIE, a multimodal conversational AI agent developed by Google DeepMind, has been enhanced to intelligently request and interpret visual medical information during clinical dialogues, emulating the structured history-taking of experienced clinicians. Evaluations show that AMIE can match or exceed primary care physicians in diagnostic accuracy and empathy while utilizing multimodal data effectively in simulated consultations. Ongoing research aims to further refine AMIE's capabilities using advanced models and assess its performance in real-world clinical settings.
Google DeepMind has unveiled the Gemini Robotics models, which enhance robots' capabilities to perform complex tasks through natural language understanding and dexterity. These multimodal models allow robots to adapt to various environments and instructions, paving the way for future applications in everyday life and industry. Carolina Parada emphasizes the potential of embodied AI to transform how robots assist with daily tasks.
Gemini models 2.5 Pro and Flash are revolutionizing robotics with advanced coding, reasoning, and multimodal capabilities, enhancing robots' spatial understanding. Developers can utilize these models and the Live API for applications such as semantic scene understanding, spatial reasoning, and interactive robotics, enabling robots to execute complex tasks through voice commands and code generation. The article highlights practical examples and the potential of Gemini's embodied reasoning model in various robotics applications.
The repository provides an implementation of the method "Learning Compact Vision Tokens for Efficient Large Multimodal Models," which enhances inference efficiency by fusing spatial-adjacent vision tokens and introducing a Multi-Block Token Fusion module. Experimental results show that this approach achieves competitive performance on various vision-language benchmarks while using only 25% of the baseline vision tokens.
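The repository's Multi-Block Token Fusion module is not reproduced here; as a minimal sketch of the underlying idea, spatially adjacent tokens can be fused by pooling each 2x2 neighborhood of the vision-token grid, cutting the token count to 25%:

```python
# Minimal sketch of spatial-adjacent vision-token fusion (not the repo's
# Multi-Block Token Fusion module): average-pool each 2x2 neighborhood of the
# ViT patch grid, reducing the token count to a quarter before the tokens are
# handed to the language model.
import torch

def fuse_adjacent_tokens(vision_tokens: torch.Tensor, grid: int) -> torch.Tensor:
    """vision_tokens: (batch, grid*grid, dim) -> (batch, (grid//2)**2, dim)."""
    b, n, d = vision_tokens.shape
    assert n == grid * grid and grid % 2 == 0
    x = vision_tokens.view(b, grid, grid, d).permute(0, 3, 1, 2)  # (b, d, grid, grid)
    x = torch.nn.functional.avg_pool2d(x, kernel_size=2)          # fuse 2x2 neighborhoods
    return x.flatten(2).transpose(1, 2)                           # back to (b, n/4, d)

tokens = torch.randn(1, 24 * 24, 1024)              # e.g. a 24x24 ViT patch grid
print(fuse_adjacent_tokens(tokens, grid=24).shape)  # torch.Size([1, 144, 1024])
```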
Llama 4 introduces advanced multimodal intelligence capabilities that enhance user interactions by natively integrating text and images. The model aims to improve understanding and generation across modalities, making it more versatile for practical applications in AI. Key features include refined training techniques and a focus on user-centric design to facilitate more intuitive AI experiences.
Voxtral Mini and Voxtral Small are two multimodal audio chat models designed to understand both spoken audio and text. They achieve state-of-the-art performance on various audio benchmarks while maintaining strong text capabilities, with Voxtral Small being efficient enough for local deployment. The models include a 32K context window for processing lengthy audio and multi-turn conversations and come with three new benchmarks for evaluating speech understanding in knowledge and trivia.
HunyuanImage-3.0 has been released as an open-source image generation model, featuring a unified multimodal architecture that integrates text and image understanding. It is the largest open-source Mixture-of-Experts image generation model to date, with 80 billion total parameters, enabling superior image generation capabilities while supporting extensive customization through various checkpoints and performance optimizations.
Mini-o3 introduces an advanced system that enhances tool-based interactions for visual reasoning by supporting deep, multi-turn reasoning and achieving state-of-the-art performance on visual search tasks. The system utilizes a novel over-turn masking strategy to effectively manage response lengths during reinforcement learning, combined with a comprehensive dataset designed for exploratory reasoning. Open-source code and models are provided to facilitate reproducibility and further research.
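The paper's formulation is not reproduced here; one hedged reading of over-turn masking is that rollouts which exhaust the turn budget are excluded from the policy-gradient loss rather than penalized, roughly as in this sketch:

```python
# Rough sketch of an "over-turn masking" idea (a reading of the summary above,
# not the paper's exact formulation): trajectories that hit the turn budget are
# masked out of the policy-gradient loss instead of being treated as failures,
# so the policy is not discouraged from deep multi-turn reasoning.
import torch

def masked_policy_loss(logprobs, advantages, num_turns, max_turns):
    """logprobs, advantages, num_turns: (batch,) tensors, one entry per rollout."""
    keep = (num_turns <= max_turns).float()          # 0 for over-turn rollouts
    per_traj = -logprobs * advantages * keep
    return per_traj.sum() / keep.sum().clamp(min=1)  # average over kept rollouts only

loss = masked_policy_loss(
    logprobs=torch.tensor([-1.2, -0.8, -2.0]),
    advantages=torch.tensor([0.5, -0.3, 1.0]),
    num_turns=torch.tensor([4, 9, 6]),
    max_turns=8,
)
print(loss)
```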
LMEval, an open-source framework developed by Google, simplifies the evaluation of large language models across various providers by offering multi-provider compatibility, incremental evaluation, and multimodal support. With features like a self-encrypting database and an interactive visualization tool called LMEvalboard, it enhances the benchmarking process, making it easier for developers and researchers to assess model performance efficiently.
Google has introduced Gemma 3n, a new open model designed for optimized on-device AI performance, enabling real-time processing on mobile devices. Built on a cutting-edge architecture in collaboration with hardware leaders, Gemma 3n features advanced capabilities like multimodal understanding, improved multilingual support, and innovations that reduce memory usage. Developers can access a preview of this model now to start building efficient AI applications.
The VideoChat-R1.5 model has been released on Hugging Face, showcasing improved capabilities in spatio-temporal perception and reasoning through multi-task joint reinforcement learning. It has been accepted at NeurIPS 2025 and builds on previous versions, enhancing video reasoning across various applications. The model utilizes hierarchical human attention during inference for better localization of regions of interest in videos.
Kimi-VL is an open-source Mixture-of-Experts vision-language model that excels in multimodal reasoning and long-context understanding with only 2.8B activated parameters. It demonstrates superior performance in various tasks such as multi-turn interactions, video comprehension, and mathematical reasoning, competing effectively with larger models while maintaining efficiency. The latest variant, Kimi-VL-A3B-Thinking-2506, enhances reasoning and visual perception capabilities, achieving state-of-the-art results in several benchmarks.
Complete the intermediate course on implementing multimodal vector search with BigQuery, which takes 1 hour and 45 minutes. Participants will learn to use Gemini for SQL generation, conduct sentiment analysis, summarize text, generate embeddings, create a Retrieval Augmented Generation (RAG) pipeline, and perform multimodal vector searches.
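As a hedged illustration of the pattern the course covers, a vector search over precomputed embeddings can be issued from Python with the `google-cloud-bigquery` client; every project, dataset, table, and model name below is a placeholder, and the exact SQL may differ by setup:

```python
# Hedged sketch of a vector search over embeddings in BigQuery, run from Python
# with google-cloud-bigquery. All project, dataset, table, and model names are
# placeholders; the ML.GENERATE_EMBEDDING / VECTOR_SEARCH pattern mirrors what
# the course covers, but consult the BigQuery docs for exact syntax.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT base.uri, distance
FROM VECTOR_SEARCH(
  TABLE `my_project.my_dataset.image_embeddings`,   -- placeholder base table
  'embedding',
  (SELECT ml_generate_embedding_result AS embedding
   FROM ML.GENERATE_EMBEDDING(
     MODEL `my_project.my_dataset.embedding_model`,  -- placeholder model
     (SELECT 'red running shoes' AS content))),
  top_k => 5)
"""
for row in client.query(sql).result():
    print(row.uri, row.distance)
```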
Site owners and content creators are encouraged to focus on providing unique and valuable content to succeed in Google's AI search experiences. Key strategies include ensuring a great page experience, meeting technical requirements, managing visibility, and adapting to evolving user needs in search behavior. Emphasizing multimodal content and understanding visitor engagement are also crucial for maximizing the value of search traffic.
Meta's Llama 4 models, including Llama 4 Scout 17B and Llama 4 Maverick 17B, are now available in Amazon Bedrock as a serverless solution, offering advanced multimodal capabilities for applications. These models leverage a mixture-of-experts architecture to enhance performance and support a wide range of use cases, from enterprise applications to customer support and content creation. Users can easily integrate these models into their applications using the Amazon Bedrock Converse API.
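A short sketch of calling one of these models through the Converse API with boto3; the model ID is illustrative and depends on what is enabled in a given account and region:

```python
# Sketch of invoking a Llama 4 model through the Amazon Bedrock Converse API
# with boto3. The model ID is illustrative; check the Bedrock console for the
# identifiers enabled in your account and region.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    modelId="us.meta.llama4-scout-17b-instruct-v1:0",   # placeholder model ID
    messages=[{"role": "user", "content": [{"text": "Summarize our Q3 support tickets."}]}],
    inferenceConfig={"maxTokens": 512, "temperature": 0.3},
)
print(response["output"]["message"]["content"][0]["text"])
```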
Gemini 2.5 Pro Preview has been released ahead of schedule, featuring enhanced capabilities for coding and building interactive web apps. This update builds on positive feedback from the previous version, improving performance in UI development, code transformation, and multimodal reasoning, and now leads the WebDev Arena Leaderboard. Developers can access these features through the Gemini API and Google AI Studio.
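A minimal sketch using the `google-genai` Python SDK, which reads `GEMINI_API_KEY` from the environment; the model string is illustrative:

```python
# Minimal Gemini API sketch with the google-genai SDK (pip install google-genai).
# The client reads GEMINI_API_KEY from the environment; the model name below is
# illustrative and may differ for the preview release discussed above.
from google import genai

client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Generate a single-file HTML/JS demo of a bouncing-ball animation.",
)
print(response.text)
```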
Join Javier Hernandez in a webinar on April 24th to explore how HP's AI Studio utilizes multimodal large language models to analyze diverse medical data formats, including text, images, and audio. This session will cover the creation of real-world applications, challenges faced, and strategies for enhancing data-driven decision-making in medical research and diagnostics.
Qwen3-Omni is a cutting-edge multilingual omni-modal foundation model capable of processing text, images, audio, and video, providing real-time streaming responses. It features significant architectural advancements for performance, supports 119 text languages, and offers various applications through detailed cookbooks, including speech recognition, audio captioning, and video analysis. The model is available for use via Hugging Face and ModelScope, with recommendations for optimal performance.
OpenAI's latest models, o3 and o4-mini, enhance visual reasoning capabilities by enabling the integration of image processing within their chain-of-thought, allowing for more thorough analyses and problem-solving. These advancements significantly outperform previous models across various multimodal benchmarks, marking a crucial step in multimodal reasoning.
InternVL3.5 introduces a new family of open-source multimodal models that enhance versatility, reasoning capabilities, and inference efficiency. A key innovation is the Cascade Reinforcement Learning framework, which improves reasoning tasks significantly while a Visual Resolution Router optimizes visual token resolution. The model achieves notable performance gains and supports advanced capabilities like GUI interaction and embodied agency, positioning it competitively against leading commercial models.
Ollama has launched a new app for macOS and Windows that allows users to chat with models, process files through drag and drop, and utilize a multimodal engine for image interaction. The app also supports increased context length for handling larger documents and provides options for documentation writing. Users can download the app or access CLI versions from Ollama's GitHub releases page.
Google AI Studio has introduced new features and capabilities for developers using the Gemini API, including enhanced code generation with Gemini 2.5 Pro, multimodal media generation, and improved deployment options via Cloud Run. The platform supports interactive app development and offers advanced audio dialogue and text-to-speech functionalities, making it easier to build intuitive, AI-powered applications. Additional tools like the Model Context Protocol and URL Context are also available for deeper integration and content retrieval.
VARGPT-v1.1 is a powerful multimodal model that enhances visual understanding and generation capabilities through iterative instruction tuning and reinforcement learning. It includes extensive code releases for training, inference, and evaluation, as well as a comprehensive structure for multimodal tasks such as image captioning and visual question answering. The model's checkpoints and datasets are available on Hugging Face, facilitating further research and application development.
Meta has launched Llama 4, introducing two new AI models, Llama 4 Scout and Llama 4 Maverick, now available for use in WhatsApp, Messenger, and Instagram. The Maverick model is designed for general assistant tasks and excels in image and text understanding, while Scout focuses on multi-document summarization and personalized tasks. Additionally, Meta is set to release a third model, Llama 4 Behemoth, with nearly two trillion total parameters, and another model, Llama 4 Reasoning, in the near future.
Vision Language Models (VLMs) have evolved significantly over the past year, showcasing advancements in any-to-any architectures, reasoning capabilities, and the emergence of multimodal agents. New trends include smaller yet powerful models, innovative alignment techniques, and the introduction of Vision-Language-Action models that enhance robotic interactions. The article highlights key developments and model recommendations in the rapidly growing field of VLMs.
Command A Vision is a state-of-the-art vision-language model designed for business applications, excelling in multimodal tasks such as document OCR and image analysis. With a 112B parameter architecture, it outperforms competitors like GPT-4.1 and Llama 4 Maverick on various benchmarks, making it a powerful tool for enterprises seeking to automate processes and enhance decision-making. The model is available with open weights for community use.
3D CoCa is a unified framework for 3D captioning that integrates contrastive vision-language learning with 3D caption generation. By leveraging a frozen CLIP backbone and a spatially-aware 3D scene encoder, it jointly optimizes contrastive and captioning objectives in a shared feature space, leading to improved spatial reasoning and semantic grounding. Extensive experiments show that 3D CoCa surpasses existing methods, achieving significant performance gains on benchmark datasets.
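The paper's implementation is not reproduced here; a generic sketch of a joint contrastive-plus-captioning objective of this kind, with a hypothetical weighting hyperparameter, could look like the following:

```python
# Generic sketch of a joint contrastive + captioning objective of the kind
# 3D CoCa optimizes (not the paper's exact implementation): an InfoNCE loss
# aligns scene and text embeddings, and a cross-entropy loss trains the caption
# decoder; `lam` is a hypothetical weighting hyperparameter.
import torch
import torch.nn.functional as F

def joint_loss(scene_emb, text_emb, caption_logits, caption_targets, temp=0.07, lam=1.0):
    # Contrastive term: matched scene/text pairs sit on the diagonal.
    scene_emb = F.normalize(scene_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = scene_emb @ text_emb.t() / temp
    labels = torch.arange(logits.size(0))
    l_con = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    # Captioning term: token-level cross-entropy over the decoder outputs.
    l_cap = F.cross_entropy(caption_logits.flatten(0, 1), caption_targets.flatten())
    return l_con + lam * l_cap

loss = joint_loss(
    scene_emb=torch.randn(4, 256),
    text_emb=torch.randn(4, 256),
    caption_logits=torch.randn(4, 12, 30522),   # (batch, seq_len, vocab)
    caption_targets=torch.randint(0, 30522, (4, 12)),
)
print(loss)
```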
ScreenCoder is an advanced UI-to-code generation system that converts screenshots or design mockups into production-ready HTML/CSS code using a modular multi-agent architecture. It facilitates easy customization and rapid prototyping, bridging the gap between design and development. The project includes a demo, benchmark dataset, and detailed instructions for setup and usage.
The article introduces the Pico-Banana-400K dataset, a large-scale collection of 400,000 images designed for text-guided image editing. It aims to address the limitations in existing datasets by providing high-quality, diverse edit pairs generated from real photographs, facilitating advanced research in multimodal image editing techniques. The dataset includes specialized subsets for multi-turn editing, preference research, and instruction summarization.