Links
Gemini 3 is Google's latest AI model series focused on advanced reasoning and multimodal tasks. It includes different versions like Pro, Flash, and Pro Image, each tailored for specific needs. The article covers key features, API usage, pricing, and new parameters for controlling model behavior.
This article explains how multimodal UX allows users to interact with digital products through various input methods like voice, touch, and gesture. It highlights the importance of designing for real human behavior and improving accessibility and user satisfaction by offering flexible interaction options.
The article discusses the launch of GLM-4.6V and GLM-4.5V, two advanced vision-language models. GLM-4.6V features a 128K context and supports multimodal inputs, while GLM-4.5V excels in visual reasoning across various benchmarks. Both models offer distinct capabilities for image and video analysis.
This article discusses the evolution of data engineering as it adapts to the growing role of AI agents in 2026. It emphasizes the need for reliability, context, and safety within data platforms, highlighting the shift from human-centric workflows to autonomous systems that require new architectural approaches.
This article presents a codebase for a study on how unified multimodal models (UMMs) enhance reasoning by integrating visual generation. The research introduces a new evaluation suite, VisWorld-Eval, which assesses multimodal reasoning capabilities across various tasks. Experiments show that interleaved visual-verbal reasoning outperforms purely verbal methods in specific contexts.
This article discusses various Qwen models, including Qwen3, Qwen3-Omni, and Qwen3-Next. These models offer advanced features for text, image, audio, and video processing, aiming to improve efficiency and performance in AI applications. The post also includes links to demos and resources for developers.
This article outlines a method for training judges for Vision-Language Models (VLMs) without human annotations. The approach uses self-synthesized data in an iterative process to improve judgment accuracy, resulting in notable performance gains on various evaluation benchmarks.
This article discusses a new method for understanding user intent by breaking down interactions on mobile devices into two stages. By summarizing individual screens and then extracting intent from those summaries, small models can achieve results similar to larger models without needing server processing. The approach improves efficiency and maintains user privacy.
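The two-stage split described above can be sketched as a simple map-then-reduce pipeline. This is a minimal illustration, not the article's actual implementation: the stand-in "models" here are trivial string functions, where a real system would call a small on-device language model at each stage.

```python
def summarize_screen(screen_text: str) -> str:
    # Stage 1 stand-in: compress each screen independently.
    # (Here: keep only the first sentence; really a small LM call.)
    return screen_text.split(".")[0].strip()

def extract_intent(summaries: list[str]) -> str:
    # Stage 2 stand-in: derive intent from the concatenated summaries.
    # (Here: join them; really a second small-LM prompt.)
    return " -> ".join(summaries)

def infer_intent(screens: list[str]) -> str:
    # Per-screen summarization keeps every stage-1 input short, which is
    # what lets a small local model handle long multi-screen sessions
    # without sending raw screen content to a server.
    return extract_intent([summarize_screen(s) for s in screens])
```

The privacy and efficiency gains come from the structure itself: no single model call ever sees more than one screen's raw content plus short summaries.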
Google unveiled Gemini 3, an advanced AI model designed to enhance coding and development workflows. It supports agentic coding, multimodal understanding, and allows users to create complex applications with simple prompts. Key features include the new Google Antigravity platform and improved tools for document and video reasoning.
Baidu released the ERNIE-4.5-VL-28B-A3B-Thinking, an AI model that claims to outperform Google and OpenAI’s offerings in visual reasoning while using fewer computing resources. The model features a unique dynamic image analysis capability that mimics human problem-solving. It’s designed for enterprise applications, including document processing and manufacturing quality control.
OpenAI has integrated voice mode into the main ChatGPT interface, allowing users to see text responses in real-time while speaking. This update eliminates the previous need to switch between separate screens, enhancing conversational flow and accessibility. Users can toggle back to the old voice mode if they prefer.
The GLM-4.6V series introduces two open-source multimodal models, designed for both high-performance cloud use and local deployment. It features a 128k token context window and native tool calling, enabling seamless integration of visual and textual inputs for tasks like content creation and web search.
Google has launched Gemini 3 Flash, a new model that enhances speed and reduces costs while maintaining advanced reasoning capabilities. It’s available for developers through various platforms and is rolling out to general users in the Gemini app and AI Mode in Search.
Google has launched Gemini 3, its most advanced AI model yet, which improves multimodal understanding and reasoning capabilities. It aims to assist users in learning, building, and planning by providing more nuanced and context-aware responses. The model is integrated across various Google products and available for developers.
Kimi K2.5 is an open-source multimodal model that enhances coding and vision tasks. It can self-direct up to 100 sub-agents for parallel workflows, significantly improving execution speed and efficiency. The model excels in real-world software engineering and office productivity tasks.
Bytedance has introduced Seedance 2.0, a multimodal AI video generation tool that combines images, videos, audio, and text to create short clips with automatic sound effects. The model features a unique reference capability, allowing users to replicate camera work and effects from uploaded videos. This release coincides with increased competition from Kuaishou's Kling 3.0, boosting share prices in the Chinese media and AI sectors.
Youtu-VL is a 4B-parameter Vision-Language Model that excels in both vision-centric and general multimodal tasks without needing task-specific modules. It uses a unique autoregressive supervision method to enhance visual understanding and preserve detailed information. The model supports various applications, from image classification to visual question answering.
The article discusses the development of Odyssey-2 Pro, a world simulator that predicts how the world evolves using vast amounts of video and interaction data instead of pre-defined rules. This approach allows the model to learn complex structures like physics and human behavior, enabling more interactive and stateful simulations.
This article presents Dynalang, an agent that connects language understanding with future predictions to improve task performance. Unlike traditional agents, Dynalang learns from both past and future language, enabling it to handle a variety of tasks more effectively. It can also be pretrained on text and video datasets without needing direct actions or rewards.
NVIDIA has released the Nemotron ColEmbed V2 models, designed for efficient multimodal document retrieval. These models utilize a late-interaction embedding approach to improve accuracy in handling text, images, and structured visual data. They perform well on the ViDoRe V3 benchmark, making them suitable for applications like multimedia search engines and conversational AI.
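Late-interaction scoring (popularized by ColBERT-style retrievers) keeps one embedding per token and compares them at query time. The sketch below shows the standard MaxSim formulation under that assumption; it is illustrative, not Nemotron ColEmbed's exact scoring code.

```python
import numpy as np

def maxsim_score(query_toks: np.ndarray, doc_toks: np.ndarray) -> float:
    """Late-interaction relevance: for each query token embedding, take its
    maximum similarity over all document token embeddings, then sum."""
    # query_toks: (Q, d), doc_toks: (D, d); rows assumed L2-normalized.
    sims = query_toks @ doc_toks.T           # (Q, D) token-level similarities
    return float(sims.max(axis=1).sum())     # best document match per query token

# Toy example: 2 query tokens, 3 document tokens in a 2-d embedding space.
q = np.array([[1.0, 0.0], [0.0, 1.0]])
d = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
score = maxsim_score(q, d)
```

Because document token embeddings can be precomputed and indexed, only the cheap `max`/`sum` interaction runs at query time, which is what makes the approach practical for large multimodal document stores.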
Zhipu AI has released GLM-4.7, a new version of its General Language Model designed for advanced coding and multimodal tasks. It improves reasoning capabilities and supports both text and vision inputs, making it suitable for developers and enterprises. The model features enhanced APIs for real-time and batch processing, aligning with demands for more sophisticated AI applications.
The article reviews Google’s Gemini 3 Pro, highlighting its improved features over Gemini 2.5, including audio transcription capabilities and performance benchmarks compared to other AI models. It details pricing, multimodal input support, and tests involving image analysis and a city council meeting audio transcript.
Qwen has released the Qwen3-VL-Embedding and Qwen3-VL-Reranker models, designed for advanced multimodal information retrieval and cross-modal understanding. These models support various inputs, including text and images, and enhance retrieval accuracy through a two-stage process of initial recall and precise re-ranking.
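The two-stage recall-then-rerank pattern mentioned above is model-agnostic and can be sketched in a few lines. This is a generic illustration, assuming dot-product recall and an arbitrary (more expensive) reranking scorer, not Qwen's specific models.

```python
import numpy as np

def recall_then_rerank(query_vec, doc_vecs, rerank_fn, k=3):
    """Stage 1: cheap embedding recall of the top-k candidates by dot product.
    Stage 2: precise re-ranking of only those k with a costlier scorer."""
    coarse = doc_vecs @ query_vec                 # (N,) similarity scores
    candidates = np.argsort(-coarse)[:k]          # top-k document indices
    reranked = sorted(candidates, key=lambda i: -rerank_fn(i))
    return [int(i) for i in reranked]

# Toy setup: 4 "documents" as 2-d vectors, and a stand-in reranker that
# simply prefers higher indices among the recalled candidates.
docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]])
order = recall_then_rerank(np.array([1.0, 0.0]), docs, rerank_fn=lambda i: i, k=2)
```

The design point is cost asymmetry: the embedding model scores all N documents cheaply, while the reranker (typically a cross-encoder over query plus document) runs on only k of them.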
Multimodal vector databases like ApertureDB are revolutionizing how industries manage and verify data, particularly in healthcare advertising. By integrating various data types and employing AI tools, these databases enhance compliance by detecting omissions in marketing content, ensuring that critical information is accurately conveyed to patients.
Liquid is an innovative auto-regressive model that integrates visual comprehension and generation by tokenizing images into discrete codes and learning them alongside text tokens. This multimodal large language model operates within a shared feature space, allowing for seamless understanding and generation without relying on external visual embeddings. Liquid is available in multiple sizes and explores the scaling laws of multimodal models, revealing mutual benefits between understanding and generation tasks.
Daft is a distributed query engine designed for large-scale data processing using Python or SQL, built with Rust. It offers a familiar interactive API, powerful query optimization, and seamless integration with data catalogs and multimodal types, making it suitable for complex data operations in cloud environments. Daft supports interactive and distributed computing, allowing users to efficiently handle diverse data types and perform operations across large clusters.
SpatialScore introduces a comprehensive benchmark for evaluating multimodal large language models (MLLMs) in spatial understanding, consisting of the VGBench dataset and an extensive collection of 28K samples. It features the SpatialAgent, a multi-agent system designed for enhanced spatial reasoning, and reveals persistent challenges and improvements in spatial tasks through quantitative and qualitative evaluations.
Salesforce discusses the development of real-time multimodal AI pipelines capable of processing up to 50 million file uploads daily. The article highlights the challenges and solutions involved in scaling file processing to meet the demands of modern data workflows. Key techniques and technologies that enable efficient processing are also emphasized.
Ollama has introduced a new engine that supports multimodal models, emphasizing improved accuracy, model modularity, and memory management. The update allows for better integration of vision and text models, enhancing the capabilities of local inference for various applications, including image recognition and reasoning. Future developments will focus on supporting longer context sizes and enabling advanced functionalities.
Google has launched two new models in the Gemini family, Gemini 2.5 Pro and Gemini 2.5 Flash, which significantly enhance video understanding capabilities. The Pro model achieves state-of-the-art performance in various benchmarks and enables innovative applications like interactive learning tools and dynamic animations from video content. Both models facilitate advanced video processing and offer cost-effective solutions for diverse use cases in education and content creation.
OmniSVG is a unified framework for generating high-quality scalable vector graphics (SVG) using pre-trained Vision-Language Models (VLMs), which decouples structural logic from low-level geometry. It introduces the MMSVG-2M dataset with two million annotated SVG assets and supports multiple generation modalities, demonstrating superior performance over existing methods for diverse creative tasks. The model is designed to handle complexity ranging from simple icons to intricate illustrations, offering flexibility for professional design workflows.
KGMEL is a novel framework for multimodal entity linking that enhances the alignment of textual mentions with knowledge base entities by incorporating knowledge graph (KG) triples. It operates in three stages: generating high-quality triples, learning joint representations through contrastive learning, and refining candidate entities using large language models. Experimental results show that KGMEL outperforms existing methods in accuracy and efficiency.
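The "joint representations through contrastive learning" stage typically means an InfoNCE-style objective over aligned (mention, entity) pairs. The sketch below is a generic version of that loss under this assumption, not KGMEL's exact training code.

```python
import numpy as np

def info_nce_loss(mentions: np.ndarray, entities: np.ndarray, tau: float = 0.1) -> float:
    """Contrastive loss over aligned (mention_i, entity_i) pairs: each mention
    should score its own entity above every other entity in the batch."""
    m = mentions / np.linalg.norm(mentions, axis=1, keepdims=True)
    e = entities / np.linalg.norm(entities, axis=1, keepdims=True)
    logits = (m @ e.T) / tau                        # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diag(log_probs).mean())        # NLL of the true pairs

# Perfectly aligned embeddings drive the loss toward zero.
loss = info_nce_loss(np.eye(3), np.eye(3))
```

In the KGMEL setting the mention side would also encode the generated KG triples, so the contrastive objective pulls mention-plus-triple representations toward their knowledge-base entities.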
User interfaces (UI) are not disappearing due to advancements in AI; instead, they are evolving and becoming more essential for effective interaction. AI is driving innovation in UI design, leading to multimodal experiences and hyper-personalization that enhance user engagement and accessibility. The future of UX will involve AI working in tandem with UI, providing users with intuitive controls and feedback rather than relying solely on text or voice interfaces.
OmDet-Turbo is a real-time open-vocabulary object detection model that integrates components from RT-DETR and features an Efficient Fusion Head for enhanced performance. It achieves impressive results with up to 100.2 FPS and 53.4 AP on COCO zero-shot, making it suitable for industrial applications that require rapid and accurate detection capabilities. The model's unique architecture allows for efficient text embedding caching, improving the decoding process for object detection tasks.
R-4B is a multimodal large language model that enhances general-purpose auto-thinking by dynamically switching between thinking and non-thinking modes based on task complexity. It employs a two-stage training approach to improve response efficiency and reduce computational costs, achieving state-of-the-art performance among similar models. The model is open-source and offers user control over its thinking capabilities.
OmniCaptioner is a versatile visual captioning framework designed to generate detailed textual descriptions across various visual domains, including natural images, visual text, and structured visuals. It enhances visual reasoning with large language models (LLMs), improves image generation tasks, and allows for efficient supervised fine-tuning by converting pixel data into rich semantic representations. The framework aims to bridge the gap between visual and textual modalities through a unified multimodal pretraining approach.
Daily-Omni is introduced as a new benchmark for audio-visual reasoning, featuring 684 videos and 1197 QA pairs across various tasks. The study highlights the challenges faced by current multimodal large language models in integrating audio and visual information, while demonstrating that combining visual and audio models with temporal alignment techniques can enhance performance. The paper also presents a QA generation pipeline to improve efficiency and scalability in evaluation.
The article discusses the integration of multimodal large language models (LLMs) into various applications, highlighting their ability to process and generate content across different modalities such as text, images, and audio. It emphasizes the advancements in model architectures and training techniques that enhance the performance and versatility of these models in real-world scenarios. Additionally, the piece explores potential use cases and the impact of multimodal capabilities on industries and user interactions.
Gemini Robotics 1.5 introduces advanced AI models that enable robots to perceive, plan, and execute complex tasks in the physical world. The models enhance a robot's ability to reason, learn across different embodiments, and interact naturally, marking a significant step towards achieving artificial general intelligence (AGI) in robotics. Developers can access these capabilities through the Gemini API in Google AI Studio.
Google Search has introduced a significant update to its AI Mode, allowing users to conduct visual searches more naturally by asking questions conversationally or uploading images. This update enhances shopping experiences by providing relevant visual results based on user descriptions, supported by a robust Shopping Graph that refreshes product listings frequently. The new features leverage advanced visual understanding and multimodal capabilities to refine search results and improve user engagement.
DoorDash has launched robot deliveries in Los Angeles and Chicago through a partnership with Coco Robotics, allowing eligible customers to receive deliveries from over 600 merchants. This initiative is part of DoorDash's strategy to incorporate multimodal delivery options, which include human workers, drones, and autonomous robots, aiming to reduce costs and environmental impact.
Google has released updated versions of the Gemini 2.5 Flash and Flash-Lite models, enhancing quality and efficiency with significant reductions in output tokens and improved capabilities in instruction following, conciseness, and multimodal functions. The updates aim to facilitate better performance in complex applications while allowing users to easily access the latest models through new aliases.
The article focuses on multimodal data analytics, emphasizing its significance in extracting insights from various types of data sources, such as text, images, and audio. It provides practical guidance on methodologies and tools that can be employed to leverage multimodal data for enhanced decision-making and predictive analytics. The content underscores the importance of integrating different modalities to improve the accuracy and depth of data analysis.
Google has introduced AI Mode in Search, enhancing user experience with advanced reasoning and multimodal capabilities, allowing for deeper inquiries and personalized responses. The new features include Deep Search for thorough research, live interaction with visual search, agentic capabilities for task management, and tailored suggestions based on user context. These updates aim to transform Google Search from a mere information tool to a comprehensive intelligence platform.
MingTok introduces the first continuous unified tokenizer for vision, enabling seamless integration of image understanding and generation within a single framework. This innovation leads to 3.5x faster convergence by aligning semantic understanding and generative dynamics, allowing for efficient multi-turn interactions without the costly detours seen in previous models. Ming-UniVision, built on MingTok, effectively harmonizes these tasks, paving the way for more intuitive multimodal AI systems.
Multimodal learning faces challenges when modalities differ between development and deployment due to various factors, including perceived informativeness and missing data. The framework ICYM2I (In Case You Multimodal Missed It) is introduced to address biases in estimating information gain from modalities under missingness, using inverse probability weighting-based correction. The effectiveness of this approach is demonstrated through synthetic and real-world medical datasets.
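The core of an inverse-probability-weighting correction can be shown on a scalar toy problem: when high values go missing more often, the naive mean over observed entries is biased, and weighting each observed value by 1/P(observed) removes that bias. This is a minimal illustration of the IPW idea, not the ICYM2I estimator itself.

```python
import numpy as np

def ipw_mean(values: np.ndarray, observed: np.ndarray, p_obs: np.ndarray) -> float:
    """Inverse-probability-weighted mean: each observed value is up-weighted
    by 1 / P(observed) to correct for non-random missingness."""
    w = observed / p_obs                          # weight is 0 for missing entries
    return float((w * values).sum() / w.sum())    # normalized (Hajek) estimator

# Toy example: high values are observed only half the time, so the naive
# mean of observed entries (10/3 ~ 3.33) underestimates the true mean (5.0).
vals = np.array([0.0, 0.0, 10.0, 10.0])
obs  = np.array([1.0, 1.0, 1.0, 0.0])   # one high value went missing
p    = np.array([1.0, 1.0, 0.5, 0.5])   # high values observed w.p. 0.5
```

ICYM2I applies the same weighting logic to information-gain estimates for whole modalities rather than scalar outcomes, but the bias-correction mechanism is the one above.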
The paper presents BLIP3-o, a family of fully open unified multimodal models that enhance both image understanding and generation. It introduces a diffusion transformer for generating CLIP image features, advocates for a sequential pretraining strategy, and proposes a high-quality dataset, BLIP3o-60k, to improve performance across various benchmarks. The models, along with code and datasets, are open-sourced to foster further research.
AMIE, a multimodal conversational AI agent developed by Google DeepMind, has been enhanced to intelligently request and interpret visual medical information during clinical dialogues, emulating the structured history-taking of experienced clinicians. Evaluations show that AMIE can match or exceed primary care physicians in diagnostic accuracy and empathy while utilizing multimodal data effectively in simulated consultations. Ongoing research aims to further refine AMIE's capabilities using advanced models and assess its performance in real-world clinical settings.
Gemini models 2.5 Pro and Flash are revolutionizing robotics with advanced coding, reasoning, and multimodal capabilities, enhancing robots' spatial understanding. Developers can utilize these models and the Live API for applications such as semantic scene understanding, spatial reasoning, and interactive robotics, enabling robots to execute complex tasks through voice commands and code generation. The article highlights practical examples and the potential of Gemini's embodied reasoning model in various robotics applications.
LMEval, an open-source framework developed by Google, simplifies the evaluation of large language models across various providers by offering multi-provider compatibility, incremental evaluation, and multimodal support. With features like a self-encrypting database and an interactive visualization tool called LMEvalboard, it enhances the benchmarking process, making it easier for developers and researchers to assess model performance efficiently.
Mini-o3 introduces an advanced system that enhances tool-based interactions for visual reasoning by supporting deep, multi-turn reasoning and achieving state-of-the-art performance on visual search tasks. The system utilizes a novel over-turn masking strategy to effectively manage response lengths during reinforcement learning, combined with a comprehensive dataset designed for exploratory reasoning. Open-source code and models are provided to facilitate reproducibility and further research.
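One plausible reading of the over-turn masking strategy, sketched below: trajectories that hit the turn budget are masked out of the policy-gradient update (their advantage zeroed) rather than penalized, so long exploratory reasoning is not actively discouraged. This is an interpretive sketch, not Mini-o3's published training code.

```python
import numpy as np

def over_turn_mask(advantages: np.ndarray, n_turns: np.ndarray, max_turns: int) -> np.ndarray:
    """Zero the advantage of any trajectory that exhausted the turn budget,
    excluding it from the RL loss instead of assigning a negative reward."""
    mask = (n_turns < max_turns).astype(advantages.dtype)
    return advantages * mask

adv = np.array([1.0, -0.5, 2.0])
turns = np.array([3, 6, 5])      # the second trajectory hit the 6-turn cap
masked = over_turn_mask(adv, turns, max_turns=6)
```

The design intuition: punishing truncated rollouts teaches the policy to cut reasoning short, while masking them keeps the gradient neutral on response length and lets multi-turn depth emerge.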
HunyuanImage-3.0 has been released as an open-source image generation model, featuring a unified multimodal architecture that integrates text and image understanding. It is billed as the largest Mixture-of-Experts image model to date, with 80 billion parameters, enabling superior image generation while supporting extensive customization through various checkpoints and performance optimizations.
Voxtral Mini and Voxtral Small are two multimodal audio chat models designed to understand both spoken audio and text. They achieve state-of-the-art performance on various audio benchmarks while maintaining strong text capabilities, with Voxtral Small efficient enough for local deployment. Both models offer a 32K context window for processing lengthy audio and multi-turn conversations, and ship with three new benchmarks for evaluating speech understanding on knowledge and trivia questions.
LLaMA 4 introduces advanced multimodal intelligence capabilities that enhance user interactions by integrating various data types such as text, images, and audio. The model aims to improve understanding and generation across different modalities, making it more versatile for practical applications in AI. Key features include refined training techniques and a focus on user-centric design to facilitate more intuitive AI experiences.
Google DeepMind has unveiled the Gemini Robotics models, which enhance robots' capabilities to perform complex tasks through natural language understanding and dexterity. These multimodal models allow robots to adapt to various environments and instructions, paving the way for future applications in everyday life and industry. Carolina Parada emphasizes the potential of embodied AI to transform how robots assist with daily tasks.
The repository provides an implementation of the method "Learning Compact Vision Tokens for Efficient Large Multimodal Models," which enhances inference efficiency by fusing spatial-adjacent vision tokens and introducing a Multi-Block Token Fusion module. Experimental results show that this approach achieves competitive performance on various vision-language benchmarks while using only 25% of the baseline vision tokens.
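The reported 25% token budget is consistent with fusing each 2x2 neighborhood of the vision-token grid into one token. The sketch below shows that reduction via simple average pooling; the paper's Multi-Block Token Fusion module is more elaborate, so treat this as a shape-level illustration only.

```python
import numpy as np

def fuse_vision_tokens(tokens: np.ndarray, h: int, w: int) -> np.ndarray:
    """Fuse each 2x2 block of spatially adjacent vision tokens by averaging,
    reducing the token count to 25% of the baseline (h and w assumed even)."""
    d = tokens.shape[-1]
    grid = tokens.reshape(h // 2, 2, w // 2, 2, d)   # expose 2x2 blocks
    return grid.mean(axis=(1, 3)).reshape(-1, d)     # (h * w / 4, d)

# A 4x4 grid of 8-dim tokens is fused into 4 tokens.
toks = np.arange(4 * 4 * 8, dtype=float).reshape(16, 8)
fused = fuse_vision_tokens(toks, h=4, w=4)
```

Because the fused sequence is 4x shorter, the quadratic attention cost over vision tokens drops by roughly 16x, which is where the inference-efficiency gain comes from.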
The VideoChat-R1.5 model has been released on Hugging Face, showcasing improved capabilities in spatio-temporal perception and reasoning through multi-task joint reinforcement learning. It has been accepted at NeurIPS 2025 and builds on previous versions, enhancing video reasoning across various applications. The model utilizes hierarchical human attention during inference for better localization of regions of interest in videos.
Gemini 2.5 Pro Preview has been released ahead of schedule, featuring enhanced capabilities for coding and building interactive web apps. This update builds on positive feedback from the previous version, improving performance in UI development, code transformation, and multimodal reasoning, and now leads the WebDev Arena Leaderboard. Developers can access these features through the Gemini API and Google AI Studio.
Meta's Llama 4 models, including Llama 4 Scout 17B and Llama 4 Maverick 17B, are now available in Amazon Bedrock as a serverless solution, offering advanced multimodal capabilities for applications. These models leverage a mixture-of-experts architecture to enhance performance and support a wide range of use cases, from enterprise applications to customer support and content creation. Users can easily integrate these models into their applications using the Amazon Bedrock Converse API.
Site owners and content creators are encouraged to focus on providing unique and valuable content to succeed in Google's AI search experiences. Key strategies include ensuring a great page experience, meeting technical requirements, managing visibility, and adapting to evolving user needs in search behavior. Emphasizing multimodal content and understanding visitor engagement are also crucial for maximizing the value of search traffic.
Kimi-VL is an open-source Mixture-of-Experts vision-language model that excels in multimodal reasoning and long-context understanding with only 2.8B activated parameters. It demonstrates superior performance in various tasks such as multi-turn interactions, video comprehension, and mathematical reasoning, competing effectively with larger models while maintaining efficiency. The latest variant, Kimi-VL-A3B-Thinking-2506, enhances reasoning and visual perception capabilities, achieving state-of-the-art results in several benchmarks.
Google has introduced Gemma 3n, a new open model designed for optimized on-device AI performance, enabling real-time processing on mobile devices. Built on a cutting-edge architecture in collaboration with hardware leaders, Gemma 3n features advanced capabilities like multimodal understanding, improved multilingual support, and innovations that reduce memory usage. Developers can access a preview of this model now to start building efficient AI applications.
Complete the intermediate course on implementing multimodal vector search with BigQuery, which takes 1 hour and 45 minutes. Participants will learn to use Gemini for SQL generation, conduct sentiment analysis, summarize text, generate embeddings, create a Retrieval Augmented Generation (RAG) pipeline, and perform multimodal vector searches.
Join Javier Hernandez in a webinar on April 24th to explore how HP's AI Studio utilizes multimodal large language models to analyze diverse medical data formats, including text, images, and audio. This session will cover the creation of real-world applications, challenges faced, and strategies for enhancing data-driven decision-making in medical research and diagnostics.
Qwen3-Omni is a cutting-edge multilingual omni-modal foundation model capable of processing text, images, audio, and video, providing real-time streaming responses. It features significant architectural advancements for performance, supports 119 text languages, and offers various applications through detailed cookbooks, including speech recognition, audio captioning, and video analysis. The model is available for use via Hugging Face and ModelScope, with recommendations for optimal performance.
OpenAI's latest models, o3 and o4-mini, enhance visual reasoning capabilities by enabling the integration of image processing within their chain-of-thought, allowing for more thorough analyses and problem-solving. These advancements significantly outperform previous models across various multimodal benchmarks, marking a crucial step in multimodal reasoning.
Google AI Studio has introduced new features and capabilities for developers using the Gemini API, including enhanced code generation with Gemini 2.5 Pro, multimodal media generation, and improved deployment options via Cloud Run. The platform supports interactive app development and offers advanced audio dialogue and text-to-speech functionalities, making it easier to build intuitive, AI-powered applications. Additional tools like the Model Context Protocol and URL Context are also available for deeper integration and content retrieval.
Meta has launched Llama 4, introducing two new AI models, Llama 4 Scout and Llama 4 Maverick, now available for use in WhatsApp, Messenger, and Instagram. The Maverick model is designed for general assistant tasks and excels in image and text understanding, while Scout focuses on multi-document summarization and personalized tasks. Additionally, Meta is set to release a third model, Llama 4 Behemoth, with a significant number of parameters, and another model, Llama 4 Reasoning, in the near future.
VARGPT-v1.1 is a powerful multimodal model that enhances visual understanding and generation capabilities through iterative instruction tuning and reinforcement learning. It includes extensive code releases for training, inference, and evaluation, as well as a comprehensive structure for multimodal tasks such as image captioning and visual question answering. The model's checkpoints and datasets are available on Hugging Face, facilitating further research and application development.
Ollama has launched a new app for macOS and Windows that allows users to chat with models, process files through drag and drop, and utilize a multimodal engine for image interaction. The app also supports increased context length for handling larger documents and provides options for documentation writing. Users can download the app or access CLI versions from Ollama's GitHub releases page.
InternVL3.5 introduces a new family of open-source multimodal models that enhance versatility, reasoning capabilities, and inference efficiency. A key innovation is the Cascade Reinforcement Learning framework, which improves reasoning tasks significantly while a Visual Resolution Router optimizes visual token resolution. The model achieves notable performance gains and supports advanced capabilities like GUI interaction and embodied agency, positioning it competitively against leading commercial models.
Vision Language Models (VLMs) have evolved significantly over the past year, showcasing advancements in any-to-any architectures, reasoning capabilities, and the emergence of multimodal agents. New trends include smaller yet powerful models, innovative alignment techniques, and the introduction of Vision-Language-Action models that enhance robotic interactions. The article highlights key developments and model recommendations in the rapidly growing field of VLMs.
Command A Vision is a state-of-the-art vision-language model designed for business applications, excelling in multimodal tasks such as document OCR and image analysis. With a 112B parameter architecture, it outperforms competitors like GPT-4.1 and Llama 4 Maverick on various benchmarks, making it a powerful tool for enterprises seeking to automate processes and enhance decision-making. The model is available with open weights for community use.
3D CoCa is a unified framework for 3D captioning that integrates contrastive vision-language learning with 3D caption generation. By leveraging a frozen CLIP backbone and a spatially-aware 3D scene encoder, it jointly optimizes contrastive and captioning objectives in a shared feature space, leading to improved spatial reasoning and semantic grounding. Extensive experiments show that 3D CoCa surpasses existing methods, achieving significant performance gains on benchmark datasets.
ScreenCoder is an advanced UI-to-code generation system that converts screenshots or design mockups into production-ready HTML/CSS code using a modular multi-agent architecture. It facilitates easy customization and rapid prototyping, bridging the gap between design and development. The project includes a demo, benchmark dataset, and detailed instructions for setup and usage.
The article introduces the Pico-Banana-400K dataset, a large-scale collection of 400,000 images designed for text-guided image editing. It aims to address the limitations in existing datasets by providing high-quality, diverse edit pairs generated from real photographs, facilitating advanced research in multimodal image editing techniques. The dataset includes specialized subsets for multi-turn editing, preference research, and instruction summarization.