6 links
tagged with all of: multimodal + artificial-intelligence
Links
Google has launched two new models in the Gemini family, Gemini 2.5 Pro and Gemini 2.5 Flash, which significantly enhance video understanding capabilities. The Pro model achieves state-of-the-art performance in various benchmarks and enables innovative applications like interactive learning tools and dynamic animations from video content. Both models facilitate advanced video processing and offer cost-effective solutions for diverse use cases in education and content creation.
SpatialScore introduces a comprehensive benchmark for evaluating multimodal large language models (MLLMs) on spatial understanding, comprising the VGBench dataset and a broader collection of 28K samples. It also presents SpatialAgent, a multi-agent system designed for enhanced spatial reasoning, and its quantitative and qualitative evaluations reveal both persistent challenges and measurable improvements on spatial tasks.
Gemini Robotics 1.5 introduces advanced AI models that enable robots to perceive, plan, and execute complex tasks in the physical world. The models enhance a robot's ability to reason, learn across different embodiments, and interact naturally, marking a significant step towards achieving artificial general intelligence (AGI) in robotics. Developers can access these capabilities through the Gemini API in Google AI Studio.
HunyuanImage-3.0 has been released as an open-source image generation model, featuring a unified multimodal architecture that integrates text and image understanding. With 80 billion parameters, it is billed as the largest open-source Mixture-of-Experts image generation model, delivering strong image generation quality while supporting extensive customization through multiple checkpoints and performance optimizations.
Llama 4 introduces advanced multimodal intelligence that enhances user interactions by integrating multiple data types, such as text, images, and audio. The model aims to improve understanding and generation across modalities, making it more versatile for practical AI applications. Key features include refined training techniques and a user-centric design intended to make AI interactions more intuitive.
Google DeepMind has unveiled the Gemini Robotics models, which enhance robots' capabilities to perform complex tasks through natural language understanding and dexterity. These multimodal models allow robots to adapt to various environments and instructions, paving the way for future applications in everyday life and industry. Carolina Parada emphasizes the potential of embodied AI to transform how robots assist with daily tasks.