3 links tagged with all of: multimodal + dataset
Links
OmniSVG is a unified framework for generating high-quality scalable vector graphics (SVG) with pre-trained Vision-Language Models (VLMs), decoupling structural logic from low-level geometry. It introduces the MMSVG-2M dataset of two million annotated SVG assets, supports multiple generation modalities, and outperforms existing methods on diverse creative tasks. The model handles complexity ranging from simple icons to intricate illustrations, offering flexibility for professional design workflows.
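As a rough illustration of what decoupling structure from geometry can mean for SVG generation, here is a minimal sketch in which drawing commands become structural tokens while coordinates are quantized into separate geometry tokens. The tokenizer, bin count, and token names are assumptions for illustration, not OmniSVG's actual scheme.

```python
# Hypothetical sketch: serializing an SVG path so that command structure
# ("M", "L", "Z", ...) and quantized coordinates become separate tokens.
from typing import List, Tuple

N_BINS = 200  # assumed coordinate quantization resolution

def quantize(coord: float, lo: float = 0.0, hi: float = 200.0) -> int:
    """Map a continuous coordinate into one of N_BINS discrete buckets."""
    clipped = min(max(coord, lo), hi)
    return int((clipped - lo) / (hi - lo) * (N_BINS - 1))

def tokenize_path(commands: List[Tuple[str, List[float]]]) -> List[str]:
    """Emit one structural token per command, then one token per coordinate."""
    tokens = []
    for cmd, coords in commands:
        tokens.append(f"<CMD_{cmd}>")  # structural logic
        tokens.extend(f"<POS_{quantize(c)}>" for c in coords)  # geometry
    return tokens

# A triangle: "M 10 10 L 100 10 L 55 90 Z"
path = [("M", [10, 10]), ("L", [100, 10]), ("L", [55, 90]), ("Z", [])]
print(tokenize_path(path))
```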
Mini-o3 is a system for tool-based visual reasoning that supports deep, multi-turn reasoning and achieves state-of-the-art performance on visual search tasks. It uses an over-turn masking strategy to manage response lengths during reinforcement learning, combined with a dataset designed for exploratory reasoning. Open-source code and models are provided to facilitate reproducibility and further research.
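A minimal sketch of the over-turn masking idea as described: rollouts that exceed the turn budget are dropped from the policy-gradient loss rather than penalized, so long exploratory trajectories are not punished for running out of turns. Function names, tensor shapes, and the loss form are illustrative assumptions, not Mini-o3's code.

```python
import torch

def masked_policy_loss(logprobs, advantages, exceeded_turn_limit):
    """
    logprobs:            (batch,) summed log-probs of each sampled trajectory
    advantages:          (batch,) advantage estimates
    exceeded_turn_limit: (batch,) bool, True if the rollout was cut off
    Over-turn masking: cut-off rollouts contribute zero loss instead of a
    negative signal, so multi-turn exploration is never discouraged.
    """
    keep = (~exceeded_turn_limit).float()
    per_traj = -logprobs * advantages * keep
    denom = keep.sum().clamp(min=1.0)  # average only over kept rollouts
    return per_traj.sum() / denom

# Toy usage: the third rollout exceeded the turn budget and is masked out.
lp = torch.tensor([-1.2, -0.8, -2.5])
adv = torch.tensor([0.5, -0.3, 1.0])
over = torch.tensor([False, False, True])
print(masked_policy_loss(lp, adv, over))
```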
Pico-Banana-400K is a large-scale dataset of 400,000 images for text-guided image editing. It addresses limitations of existing datasets by providing high-quality, diverse edit pairs generated from real photographs, supporting research in multimodal image editing. The dataset includes specialized subsets for multi-turn editing, preference research, and instruction summarization.
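For a concrete picture of how a multi-turn editing subset might be organized, here is a hypothetical record layout: one source photograph followed by a chain of instruction/result pairs. The schema, field names, and paths are assumptions for illustration, not Pico-Banana-400K's published format.

```python
# Hypothetical record layout for a multi-turn editing example; all field
# names and paths are illustrative, not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class EditTurn:
    instruction: str  # text-guided edit, e.g. "make the sky overcast"
    image_path: str   # image after applying this turn's edit

@dataclass
class MultiTurnExample:
    source_image: str                              # original real photograph
    turns: List[EditTurn] = field(default_factory=list)

example = MultiTurnExample(
    source_image="photos/0001.jpg",
    turns=[
        EditTurn("remove the power lines", "edits/0001_t1.jpg"),
        EditTurn("make the sky overcast", "edits/0001_t2.jpg"),
    ],
)
print(len(example.turns), "chained edits")
```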