Links
ShapeR offers a method for generating 3D shapes from image sequences. It processes input images to extract relevant data, then uses a transformer model to create a mesh representation of each object in the scene. The project includes tools for setup, data exploration, and evaluation.
Meta has released SAM 3 and SAM 3D, new image segmentation models that enhance object recognition and enable 3D reconstruction from images. SAM 3 lets users edit images through detailed text prompts, while SAM 3D can rebuild objects and people in 3D. Both models aim to improve creative applications and user interactions across digital environments.
Depth Anything 3 (DA3) is a model designed for accurate depth estimation and 3D geometry recovery from various visual inputs, regardless of camera pose. It simplifies the process using a single transformer backbone and a depth-ray representation, outperforming previous models in both monocular and multi-view scenarios. Various specialized models within the DA3 series cater to different depth estimation tasks.
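As a geometric aside, the depth-ray idea can be pictured as predicting a ray direction and a depth for every pixel, from which 3D points follow without needing the camera pose up front. A minimal unprojection sketch, purely illustrative and not DA3's actual prediction heads:

```python
# Unprojection sketch for a depth-ray style output: per-pixel ray directions
# plus depths determine 3D points in the camera frame, no pose required.
# A minimal geometric illustration, not DA3's actual model.
import torch

H, W = 4, 4
rays = torch.randn(H, W, 3)                    # predicted per-pixel ray dirs
rays = rays / rays.norm(dim=-1, keepdim=True)  # normalize to unit length
depth = torch.rand(H, W, 1) * 5                # predicted per-pixel depth

points = depth * rays                          # 3D point cloud, camera frame
print(points.shape)                            # torch.Size([4, 4, 3])
```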
The article explains how optical character recognition (OCR) models, like deepseek-ocr, convert images of text into machine-readable formats. It details the roles of the encoder and decoder in transforming visual data into structured text, and highlights how learned representations have reduced the need for hand-coded rules.
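To make the encoder/decoder split concrete, here is a minimal PyTorch sketch: a patch encoder turns the image into visual tokens, and a causal decoder predicts text tokens against them. Shapes and layer counts are illustrative and do not reflect deepseek-ocr's actual architecture.

```python
# Minimal encoder-decoder OCR sketch (illustrative, not deepseek-ocr's design).
import torch
import torch.nn as nn

class TinyOCR(nn.Module):
    def __init__(self, vocab_size=100, d_model=128, patch=16):
        super().__init__()
        # Encoder: turn the image into a sequence of visual tokens.
        self.patch_embed = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Decoder: predict text tokens one at a time, attending to visual tokens.
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, image, text_tokens):
        # image: (B, 3, H, W) -> visual tokens (B, N, d_model)
        vis = self.patch_embed(image).flatten(2).transpose(1, 2)
        memory = self.encoder(vis)
        # Causal mask so each position only sees earlier text tokens.
        T = text_tokens.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(self.tok_embed(text_tokens), memory, tgt_mask=mask)
        return self.head(out)  # (B, T, vocab_size) next-token logits

model = TinyOCR()
logits = model(torch.randn(1, 3, 64, 256), torch.randint(0, 100, (1, 12)))
print(logits.shape)  # torch.Size([1, 12, 100])
```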
Meta has introduced Segment Anything Model 3 (SAM 3), which enhances object detection, segmentation, and tracking in images and videos using text and visual prompts. The release includes model checkpoints, a new playground for experimentation, and applications in platforms like Facebook Marketplace and Instagram's Edits app. SAM 3 also features a data engine that combines AI and human annotators to speed up image and video annotation.
The early days of computer vision saw significant innovation despite memory constraints, exemplified by the Efficient Chain-Linking Algorithm developed at Inria in the late 1980s. This algorithm showcases how to process images efficiently by dynamically linking pixel chains while minimizing memory usage, a technique that remains relevant even with modern advancements in computer vision. The preservation of this legacy code is part of a broader initiative to archive important historical software from Inria.
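The core idea can be sketched in a few lines: scan the edge map row by row, extend any chain whose tail is adjacent, and flush a chain the moment it can no longer grow, so memory holds only active chains rather than the whole image. A loose illustration of the principle, not Inria's original implementation:

```python
# Row-by-row chain linking of edge pixels: only chains that can still grow
# stay in memory. A loose sketch of the idea, not Inria's original code.
def link_chains(edges):
    """edges: 2D list of 0/1 values. Yields chains as lists of (row, col)."""
    active = []  # chains whose tail might still be extended
    for r, row in enumerate(edges):
        for c, v in enumerate(row):
            if not v:
                continue
            # Try to append to a chain whose tail is 8-adjacent to (r, c).
            for chain in active:
                tr, tc = chain[-1]
                if abs(tr - r) <= 1 and abs(tc - c) <= 1:
                    chain.append((r, c))
                    break
            else:
                active.append([(r, c)])  # no neighbor found: start a new chain
        # A chain whose tail sits above row r can never reach row r + 1,
        # so it is complete: emit it and release its memory immediately.
        done = [ch for ch in active if ch[-1][0] < r]
        active = [ch for ch in active if ch[-1][0] >= r]
        yield from done
    yield from active

edge_map = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
for chain in link_chains(edge_map):
    print(chain)  # [(0, 1), (1, 2), (2, 3)]
```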
A novel photograph relighting method allows users to control various light sources with physical accuracy, integrating traditional and neural rendering techniques. By employing a self-supervised training approach, the system reconstructs scene illumination from real-world images, facilitating in-the-wild relighting applications akin to those in 3D computer graphics tools.
Pippo is a generative model designed to create high-resolution dense turnaround videos of individuals from a single casual photograph, utilizing a multi-view diffusion transformer without the need for additional inputs. The codebase includes training configurations for various resolutions, sample training code, and methods for preparing custom datasets. Future updates are planned to enhance the functionality and usability of the model.
FARMER is a novel generative framework that integrates Normalizing Flows and Autoregressive models for effective likelihood estimation and high-quality image synthesis directly from raw pixel data. It incorporates an invertible autoregressive flow to convert images into latent sequences and employs a self-supervised dimension reduction method to optimize the modeling process. Experimental results show that FARMER achieves competitive performance compared to existing models while ensuring exact likelihoods and scalable training.
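The exact-likelihood property comes from the change-of-variables formula: an invertible autoregressive map has a triangular Jacobian, so its log-determinant is just the sum of per-dimension log-scales. A toy sketch of that mechanism, not FARMER's architecture:

```python
# Toy autoregressive affine flow with an exact log-likelihood, illustrating
# the change-of-variables idea FARMER builds on (not FARMER's actual model).
import math
import torch
import torch.nn as nn

class ARFlow(nn.Module):
    def __init__(self, dim=3, hidden=32):
        super().__init__()
        self.dim = dim
        # One conditioner per dimension, fed the (zero-padded) prefix x_{<i};
        # each outputs a shift mu_i and a log-scale s_i.
        self.cond = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 2))
             for _ in range(dim)]
        )

    def log_prob(self, x):
        z = torch.zeros_like(x)
        log_det = torch.zeros(x.size(0))
        for i in range(self.dim):
            prefix = torch.cat([x[:, :i], torch.zeros(x.size(0), self.dim - i)], dim=1)
            mu, s = self.cond[i](prefix).chunk(2, dim=1)
            # Invertible per-dimension affine map: z_i = (x_i - mu_i) * exp(-s_i)
            z[:, i] = ((x[:, i:i+1] - mu) * torch.exp(-s)).squeeze(1)
            log_det -= s.squeeze(1)  # log|det dz/dx| accumulates -s_i
        # Exact likelihood: standard-normal base density plus the Jacobian term.
        log_base = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=1)
        return log_base + log_det

flow = ARFlow()
print(flow.log_prob(torch.randn(4, 3)))  # one exact log-density per sample
```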
MaskMark is a novel framework for image watermarking that offers two variants: MaskMark-D for global and local watermark extraction, and MaskMark-ED for enhanced robustness in localized areas. It employs a masking mechanism during the decoding and encoding stages to improve accuracy and adaptability while maintaining high visual quality. Experimental results demonstrate its superior performance over existing models, requiring significantly less computational cost.
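The masking idea can be illustrated by training a bit decoder on images with most of their content zeroed out, so it learns to recover the watermark from a local region. A toy sketch with a placeholder decoder, not MaskMark's actual networks or losses:

```python
# Sketch of the masking idea: zero out regions of the watermarked image so
# the decoder learns to recover the bits from a local area. Illustrative
# only; MaskMark's encoder/decoder and training losses are more elaborate.
import torch
import torch.nn as nn
import torch.nn.functional as F

decoder = nn.Sequential(  # placeholder bit decoder: image -> 32 bit logits
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 32)
)

watermarked = torch.rand(8, 3, 64, 64)          # stand-in watermarked images
bits = torch.randint(0, 2, (8, 32)).float()     # embedded watermark bits

# Rectangular mask: keep only a local window of each image during training.
mask = torch.zeros(8, 1, 64, 64)
mask[:, :, 16:48, 16:48] = 1.0
masked = watermarked * mask

loss = F.binary_cross_entropy_with_logits(decoder(masked), bits)
print(f"bit-recovery loss: {loss.item():.4f}")
```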
SVAD is a novel method for creating high-quality 3D human avatars from a single image by combining the strengths of video diffusion models and 3D Gaussian Splatting techniques. It generates synthetic training data, enhances identity preservation, and enables real-time rendering, outperforming existing single-image methods in maintaining consistency and detail. Evaluations show SVAD's effectiveness in generating robust avatars from diverse sources while allowing for text-guided editing of avatar attributes.
PixelFlow introduces a novel approach to image generation by operating directly in raw pixel space, eliminating the need for pre-trained Variational Autoencoders. This method enhances the image generation process with efficient cascade flow modeling, achieving a competitive FID score of 1.98 on the ImageNet benchmark while offering high-quality and semantically controlled image outputs. The work aims to inspire future developments in visual generation models.
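A single pixel-space flow training step is easy to sketch: interpolate between noise and image along a straight path and regress the constant velocity. This illustrates the underlying flow-matching objective only; PixelFlow's cascaded multi-resolution stages are more involved.

```python
# Minimal pixel-space flow-matching step: regress the velocity that carries
# noise to the image along a straight path. Illustrative only; PixelFlow's
# cascade operates across multiple resolutions.
import torch
import torch.nn as nn

net = nn.Sequential(  # stand-in velocity field; real models also condition on t
    nn.Conv2d(3, 32, 3, padding=1), nn.SiLU(), nn.Conv2d(32, 3, 3, padding=1)
)
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

images = torch.rand(8, 3, 32, 32) * 2 - 1   # fake batch scaled to [-1, 1]
noise = torch.randn_like(images)
t = torch.rand(8, 1, 1, 1)                  # random time per sample

x_t = (1 - t) * noise + t * images          # straight-line path noise -> image
target_v = images - noise                   # its constant velocity
loss = ((net(x_t) - target_v) ** 2).mean()  # regress the velocity field
loss.backward()
opt.step()
print(f"flow-matching loss: {loss.item():.4f}")
```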
The paper presents BLIP3-o, a family of fully open unified multimodal models that enhance both image understanding and generation. It introduces a diffusion transformer for generating CLIP image features, advocates for a sequential pretraining strategy, and proposes a high-quality dataset, BLIP3o-60k, to improve performance across various benchmarks. The models, along with code and datasets, are open-sourced to foster further research.
An Apple study explores inferring 3D structure from images, highlighting advances in computer vision technology. The research aims to improve the accuracy and efficiency of 3D modeling, which could have significant applications in fields such as augmented reality and design.
The article discusses the development of DINOv3, a self-supervised vision model that enhances understanding of visual data without the need for labeled datasets. It elaborates on its architecture, training methods, and potential applications in various fields, showcasing improvements over previous iterations in accuracy and efficiency.
LSNet is a new family of lightweight vision models that leverage a "See Large, Focus Small" strategy, inspired by the human visual system, to improve efficiency and performance in various vision tasks. Utilizing LS convolution, which combines large-kernel perception with small-kernel aggregation, LSNet outperforms existing lightweight networks while maintaining computational efficiency. The models were trained on ImageNet-1K, with throughput measured on an NVIDIA RTX 3090.
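The "See Large, Focus Small" pairing can be approximated with a large-kernel depthwise convolution for cheap context followed by a small-kernel convolution for local aggregation. A hedged sketch; the paper's LS convolution is a more elaborate perception/aggregation design:

```python
# "See Large, Focus Small" sketch: a large-kernel depthwise conv gathers
# context, a small-kernel conv aggregates locally. Illustrative only; not
# the paper's exact LS convolution.
import torch
import torch.nn as nn

class LSBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # Large-kernel depthwise conv: wide receptive field, few parameters.
        self.see_large = nn.Conv2d(ch, ch, kernel_size=7, padding=3, groups=ch)
        # Small-kernel conv: fine-grained aggregation across channels.
        self.focus_small = nn.Conv2d(ch, ch, kernel_size=3, padding=1)
        self.act = nn.GELU()

    def forward(self, x):
        context = self.see_large(x)          # cheap large-field perception
        return x + self.act(self.focus_small(context))  # residual aggregation

block = LSBlock(32)
print(block(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 32, 56, 56])
```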
InteractVLM is a new method for estimating 3D contact points on human bodies and objects from single images, addressing challenges like occlusions and depth ambiguities. It combines Vision-Language Models and a Render-Localize-Lift module to enhance 3D reconstruction and introduces a Semantic Human Contact estimation task for improved interaction modeling. The approach outperforms existing methods and remains scalable because it requires only limited 3D contact data.
CUPS is a novel Scene-Centric Unsupervised Panoptic Segmentation method that utilizes motion and depth from stereo pairs to create high-resolution pseudo-labels for training a monocular panoptic network. This approach allows for the effective segmentation of complex scenes without the need for annotated data, achieving superior performance compared to existing unsupervised methods, particularly on benchmarks like Cityscapes. CUPS demonstrates strong generalization capabilities across multiple datasets while significantly enhancing panoptic quality metrics.
The article discusses advancements in image segmentation techniques, particularly focusing on the Gemini model and its implications for various applications in computer vision. It highlights the improvements in accuracy and efficiency over previous models, as well as the potential for broader use in sectors such as healthcare and autonomous vehicles.
The article discusses advancements in computer vision technology, focusing on its applications in various industries, such as healthcare and automotive. It highlights the importance of machine learning and artificial intelligence in enhancing the accuracy and efficiency of visual recognition systems. The potential future developments in this field are also explored, emphasizing the transformative impact on society.
Personalized image synthesis through text-to-image generation is explored using auto-regressive models, which have been less studied compared to diffusion models. The paper presents a two-stage training strategy that optimizes text embeddings and fine-tunes transformer layers, demonstrating that auto-regressive models can achieve comparable fidelity and prompt adherence to existing methods. This research opens new avenues for improving personalized image generation techniques.
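The two-stage recipe amounts to a freezing schedule: stage one optimizes only the embedding of the new concept token, stage two unfreezes the transformer layers. A sketch of that pattern with a toy model; all names here are placeholders:

```python
# Sketch of the two-stage recipe: first optimize only the concept embedding,
# then fine-tune the transformer layers. The model is a toy placeholder.
import torch
import torch.nn as nn

class ToyARModel(nn.Module):
    def __init__(self, vocab=1000, d=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, vocab)

    def forward(self, tokens):
        return self.head(self.layers(self.embed(tokens)))

model = ToyARModel()
concept_id = 999  # token slot reserved for the new personal concept

# Stage 1: freeze everything, learn only the text embedding table (in
# practice you would zero gradients for every row except concept_id).
for p in model.parameters():
    p.requires_grad = False
model.embed.weight.requires_grad = True
stage1_opt = torch.optim.Adam([model.embed.weight], lr=1e-3)

# Stage 2: unfreeze the transformer layers and fine-tune them jointly.
for p in model.layers.parameters():
    p.requires_grad = True
stage2_opt = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-5
)
# (training loops over personalization images omitted)
```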
InternVL3.5 introduces a new family of open-source multimodal models that enhance versatility, reasoning capabilities, and inference efficiency. A key innovation is the Cascade Reinforcement Learning framework, which improves reasoning tasks significantly while a Visual Resolution Router optimizes visual token resolution. The model achieves notable performance gains and supports advanced capabilities like GUI interaction and embodied agency, positioning it competitively against leading commercial models.
The Low-to-high Multi-Level Transformer (LMLT) introduces a novel approach for image super-resolution that reduces the complexity and inference time associated with existing Vision Transformer models. By employing attention mechanisms with varying feature sizes and integrating results from lower heads into higher heads, LMLT effectively captures both local and global information, mitigating issues related to window boundaries in self-attention. Experimental results indicate that LMLT outperforms state-of-the-art methods while significantly reducing GPU memory usage.
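The low-to-high flow can be sketched by running self-attention on a downscaled feature map first, then feeding its upsampled output into attention at the finer scale. This illustrates the multi-level idea only, not LMLT's exact block design:

```python
# Low-to-high sketch: cheap attention on a coarse feature map feeds its
# upsampled result into attention at the finer scale. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

attn = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)

def attend(feat):
    # feat: (B, C, H, W) -> same shape after spatial self-attention
    B, C, H, W = feat.shape
    seq = feat.flatten(2).transpose(1, 2)        # (B, H*W, C)
    out, _ = attn(seq, seq, seq)
    return out.transpose(1, 2).reshape(B, C, H, W)

x = torch.randn(1, 32, 32, 32)
low = attend(F.avg_pool2d(x, 2))                 # coarse level: global context
low_up = F.interpolate(low, scale_factor=2, mode="nearest")
high = attend(x + low_up)                        # fine level sees coarse result
print(high.shape)  # torch.Size([1, 32, 32, 32])
```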
3D CoCa is a unified framework for 3D captioning that integrates contrastive vision-language learning with 3D caption generation. By leveraging a frozen CLIP backbone and a spatially-aware 3D scene encoder, it jointly optimizes contrastive and captioning objectives in a shared feature space, leading to improved spatial reasoning and semantic grounding. Extensive experiments show that 3D CoCa surpasses existing methods, achieving significant performance gains on benchmark datasets.
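The joint objective is the sum of a CLIP-style contrastive loss and a captioning cross-entropy computed over the same feature space. A minimal sketch with random placeholder features, not 3D CoCa's frozen CLIP backbone or 3D scene encoder:

```python
# Minimal joint objective sketch: a contrastive alignment loss plus a
# captioning cross-entropy, summed CoCa-style. All features are random
# placeholders rather than real encoder outputs.
import torch
import torch.nn.functional as F

B, D, T, V = 4, 64, 12, 1000
scene_feats = F.normalize(torch.randn(B, D), dim=-1)    # 3D scene embeddings
text_feats = F.normalize(torch.randn(B, D), dim=-1)     # caption embeddings
caption_logits = torch.randn(B, T, V)                   # decoder outputs
caption_tokens = torch.randint(0, V, (B, T))            # ground-truth tokens

# Contrastive: matched scene/caption pairs sit on the diagonal.
logits = scene_feats @ text_feats.t() / 0.07            # temperature 0.07
labels = torch.arange(B)
contrastive = (F.cross_entropy(logits, labels) +
               F.cross_entropy(logits.t(), labels)) / 2

# Captioning: standard next-token cross-entropy.
captioning = F.cross_entropy(caption_logits.reshape(B * T, V),
                             caption_tokens.reshape(B * T))

loss = contrastive + captioning  # jointly optimized in a shared space
print(f"contrastive={contrastive:.3f} captioning={captioning:.3f}")
```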
MindJourney is a new research framework that enables AI agents to explore simulated 3D environments, improving their spatial interpretation capabilities. By using a world model and a spatial beam search algorithm, MindJourney allows AI to generate multiple perspectives of a scene, enhancing its ability to answer spatial questions without additional training. This approach significantly boosts the performance of vision-language models, suggesting potential applications in robotics and smart technologies.
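Spatial beam search can be sketched as ordinary beam search where the successor function is a world model imagining a camera move and the score is the VLM's confidence in its answer. Both components below are placeholders for the paper's learned models:

```python
# Spatial beam-search sketch: imagine candidate camera moves with a world
# model, keep the k most promising views per step. world_model and
# score_view stand in for the paper's learned components.
import heapq
import random

ACTIONS = ["forward", "left", "right"]

def world_model(view, action):
    return f"{view}->{action}"   # stand-in: a real model renders a new view

def score_view(view):
    return random.random()       # stand-in: the VLM's answer confidence

def spatial_beam_search(start_view, steps=3, beam=2):
    beams = [(score_view(start_view), start_view)]
    for _ in range(steps):
        candidates = [
            (score_view(new := world_model(v, a)), new)
            for _, v in beams for a in ACTIONS
        ]
        beams = heapq.nlargest(beam, candidates)  # keep top-k trajectories
    return beams

for score, view in spatial_beam_search("scene"):
    print(f"{score:.2f}  {view}")
```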