4 links tagged with all of: deep-learning + computer-vision
Links
MaskMark is a novel framework for image watermarking with two variants: MaskMark-D, which supports both global and local watermark extraction, and MaskMark-ED, which improves robustness for watermarks confined to localized regions. It employs a masking mechanism during the decoding and encoding stages to improve extraction accuracy and adaptability while maintaining high visual quality. Experimental results show it outperforms existing models at a significantly lower computational cost.
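The masking idea from the summary can be pictured with a minimal sketch. Everything here is illustrative: `decoder` is a dummy stand-in for a trained extraction network, not the paper's actual model, and the bit-thresholding is invented for the toy example.

```python
import numpy as np

def masked_decode(image: np.ndarray, mask: np.ndarray, decoder) -> np.ndarray:
    # Zero out pixels outside the region of interest so the decoder
    # only sees the (possibly local) watermarked area.
    masked = image * mask[..., None]  # broadcast the 2-D mask over channels
    return decoder(masked)

# Toy usage: a dummy "decoder" that thresholds the channel mean into bits.
img = np.random.rand(8, 8, 3)
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0  # watermark assumed to live in this local patch
bits = masked_decode(img, mask, lambda x: (x.mean(axis=-1) > 0.5).astype(int))
```

The point of the sketch is only the masking step: restricting the decoder's input to a region lets the same extraction pipeline operate globally (all-ones mask) or locally (patch mask).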
The paper presents BLIP3-o, a family of fully open unified multimodal models that enhance both image understanding and generation. It introduces a diffusion transformer for generating CLIP image features, advocates for a sequential pretraining strategy, and proposes a high-quality dataset, BLIP3o-60k, to improve performance across various benchmarks. The models, along with code and datasets, are open-sourced to foster further research.
The article discusses the development of DINOv3, a self-supervised vision model that improves understanding of visual data without requiring labeled datasets. It covers the model's architecture, training methods, and potential applications across various fields, showcasing gains in accuracy and efficiency over previous iterations.
The Low-to-high Multi-Level Transformer (LMLT) introduces a novel approach for image super-resolution that reduces the complexity and inference time associated with existing Vision Transformer models. By employing attention mechanisms with varying feature sizes and integrating results from lower heads into higher heads, LMLT effectively captures both local and global information, mitigating issues related to window boundaries in self-attention. Experimental results indicate that LMLT outperforms state-of-the-art methods while significantly reducing GPU memory usage.
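The low-to-high flow described above can be sketched in a few lines. This is a hedged toy, not the paper's layers: strided slicing stands in for the multi-level feature downsampling, nearest-neighbour upsampling stands in for the integration of lower-head outputs into higher heads, and the real attention computation is omitted entirely.

```python
import numpy as np

def upsample2x(a: np.ndarray) -> np.ndarray:
    # Nearest-neighbour 2x upsampling via a Kronecker product.
    return np.kron(a, np.ones((2, 2)))

def lmlt_sketch(x: np.ndarray, levels: int = 3) -> np.ndarray:
    # Coarser views of the feature map for the lower heads
    # (assumes x's sides are divisible by 2**(levels-1)).
    pyramid = [x[:: 2 ** l, :: 2 ** l] for l in range(levels)]
    agg = pyramid[-1]                     # start from the lowest (coarsest) head
    for finer in reversed(pyramid[:-1]):  # fold each lower result into the next head
        agg = finer + upsample2x(agg)
    return agg

out = lmlt_sketch(np.ones((8, 8)))  # shape (8, 8)
```

The design point the sketch captures: lower heads work on small feature maps (cheap, global context), and their results are merged upward so the finest head sees both local detail and aggregated global information.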