1 link tagged with all of: computer-vision + deep-learning + open-source + multimodal + image-generation
Links
The paper presents BLIP3-o, a family of fully open unified multimodal models that handle both image understanding and image generation. It uses a diffusion transformer to generate CLIP image features rather than conventional VAE-based representations, advocates a sequential pretraining strategy (training first on image understanding, then on image generation), and introduces BLIP3o-60k, a high-quality instruction-tuning dataset for image generation. The models, code, and datasets are open-sourced to support further research.
multimodal ✓
image-generation ✓
computer-vision ✓
deep-learning ✓
open-source ✓