3 links
tagged with all of: multimodal + computer-vision
Links
The paper presents BLIP3-o, a family of fully open unified multimodal models that enhance both image understanding and generation. It introduces a diffusion transformer for generating CLIP image features, advocates for a sequential pretraining strategy, and proposes a high-quality dataset, BLIP3o-60k, to improve performance across various benchmarks. The models, along with code and datasets, are open-sourced to foster further research.
InternVL3.5 introduces a new family of open-source multimodal models that enhance versatility, reasoning capabilities, and inference efficiency. A key innovation is the Cascade Reinforcement Learning framework, which significantly improves performance on reasoning tasks, while a Visual Resolution Router optimizes the resolution of visual tokens. The models achieve notable performance gains and support advanced capabilities such as GUI interaction and embodied agency, making them competitive with leading commercial models.
3D CoCa is a unified framework for 3D captioning that integrates contrastive vision-language learning with 3D caption generation. By leveraging a frozen CLIP backbone and a spatially-aware 3D scene encoder, it jointly optimizes contrastive and captioning objectives in a shared feature space, leading to improved spatial reasoning and semantic grounding. Extensive experiments show that 3D CoCa surpasses existing methods, achieving significant performance gains on benchmark datasets.
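As a rough illustration of what "jointly optimizes contrastive and captioning objectives" can mean, the sketch below adds a symmetric InfoNCE contrastive loss over paired scene/text embeddings to a token-level cross-entropy captioning loss. It is a minimal sketch under stated assumptions, not the 3D CoCa implementation: the function names, the temperature of 0.07, and the `caption_weight` parameter are all hypothetical, and plain Python lists stand in for encoder and decoder outputs.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def normalize(vec):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(scene_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of scene_emb is paired with row i of text_emb."""
    n = len(scene_emb)
    s = [normalize(v) for v in scene_emb]
    t = [normalize(v) for v in text_emb]
    # Cosine similarities scaled by the (assumed) temperature.
    logits = [[sum(a * b for a, b in zip(si, tj)) / temperature for tj in t]
              for si in s]
    scene_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    text_to_scene = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (scene_to_text + text_to_scene)


def caption_loss(token_logits, target_ids):
    """Mean cross-entropy between per-token logits and target token ids."""
    total = sum(log_softmax(logits)[tid]
                for logits, tid in zip(token_logits, target_ids))
    return -total / len(target_ids)


def joint_loss(scene_emb, text_emb, token_logits, target_ids,
               caption_weight=1.0):
    """Weighted sum of the two objectives; the weighting is an assumption."""
    return (contrastive_loss(scene_emb, text_emb)
            + caption_weight * caption_loss(token_logits, target_ids))
```

In this sketch both losses read from the same embeddings, which is one simple way to share a feature space between the two objectives; the paper's actual architecture and loss weighting may differ.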