4 min read | Saved February 14, 2026
Do you care about this?
NVIDIA has released the Nemotron ColEmbed V2 models, designed for efficient multimodal document retrieval. These models use a late-interaction embedding approach to improve accuracy when handling text, images, and structured visual data. They top the ViDoRe V3 benchmark, making them suitable for applications like multimedia search engines and conversational AI.
If you do, here's more
NVIDIA has launched Nemotron ColEmbed V2, a suite of late-interaction multimodal embedding models aimed at improving retrieval over documents that combine text and images. The models come in three sizes: 3B, 4B, and 8B parameters. They excel at retrieving information from complex document types, achieving top rankings on the ViDoRe V3 benchmark: the 8B model scores 63.42 NDCG@10, the 4B 61.54, and the 3B 59.79. This performance positions them as state-of-the-art in their respective size categories.
The late-interaction mechanism allows a more nuanced comparison between query and document tokens. Each query token's embedding is compared against every document token embedding; for each query token the maximum similarity is kept, and these maxima are summed to produce the final relevance score. This method improves accuracy but requires significant storage, since every token's embedding must be retained. Unlike earlier single-vector models, the ColEmbed V2 models output multi-vector embeddings, enhancing their capability for applications like multimedia search engines and conversational AI.
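The scoring step described above (often called MaxSim in the late-interaction literature) can be sketched in a few lines. This is an illustrative implementation, not NVIDIA's code; the function name and toy dimensions are made up for the example.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction (MaxSim) relevance score.

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings
    doc_emb:   (num_doc_tokens, dim)  L2-normalized token embeddings
    """
    # Cosine similarity between every query token and every document token.
    sim = query_emb @ doc_emb.T                  # shape (Q, D)
    # Keep each query token's best-matching document token, then sum.
    return float(sim.max(axis=1).sum())

# Toy example: 2 query tokens, 3 document tokens, 8-dim embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(2, 8)); q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(3, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
score = maxsim_score(q, d)
```

Note why storage grows: a single-vector model stores one embedding per document, while this scheme stores one per token, so index size scales with document length.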
The training process used contrastive learning over both text-only and text-image pairs, pulling relevant query-document pairs together while pushing unrelated ones apart. The models also underwent model merging and were trained on a diverse array of synthetic multilingual data to improve performance across languages and document types. For those interested in multimodal retrieval, the models are available on Hugging Face along with example notebooks to get started.
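An in-batch contrastive objective of the kind described above can be sketched as follows. This is a minimal illustration assuming an InfoNCE-style loss over late-interaction scores, with other documents in the batch serving as negatives; it is not NVIDIA's actual training recipe, and all names and hyperparameters here are hypothetical.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(query_embs: torch.Tensor,
                     doc_embs: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive (InfoNCE) loss over late-interaction scores.

    query_embs: (B, Q, dim) token embeddings for B queries
    doc_embs:   (B, D, dim) token embeddings for their B positive documents
    Pair (i, i) is the positive; every other document in the batch is a negative.
    """
    q = F.normalize(query_embs, dim=-1)
    d = F.normalize(doc_embs, dim=-1)
    # sim[i, j, tq, td]: similarity of query i's token tq with doc j's token td.
    sim = torch.einsum("iqe,jde->ijqd", q, d)
    # MaxSim score of every query against every document -> (B, B) matrix.
    scores = sim.max(dim=-1).values.sum(dim=-1)
    # Maximize the score of each query's own document relative to the rest.
    labels = torch.arange(scores.size(0))
    return F.cross_entropy(scores / temperature, labels)

# Toy batch: 4 query/document pairs with random embeddings.
loss = contrastive_loss(torch.randn(4, 5, 16), torch.randn(4, 7, 16))
```

In a real pipeline the embeddings would come from the model's text and image encoders rather than random tensors, and hard negatives would typically be mined in addition to in-batch ones.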