New Multimodal Embedding Models
NVIDIA introduces the Nemotron ColEmbed V2 family, a set of late-interaction embedding models available in three sizes (3B, 4B, and 8B parameters). These models are specifically designed for accurate multimodal retrieval across heterogeneous documents containing text, images, tables, charts, and other visual components.
Architecture and Approach
Unlike single-vector embedding approaches, Nemotron ColEmbed V2 adopts a late-interaction architecture inspired by ColBERT. This enables fine-grained token-level interactions between queries and documents:
- Each query token embedding is compared against every document token embedding using the MaxSim operator
- For each query token, the operator selects the maximum similarity over all document tokens, then sums these maxima into a final relevance score
- Models use bi-directional self-attention (instead of causal) for richer representation learning
- Supports both textual and visual token interactions
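The MaxSim scoring described above can be sketched in a few lines. This is a minimal illustration using NumPy with random stand-in embeddings; the function name and shapes are assumptions for the sketch, not part of the released models:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT-style late-interaction relevance score (MaxSim).

    query_emb: (num_query_tokens, dim) token embeddings for the query
    doc_emb:   (num_doc_tokens, dim) token embeddings for the document
    """
    # Pairwise similarities: one row per query token, one column per doc token
    sims = query_emb @ doc_emb.T  # shape: (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token, then sum
    return float(sims.max(axis=1).sum())

# Toy example: 2 query tokens, 3 document tokens, 4-dim embeddings
rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4))
d = rng.standard_normal((3, 4))
print(maxsim_score(q, d))
```

In practice the token embeddings would come from the model's text and vision towers; the same operator applies whether the document tokens are textual or visual.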
Performance and Benchmarks
The models achieve state-of-the-art results on the ViDoRe V3 benchmark for visual document retrieval:
- nemotron-colembed-vl-8b-v2: Ranks #1 with 63.42 NDCG@10 (8.8B parameters)
- nemotron-colembed-vl-4b-v2: Ranks #3 with 61.54 NDCG@10 (4.8B parameters)
- llama-nemotron-colembed-vl-3b-v2: Ranks #6 with 59.79 NDCG@10 (4.4B parameters)
Use Cases and Distinction
These models are intended for researchers and enterprises prioritizing accuracy in visual document retrieval applications, distinguishing them from NVIDIA's earlier 1B single-vector model optimized for efficiency and storage. Key applications include:
- Multimodal RAG systems with textual queries retrieving document images
- Multimedia search engines and cross-modal retrieval
- Conversational AI with rich input understanding
- Enterprise document processing systems
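The retrieval step in a multimodal RAG pipeline like the one above can be sketched by ranking candidate document pages with the MaxSim score. This is illustrative only: the embeddings are random stand-ins for model outputs, and the function names are hypothetical:

```python
import numpy as np

def rank_pages(query_emb: np.ndarray, page_embs: list) -> list:
    """Rank candidate document-page embeddings by ColBERT-style MaxSim.

    query_emb: (num_query_tokens, dim) token embeddings for the query
    page_embs: list of (num_page_tokens, dim) arrays, one per page
    Returns page indices sorted from most to least relevant.
    """
    # MaxSim score per page: best doc-token match for each query token, summed
    scores = [float((query_emb @ p.T).max(axis=1).sum()) for p in page_embs]
    return sorted(range(len(page_embs)), key=lambda i: scores[i], reverse=True)

# Toy corpus: three "pages" with random token embeddings of varying length
rng = np.random.default_rng(1)
query = rng.standard_normal((4, 8))
pages = [rng.standard_normal((int(n), 8)) for n in rng.integers(5, 12, size=3)]
print(rank_pages(query, pages))
```

The top-ranked page images (or their text) would then be passed to a generator model as retrieved context.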
All three models are now available on Hugging Face.