← Back
NVIDIA releases Nemotron ColEmbed V2 multimodal models; 8B variant ranks #1 on ViDoRe V3 benchmark
· release · model · feature · huggingface.co ↗

New Multimodal Embedding Models

NVIDIA introduces the Nemotron ColEmbed V2 family, a set of late-interaction embedding models available in three sizes (3B, 4B, and 8B parameters). These models are designed for accurate multimodal retrieval across heterogeneous documents containing text, images, tables, charts, and other visual components.

Architecture and Approach

Unlike single-vector embedding approaches, Nemotron ColEmbed V2 adopts a late-interaction architecture inspired by ColBERT. This enables fine-grained token-level interactions between queries and documents:

  • Each query token embedding interacts with every document token embedding via the MaxSim operator
  • MaxSim takes the maximum similarity for each query token and sums these maxima into a final relevance score
  • The models use bi-directional self-attention (instead of causal attention) for richer representation learning
  • Both textual and visual token interactions are supported
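The MaxSim scoring described above can be sketched in a few lines of NumPy. This is an illustrative implementation of the general ColBERT-style operator, not NVIDIA's code; the embedding dimensions and token counts below are made up for the example.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """Late-interaction relevance score between one query and one document.

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings
    doc_emb:   (num_doc_tokens, dim)   L2-normalized token embeddings
    """
    # Cosine similarity of every query token against every document token.
    sim = query_emb @ doc_emb.T              # (num_query_tokens, num_doc_tokens)
    # MaxSim: best-matching document token per query token, summed.
    return float(sim.max(axis=1).sum())

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy example with random normalized embeddings (illustrative only).
rng = np.random.default_rng(0)
q = normalize(rng.normal(size=(4, 128)))     # 4 query token embeddings
d = normalize(rng.normal(size=(50, 128)))    # 50 document token embeddings
score = maxsim_score(q, d)
```

Because scoring happens per token pair rather than between two pooled vectors, documents must store one embedding per token, which is the storage cost these accuracy-oriented models accept relative to single-vector retrieval.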

Performance and Benchmarks

The models achieve state-of-the-art results on the ViDoRe V3 benchmark for visual document retrieval:

  • nemotron-colembed-vl-8b-v2: Ranks #1 with 63.42 NDCG@10 (8.8B parameters)
  • nemotron-colembed-vl-4b-v2: Ranks #3 with 61.54 NDCG@10 (4.8B parameters)
  • llama-nemotron-colembed-vl-3b-v2: Ranks #6 with 59.79 NDCG@10 (4.4B parameters)
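The leaderboard metric, NDCG@10, rewards rankings that place relevant documents near the top. A minimal sketch of the metric follows (linear-gain variant; the benchmark's exact implementation may differ, and the relevance lists are invented for illustration):

```python
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain: relevance discounted by log2 of rank.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize by the DCG of the ideal (descending-relevance) ranking,
    # so a perfect ranking scores exactly 1.0.
    ideal = sorted(ranked_relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / idcg if idcg > 0 else 0.0

# A perfect ranking scores 1.0; demoting relevant results lowers the score.
perfect = ndcg_at_k([3, 2, 1, 0, 0])
swapped = ndcg_at_k([0, 2, 3, 1, 0])
```

Averaged across queries, this yields the single number reported on the leaderboard, e.g. 63.42 for the 8B variant.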

Use Cases and Distinction

These models target researchers and enterprises that prioritize accuracy in visual document retrieval, in contrast to NVIDIA's earlier 1B single-vector model, which is optimized for efficiency and storage. Key applications include:

  • Multimodal RAG systems with textual queries retrieving document images
  • Multimedia search engines and cross-modal retrieval
  • Conversational AI with rich input understanding
  • Enterprise document processing systems

All three models are now available on Hugging Face.