New Multimodal Embedding Models
NVIDIA introduces the Nemotron ColEmbed V2 family, a set of late-interaction embedding models available in three sizes (3B, 4B, and 8B parameters). These models are specifically designed for accurate multimodal retrieval across heterogeneous documents containing text, images, tables, charts, and other visual components.
Architecture and Approach
Unlike single-vector embedding approaches, Nemotron ColEmbed V2 adopts a late-interaction architecture inspired by ColBERT. This enables fine-grained token-level interactions between queries and documents:
- Each query token embedding is compared against every document token embedding using the MaxSim operator
- For each query token, the operator selects the maximum similarity over all document tokens, then sums these maxima into a final relevance score
- Models use bi-directional self-attention (instead of causal) for richer representation learning
- Supports both textual and visual token interactions
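The MaxSim scoring described above can be sketched in a few lines. This is a minimal illustration using NumPy with random stand-in embeddings; the function name and shapes are assumptions for the sketch, not part of the released models:

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, doc_emb: np.ndarray) -> float:
    """ColBERT-style late-interaction relevance score (MaxSim).

    query_emb: (num_query_tokens, dim) token embeddings for the query
    doc_emb:   (num_doc_tokens, dim) token embeddings for the document
    """
    # Pairwise similarities: one row per query token, one column per doc token
    sims = query_emb @ doc_emb.T  # shape: (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token, then sum
    return float(sims.max(axis=1).sum())

# Toy example: 2 query tokens, 3 document tokens, 4-dim embeddings
rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4))
d = rng.standard_normal((3, 4))
print(maxsim_score(q, d))
```

In practice the token embeddings would come from the model's text and vision towers; the same operator applies whether the document tokens are textual or visual.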
Performance and Benchmarks
The models achieve state-of-the-art results on the ViDoRe V3 benchmark for visual document retrieval:
- nemotron-colembed-vl-8b-v2: Ranks #1 with 63.42 NDCG@10 (8.8B parameters)
- nemotron-colembed-vl-4b-v2: Ranks #3 with 61.54 NDCG@10 (4.8B parameters)
- llama-nemotron-colembed-vl-3b-v2: Ranks #6 with 59.79 NDCG@10 (4.4B parameters)
Use Cases and Distinction
These models are intended for researchers and enterprises prioritizing accuracy in visual document retrieval applications, distinguishing them from NVIDIA's earlier 1B single-vector model optimized for efficiency and storage. Key applications include:
- Multimodal RAG systems with textual queries retrieving document images
- Multimedia search engines and cross-modal retrieval
- Conversational AI with rich input understanding
- Enterprise document processing systems
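The retrieval step in a multimodal RAG pipeline like the one above can be sketched by ranking candidate document pages with the MaxSim score. This is illustrative only: the embeddings are random stand-ins for model outputs, and the function names are hypothetical:

```python
import numpy as np

def rank_pages(query_emb: np.ndarray, page_embs: list) -> list:
    """Rank candidate document-page embeddings by ColBERT-style MaxSim.

    query_emb: (num_query_tokens, dim) token embeddings for the query
    page_embs: list of (num_page_tokens, dim) arrays, one per page
    Returns page indices sorted from most to least relevant.
    """
    # MaxSim score per page: best doc-token match for each query token, summed
    scores = [float((query_emb @ p.T).max(axis=1).sum()) for p in page_embs]
    return sorted(range(len(page_embs)), key=lambda i: scores[i], reverse=True)

# Toy corpus: three "pages" with random token embeddings of varying length
rng = np.random.default_rng(1)
query = rng.standard_normal((4, 8))
pages = [rng.standard_normal((int(n), 8)) for n in rng.integers(5, 12, size=3)]
print(rank_pages(query, pages))
```

The top-ranked page images (or their text) would then be passed to a generator model as retrieved context.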
All three models are now available on Hugging Face.