NVIDIA
NVIDIA Releases Nemotron 3 Super Open Model; Achieves 5x Throughput Gain for Agentic AI
· release · model · open-source · feature · performance · developer.nvidia.com ↗

Overview

NVIDIA has released Nemotron 3 Super, an open-source large language model designed specifically for multi-agent AI systems. The Mixture-of-Experts model has 120B total parameters but activates only 12B per token at inference time, balancing capability against efficiency for complex reasoning tasks.
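The 120B-total / 12B-active split implies that only a small fraction of the weights participate in any single forward pass. A quick back-of-the-envelope check (the parameter counts are from the announcement; the arithmetic is just illustrative):

```python
# Sparsity ratio implied by the announced figures:
# 120B total parameters, 12B active per token.
total_params = 120e9
active_params = 12e9

active_fraction = active_params / total_params
print(f"{active_fraction:.0%} of parameters active per token")  # 10%
```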

Key Architectural Innovations

Nemotron 3 Super introduces several architectural advances to address limitations in agentic AI:

  • Latent MoE (Mixture of Experts): Compresses tokens into a smaller latent space before routing, allowing 4x as many expert specialists at the same inference cost
  • Multi-Token Prediction (MTP): Predicts multiple future tokens in a single forward pass, reducing generation time for long sequences and enabling built-in speculative decoding
  • Hybrid Mamba-Transformer Backbone: Integrates Mamba-2 layers for efficient sequence processing with Transformer attention layers for precision reasoning, delivering 4x improved memory and compute efficiency
  • Native NVFP4 Pretraining: Optimized for NVIDIA Blackwell hardware, achieving 4x faster inference on B200 compared to FP8 on H100 while maintaining accuracy
  • Multi-Environment Reinforcement Learning: Post-trained with RL across 21 environment configurations using NVIDIA NeMo tools, with over 1.2 million environment rollouts
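To make the latent-MoE idea concrete, here is a toy routing sketch. It is purely illustrative, not NVIDIA's implementation: the dimensions, the compression matrix, and the top-k routing rule are all assumptions. The point is that scoring experts against a compressed latent vector is cheap, so the expert pool can grow without growing per-token cost.

```python
import random

random.seed(0)

D_MODEL = 16     # hypothetical hidden size of a token
D_LATENT = 4     # compressed latent size seen by the router
N_EXPERTS = 32   # a larger expert pool, affordable because routing is cheap
TOP_K = 2        # experts actually activated per token

def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

compress = rand_matrix(D_LATENT, D_MODEL)  # projects a token into the latent space
router = rand_matrix(N_EXPERTS, D_LATENT)  # scores every expert from the latent

def route(token_vec):
    latent = matvec(compress, token_vec)   # compression happens before routing
    scores = matvec(router, latent)        # scoring 32 experts on a 4-dim latent
    top = sorted(range(N_EXPERTS), key=lambda i: scores[i], reverse=True)[:TOP_K]
    return latent, top

token = [random.uniform(-1, 1) for _ in range(D_MODEL)]
latent, experts = route(token)
print(len(latent), len(experts))  # compressed dim, number of active experts
```

Only `TOP_K` experts run per token, so compute tracks the active count while capacity tracks the total pool.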

Performance and Capabilities

The model addresses two critical challenges in agentic AI: the "thinking tax" (expensive sub-task reasoning) and "context explosion" (rapid token growth as agents exchange context in multi-agent systems). Nemotron 3 Super delivers over 5x the throughput of the previous Nemotron Super while maintaining accuracy on dense technical problems.

On PinchBench—a new benchmark for evaluating LLMs as the brain of autonomous agents—Nemotron 3 Super scores 85.6% across the full test suite, making it the best-performing open model in its category. The native 1M-token context window provides agents with long-term memory for sustained, aligned reasoning across complex tasks.

Open Access and Deployment

The model is fully open-source with open weights, datasets, and recipes. Developers can download the model from Hugging Face (nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8), customize it for their specific applications, and deploy it on their own infrastructure without licensing restrictions.
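Self-hosted deployments of open models are commonly exposed through an OpenAI-compatible chat endpoint, so a minimal client interaction reduces to building a JSON request. The sketch below is an assumption about the serving setup, not part of the announcement; only the model ID comes from the post, and the endpoint URL in the comment is hypothetical.

```python
import json

# Model ID as published on Hugging Face (from the announcement).
MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8"

# Illustrative chat-completions payload for a self-hosted,
# OpenAI-compatible server (serving stack and URL are assumptions).
payload = {
    "model": MODEL_ID,
    "messages": [
        {"role": "user", "content": "Summarize this incident report in one sentence."},
    ],
    "max_tokens": 128,
}

body = json.dumps(payload)
# e.g. POST this body to http://localhost:8000/v1/chat/completions (hypothetical)
print(body[:40])
```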