Overview
NVIDIA introduces Nemotron 3 Super, a new open-source large language model engineered for agentic AI applications that require sustained, long-running autonomous reasoning. The model has 120B total parameters but activates only 12B per token, and it targets critical limitations that existing models hit when deployed as agent brains in multi-agent systems.
Key Architectural Innovations
Nemotron 3 Super introduces several novel architectural techniques:
- Hybrid Mamba-Transformer backbone: Combines Mamba state-space model layers for sequence efficiency with interleaved Transformer attention layers for precision reasoning, achieving 4x improved memory and compute efficiency.
- Latent Mixture-of-Experts (MoE): Compresses tokens before routing to experts, enabling 4x more expert specialization without increasing inference cost.
- Multi-token prediction (MTP): Predicts multiple future tokens in a single forward pass, dramatically reducing generation time and enabling built-in speculative decoding.
- Native NVFP4 pretraining: Optimized for NVIDIA Blackwell hardware, cutting memory requirements and achieving 4x inference speedup on B200 compared to FP8 on H100.
- Multi-environment RL post-training: Enhanced with reinforcement learning across 21 environment configurations using NVIDIA NeMo Gym and NeMo RL, trained on more than 1.2 million environment rollouts.
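Multi-token prediction pairs naturally with speculative decoding: the model drafts several future tokens in one forward pass, and a verification step keeps only the prefix the full model agrees with, falling back to normal one-token decoding on a mismatch. Below is a minimal, self-contained sketch of that accept/verify loop using deterministic toy stand-ins for both models (`full_model_next` and `draft_k_tokens` are made-up placeholders for illustration, not the Nemotron API):

```python
# Toy sketch of speculative decoding driven by multi-token prediction (MTP).
# Both "models" are deterministic stand-ins: the draft head proposes k
# tokens at once; the (notionally slower) full model checks them one by one.

def full_model_next(context):
    """Stand-in for the full model: next token = sum of context mod 101."""
    return sum(context) % 101

def draft_k_tokens(context, k):
    """Stand-in MTP head: predicts k future tokens in one 'forward pass'.
    Intentionally imperfect (it diverges at step 2) so some drafts fail."""
    out = []
    ctx = list(context)
    for i in range(k):
        guess = (sum(ctx) + (7 if i == 2 else 0)) % 101
        out.append(guess)
        ctx.append(guess)
    return out

def speculative_generate(context, n_tokens, k=4):
    """Generate n_tokens: accept the longest drafted prefix the full
    model agrees with, then take one token from the full model."""
    ctx = list(context)
    produced = 0
    while produced < n_tokens:
        draft = draft_k_tokens(ctx, k)
        accepted = 0
        for tok in draft:
            if produced >= n_tokens:
                break
            if tok == full_model_next(ctx):  # verification step
                ctx.append(tok)
                produced += 1
                accepted += 1
            else:
                break
        if accepted < k and produced < n_tokens:
            # Draft rejected early: fall back to one full-model token.
            ctx.append(full_model_next(ctx))
            produced += 1
    return ctx[len(context):]

print(speculative_generate([1, 2, 3], 6))
```

Because drafted tokens are only kept when they match the full model's choice, the output is identical to plain greedy decoding; the speedup comes from verifying several tokens per full-model pass instead of generating one.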
Performance and Capabilities
The model delivers more than 5x the throughput of the previous Nemotron Super and features a native 1M-token context window, which is critical for multi-agent systems that can generate up to 15x the tokens of a standard chat through history re-sending, tool outputs, and intermediate reasoning steps. The extended context helps keep agents aligned with their objectives and prevents "goal drift" during long-running tasks.
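One way to see how an agent loop can reach a multiple like 15x is simple accounting: if every turn re-sends the accumulated history plus tool outputs and reasoning traces, total token consumption grows quadratically with turn count. The sketch below is a rough, illustrative calculation; all per-turn token counts are made-up assumptions, not measured Nemotron numbers:

```python
# Back-of-the-envelope accounting for multi-agent token consumption.
# All per-turn token counts below are illustrative assumptions.

def chat_tokens(turns, user=50, reply=200):
    """A plain chat: each turn processes one user message and one reply."""
    return turns * (user + reply)

def agent_tokens(turns, user=50, reply=200, tool_output=300, reasoning=400):
    """An agent loop that re-sends the whole accumulated history every
    turn, on top of tool outputs and intermediate reasoning."""
    total = 0
    history = 0
    for _ in range(turns):
        step = user + reply + tool_output + reasoning
        total += history + step   # history is re-sent, then the new step
        history += step
    return total

turns = 7
ratio = agent_tokens(turns) / chat_tokens(turns)
print(f"agent/chat token ratio over {turns} turns: {ratio:.1f}x")
```

With these assumed numbers the ratio already exceeds 15x after only seven turns, and because the re-sent history term grows linearly per turn, longer-running tasks push it far higher, which is where a 1M-token window matters.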
On PinchBench, a new benchmark for LLM-powered agent performance, Nemotron 3 Super achieves 85.6% accuracy, ranking as the best open model in its class for agentic reasoning, coding, and cybersecurity triage.
Availability and Customization
The model is fully open: weights, datasets, and training recipes are all released, enabling developers to customize, optimize, and deploy it on their own infrastructure. Weights are available on Hugging Face, and NVIDIA provides tutorial content and integration documentation.
Developer Action Items
- Download the model weights from Hugging Face
- Review the architectural deep-dive and tutorial video on NVIDIA's developer blog
- Explore integration with NVIDIA NeMo frameworks for fine-tuning and optimization
- Test on agentic AI use cases involving long-context reasoning, software development, or cybersecurity workflows