Addressing Multi-Agent Scaling Challenges
NVIDIA has introduced Nemotron 3 Super to solve fundamental efficiency and accuracy challenges in agentic AI systems. Because each turn re-sends conversation history, tool outputs, and reasoning steps, multi-agent systems generate up to 15x more tokens than standard chat interactions. Over extended tasks, this "context explosion" causes goal drift, in which agents gradually lose alignment with their original objectives. The model is designed to handle these demands while remaining practical to deploy at scale.
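To see why re-sending history inflates token counts so quickly, here is a toy back-of-envelope sketch. The per-turn token count and turn count are illustrative assumptions, not NVIDIA's measurements; the 15x figure above is NVIDIA's own claim.

```python
# Toy model of "context explosion": when every turn re-sends the full prior
# history, cumulative prompt tokens grow roughly quadratically with turn
# count, while a plain chat only pays for each new turn once.

def multi_agent_prompt_tokens(turns: int, tokens_per_turn: int = 500) -> int:
    """Total prompt tokens when each turn re-sends all prior history."""
    total = 0
    history = 0
    for _ in range(turns):
        history += tokens_per_turn  # new tool output / reasoning step appended
        total += history            # the entire history is re-sent this turn
    return total

def single_chat_tokens(turns: int, tokens_per_turn: int = 500) -> int:
    """A single chat pays for each turn's tokens only once."""
    return turns * tokens_per_turn

ratio = multi_agent_prompt_tokens(20) / single_chat_tokens(20)
print(f"{ratio:.1f}x more prompt tokens over 20 turns")  # prints "10.5x ..."
```

Even at a modest 20 turns the re-sent history dominates, and the multiplier keeps growing with task length, which is why long-running agents hit both cost and drift problems.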
Key Architectural Innovations
Nemotron 3 Super introduces several technical innovations that differentiate it from standard large language models:
- Hybrid Mamba-Transformer backbone: Combines Mamba layers for linear-time sequence processing with interleaved Transformer attention layers for precise fact retrieval, delivering 4x improved memory and compute efficiency
- Latent Mixture-of-Experts (MoE): Activates 4x as many expert specialists for the same inference cost by compressing tokens before they reach the experts
- Multi-token prediction (MTP): Predicts multiple future tokens in a single forward pass, reducing generation time and enabling built-in speculative decoding
- Native NVFP4 pretraining: Optimized for NVIDIA Blackwell hardware, cutting memory requirements and speeding up inference by 4x on B200 versus FP8 on H100
- Multi-environment reinforcement learning: Post-trained across 21 environment configurations with over 1.2 million environment rollouts
Performance and Availability
The model demonstrates strong performance on agentic reasoning tasks, scoring 85.6% on PinchBench, a benchmark for evaluating LLM performance as the brain of autonomous agents, making it the best open model in its class. With 120B total parameters but only 12B active per token, Nemotron 3 Super delivers compute efficiency critical for long-running deployments while maintaining the depth needed for complex reasoning in software development, cybersecurity, and other technical domains.
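The 120B-total / 12B-active split means each token touches only about 10% of the weights, which is what sparse expert routing buys. A minimal sketch of top-k gating, with illustrative expert counts that are not Nemotron's actual layout:

```python
import math

# With 120B total but 12B active parameters, each token exercises ~10% of
# the model. A top-k router achieves this: it scores all experts but runs
# only the k best for each token.
total_params, active_params = 120e9, 12e9
print(f"active fraction per token: {active_params / total_params:.0%}")  # 10%

def topk_route(logits: list, k: int = 2) -> dict:
    """Pick the k highest-scoring experts and renormalize their gate weights."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in idx]
    z = sum(exps)
    return {i: e / z for i, e in zip(idx, exps)}

# 5 experts scored for one token; only the top 2 actually execute.
gates = topk_route([0.1, 2.0, -1.0, 1.5, 0.3], k=2)
assert abs(sum(gates.values()) - 1.0) < 1e-9  # gate weights sum to 1
```

The latent-MoE twist described above goes further by compressing tokens before they reach the experts, so more experts can run within the same inference budget.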
The model is fully open with open weights, datasets, and recipes available on Hugging Face, allowing developers to customize, optimize, and deploy on their own infrastructure. NVIDIA provides tutorials and integration guides for tools like Perplexity and OpenCode to help developers get started immediately.