Nemotron 3 Super Now Available in SGLang
SGLang has added day-0 support for NVIDIA's Nemotron 3 Super, an open-weights large language model designed specifically for building scalable multi-agent AI systems. The model is now ready for inference serving and deployment on popular GPU architectures.
Key Technical Specifications
Nemotron 3 Super is a 120B-parameter mixture-of-experts (MoE) model with a hybrid Transformer-Mamba architecture. Despite its large parameter count, it activates only 12B parameters per forward pass, delivering leading accuracy at a fraction of the computational cost. Notable features include:
- 1M-token context window for maintaining full conversation history and plan state across long multi-agent workflows
- Up to 5x higher throughput compared to the previous Nemotron Super model (Llama Nemotron Super 1.5)
- 2x higher accuracy on Artificial Analysis Intelligence Index benchmarks
- Multi-token prediction for faster long-form text generation
- Thinking budget support for optimizing accuracy with minimal reasoning token overhead
- Latent MoE design that enables calling 4 experts for the inference cost of one
The model supports multiple quantization formats (BF16, FP8, NVFP4) and runs on B200, H100, H200, DGX Spark, and RTX 6000 GPUs.
Getting Started with SGLang
Developers can immediately begin serving Nemotron 3 Super using SGLang's launch server command. Installation requires:
```shell
pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python'
```
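Once installed, a server launch for a 4xH200 node might look like the sketch below. The Hugging Face repo name is an assumption, and the exact tool-call and reasoning parser values should be taken from the SGLang cookbook rather than guessed:

```shell
# Sketch of a launch command for a 4xH200 node.
# The model path below is a hypothetical repo name -- check the model card
# and SGLang cookbook for the exact path and the correct
# --tool-call-parser / --reasoning-parser values for Nemotron 3 Super.
python -m sglang.launch_server \
  --model-path nvidia/Nemotron-3-Super \
  --tp 4 \
  --host 0.0.0.0 \
  --port 30000
```

The `--tp 4` flag shards the model across the four GPUs with tensor parallelism; the server then listens on port 30000 by default convention.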
The model can then be served on multi-GPU setups (example provided for 4xH200) with built-in support for tool calling and reasoning parsers. SGLang exposes the model via an OpenAI-compatible API, allowing existing applications to integrate with minimal changes.
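Because the endpoint is OpenAI-compatible, existing client code needs only a new base URL. A minimal sketch, assuming a server is already running locally on port 30000 and using a hypothetical model name:

```python
# Minimal client sketch against a local SGLang server (OpenAI-compatible API).
# Assumes the `openai` package is installed and a server is already running;
# the model name is a placeholder -- use the name the server actually reports.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="nvidia/Nemotron-3-Super",  # hypothetical repo name
    messages=[{"role": "user", "content": "Summarize the plan so far."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

Since SGLang accepts any API key string for a local deployment, `api_key="EMPTY"` is a common placeholder; no code changes beyond the `base_url` are needed to point an existing OpenAI-based application at the local server.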
Why Nemotron 3 Super for Multi-Agent Systems
The model is purpose-built for orchestrating multiple collaborating agents. Its 1M-token context window lets agents maintain full conversation history and plan state without fragmentation, reducing goal drift in multi-step workflows. Use cases include code generation and debugging, research summarization, alert triage, and document analysis. Because the model is fully open, with published datasets and training recipes, developers can fine-tune it and deploy it on their own infrastructure for maximum privacy and control.
Getting started: download the weights from Hugging Face, follow the SGLang cookbook, or use the NVIDIA Brev launchable for one-click deployment. The technical report provides guidance for building custom optimized variants.