LMSYS
SGLang adds day-one support for NVIDIA Nemotron 3 Super, a 120B-parameter mixture-of-experts model optimized for multi-agent systems
release · model · feature · open-source · SDK · lmsys.org

Nemotron 3 Super Now Available in SGLang

SGLang has added day-one support for NVIDIA's Nemotron 3 Super, an open-weights large language model designed specifically for building scalable multi-agent AI systems. The model is now ready for inference serving and deployment on popular GPU architectures.

Key Technical Specifications

Nemotron 3 Super is a 120B-parameter mixture-of-experts (MoE) model with a hybrid Transformer-Mamba architecture. Despite its large parameter count, it activates only 12B parameters per forward pass, delivering leading accuracy at a fraction of the computational cost. Notable features include:

  • 1M-token context window for maintaining full conversation history and plan state across long multi-agent workflows
  • Up to 5x higher throughput compared to the previous Nemotron Super model (Llama Nemotron Super 1.5)
  • 2x higher accuracy on Artificial Analysis Intelligence Index benchmarks
  • Multi-token prediction for faster long-form text generation
  • Thinking budget support for optimizing accuracy with minimal reasoning token overhead
  • Latent MoE design that enables calling 4 experts for the inference cost of one
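The sparse-activation idea behind these numbers can be illustrated with a toy top-k router: each token is scored against all experts, but only a few experts actually run. This is a generic MoE sketch for intuition, not Nemotron's actual routing code, and the expert counts here are illustrative, not the model's real configuration.

```python
import random

def route_token(num_experts: int = 16, top_k: int = 4) -> list[int]:
    """Pick the top_k experts for one token by router score (toy example)."""
    scores = [random.random() for _ in range(num_experts)]
    # Only the top_k highest-scoring experts execute for this token, so
    # per-token compute scales with top_k / num_experts of the expert
    # parameters -- the same principle that lets a 120B model activate
    # only ~12B parameters per forward pass.
    return sorted(range(num_experts), key=lambda i: scores[i], reverse=True)[:top_k]

active = route_token()
print(len(active))  # 4 experts active out of 16
```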

The model supports multiple quantization formats (BF16, FP8, NVFP4) and runs on B200, H100, H200, DGX Spark, and RTX 6000 GPUs.

Getting Started with SGLang

Developers can immediately begin serving Nemotron 3 Super with SGLang's `launch_server` command. First, install SGLang from source:

pip install 'git+https://github.com/sgl-project/sglang.git#subdirectory=python'

The model can then be served on multi-GPU setups (example provided for 4xH200) with built-in support for tool calling and reasoning parsers. SGLang exposes the model via an OpenAI-compatible API, allowing existing applications to integrate with minimal changes.
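A launch command along the following lines should work for a 4×H200 setup; the `--model-path` value below is a placeholder, and the exact tool-call and reasoning parser flags should be checked against the SGLang cookbook for this model.

```shell
# Launch an OpenAI-compatible server across 4 GPUs (tensor parallelism = 4).
# Replace the --model-path placeholder with the actual Hugging Face repo name.
python -m sglang.launch_server \
  --model-path nvidia/nemotron-3-super \
  --tp 4 \
  --host 0.0.0.0 \
  --port 30000
```

The server then accepts standard OpenAI-style requests at `http://localhost:30000/v1`, SGLang's default port.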

Why Nemotron 3 Super for Multi-Agent Systems

The model is purpose-built for orchestrating multiple collaborating agents. Its 1M-token context enables agents to maintain full conversation history and plan state without fragmentation, reducing goal drift in multi-step workflows. Use cases include code generation and debugging, research summarization, alert triage, and document analysis. As a fully open model with published datasets and training recipes, developers can fine-tune and deploy on their own infrastructure for maximum privacy and control.
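Because SGLang exposes an OpenAI-compatible API, an agent loop can talk to the model with any OpenAI-style client. The sketch below builds a chat completion request using only the standard library; the model identifier is an assumed placeholder, not a confirmed checkpoint name, and the endpoint assumes a local SGLang server on its default port.

```python
import json

MODEL = "nvidia/nemotron-3-super"        # assumption -- use the real HF repo name
BASE_URL = "http://localhost:30000/v1"   # SGLang's default serving port

# An OpenAI-style chat completion payload; any OpenAI-compatible client
# pointed at BASE_URL sends exactly this shape of request.
payload = {
    "model": MODEL,
    "messages": [
        {"role": "system", "content": "You are a planning agent."},
        {"role": "user", "content": "Summarize the open tasks in this repo."},
    ],
    "max_tokens": 256,
}

body = json.dumps(payload)
# To send it: POST body to f"{BASE_URL}/chat/completions"
# with header {"Content-Type": "application/json"}.
print(json.loads(body)["model"])
```

Because the wire format is unchanged, existing applications built on the OpenAI SDK only need their base URL and model name updated.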

Getting started: Download weights from Hugging Face, follow the SGLang cookbook, or use the NVIDIA Brev launchable for one-click deployment. The technical report provides guidance for building custom optimized variants.