25x Performance Gain on GB300 NVL72
SGLang, in collaboration with NVIDIA, has unlocked significant inference performance improvements on the new GB300 NVL72 system. Running DeepSeek R1 in latency-constrained scenarios (50 tokens/second per user), the framework achieves up to 25x higher performance on GB300 NVL72 than on H200 GPUs. This builds substantially on prior results, which showed 4x gains for GB200 NVL72 over Hopper.
Hardware Enhancements
The GB300 NVL72 features Blackwell Ultra GPUs that build on the GB200 NVL72 foundation with three key improvements:
- 1.5x peak NVFP4 throughput: Enhanced Tensor Cores accelerate FP4 math operations critical for MoE expert layers
- 2x softmax throughput: Upgraded special function units double throughput on attention softmax operations
- 1.5x larger HBM3e capacity: Higher-capacity 12-Hi HBM3e stacks support larger models and batch sizes without CPU offload
These enhancements make GB300 NVL72 particularly well-suited for MoE model deployments that require low-latency all-to-all communication.
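To give a concrete feel for the FP4 math these Tensor Cores accelerate, here is a minimal, illustrative sketch of block-scaled E2M1 quantization in the spirit of NVFP4. The block size, scale handling, and rounding here are simplifying assumptions for illustration, not NVIDIA's hardware implementation:

```python
# Illustrative sketch (not the hardware implementation): NVFP4-style
# block quantization. A 4-bit E2M1 value can represent the magnitudes
# below; each small block of elements shares one scale factor, so a
# weight tensor shrinks roughly 4x versus FP16.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive magnitudes

def quantize_block(block):
    """Quantize one block to signed E2M1 codes plus a shared scale."""
    scale = max(abs(x) for x in block) / 6.0 or 1.0  # 6.0 is the max E2M1 magnitude
    codes = []
    for x in block:
        mag = abs(x) / scale
        # round to the nearest representable E2M1 magnitude
        q = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        codes.append(-q if x < 0 else q)
    return codes, scale

def dequantize_block(codes, scale):
    return [c * scale for c in codes]

block = [0.02, -1.3, 0.7, 2.9]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
```

The per-block scale is what keeps the tiny 4-bit range usable: each block is renormalized to its own maximum before rounding, so quantization error stays proportional to local magnitudes.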
Software Optimizations
SGLang implemented targeted optimizations across the inference stack:
- NVFP4 GEMM: Uses low-precision FP4 format for MoE experts and dense layers, reducing memory bandwidth pressure while doubling communication efficiency for token dispatch
- Computation-communication overlap: Replaces traditional two-batch overlapping with a single-batch strategy tuned to NVL72's higher interconnect bandwidth, allowing communication to overlap with computation
- NVIDIA Dynamo integration: Leverages Dynamo's distributed serving engine with KV-aware routing coupled to SGLang's HiCache radix tree for efficient prefill-decode disaggregation
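The overlap idea in the second item can be sketched at a toy level: while one thread performs the "communication" (a sleep standing in for NVLink all-to-all token dispatch), the main thread keeps computing, so total time approaches max(comm, compute) rather than their sum. This is an illustrative analogy, not SGLang's kernel-level implementation:

```python
import threading
import time

# Toy model of computation-communication overlap: the communication
# runs on a background thread while the main thread keeps computing.
def all_to_all_stub(result):
    time.sleep(0.05)  # stand-in for all-to-all token dispatch
    result["done"] = True

result = {"done": False}
comm = threading.Thread(target=all_to_all_stub, args=(result,))
comm.start()                                   # "communication" begins
partial = sum(i * i for i in range(200_000))   # overlapping "compute"
comm.join()                                    # wait for dispatch to finish
```

In the real system the overlap happens on the GPU (streams and fused kernels), but the scheduling principle is the same: never leave the compute units idle while tokens are in flight.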
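KV-aware routing can likewise be sketched in miniature: send each request to the worker whose KV cache already holds the longest prefix of the incoming prompt, so prefill work is reused. The caches below are modeled as plain lists of previously served token sequences; a production system such as SGLang's HiCache uses a radix tree for this lookup, and the function names here are hypothetical:

```python
# Illustrative sketch of KV-aware routing (names are hypothetical).
def longest_cached_prefix(prompt_tokens, cached_prompts):
    """Length of the longest shared prefix between the prompt and any cached sequence."""
    best = 0
    for cached in cached_prompts:
        n = 0
        for a, b in zip(prompt_tokens, cached):
            if a != b:
                break
            n += 1
        best = max(best, n)
    return best

def route(prompt_tokens, worker_caches):
    """Pick the worker index with the best KV-cache prefix hit."""
    return max(range(len(worker_caches)),
               key=lambda i: longest_cached_prefix(prompt_tokens, worker_caches[i]))

workers = [
    [[1, 2, 3, 4]],        # worker 0 previously served tokens 1 2 3 4
    [[1, 2, 9], [5, 6]],   # worker 1's cached sequences
]
best = route([1, 2, 3, 7], workers)  # → 0: worker 0 has a 3-token prefix hit
```

Routing on cache contents rather than plain load balancing is what makes prefill-decode disaggregation pay off: decode workers see prompts whose expensive prefix computation has already been done.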
Continued Progress on GB200 NVL72
Beyond the new GB300 results, SGLang has also improved performance on existing GB200 NVL72 systems. In just under 4 months, the latest InferenceXv2 release delivers up to 8x more tokens-per-GPU in high-throughput scenarios and up to 4x more tokens-per-user in latency-sensitive deployments compared to prior results.
Future Roadmap
Upcoming work includes enabling MTP (multi-token prediction) on GB300 NVL72, continued latency and throughput optimizations, tuning for the Qwen model families, and bringing these optimizations to future NVIDIA Vera Rubin NVL72 systems. The collaboration underscores the joint effort between SGLang developers and NVIDIA to reduce deployment costs for frontier reasoning models.