Long-Context Inference Optimization for GB300
NVIDIA and the SGLang team have released production-ready optimizations for deploying DeepSeek R1 on the GB300 NVL72 rack-scale system. The work focuses on long-context inference scenarios (128K input / 8K output tokens) where throughput and latency are the critical performance metrics.
Key Performance Gains
- Peak throughput: 226.2 tokens per second per GPU on GB300, representing a 1.53x advantage over GB200 configurations
- Under matched latency: GB300 delivers 1.38x–1.58x the per-GPU tokens per second of GB200 across representative workloads
- Per-user throughput: Multi-token prediction enables a 1.87x increase in tokens per second per user
- TTFT improvement: 8.6 seconds to first token on a 128K-token prefill, 1.07x–1.23x faster than GB200, driven by optimized attention kernels
Technical Approaches
The optimization employs several complementary techniques:
Prefill optimizations: Chunked pipeline parallelism distributes long-context prompt processing across pipeline stages with dynamic chunking. The team enabled FP8 attention and FP8 KV-cache support to reduce memory traffic and double KV-cache capacity within a fixed memory footprint. GB300's upgraded Special Function Unit (SFU) provides 2x throughput for softmax operations, yielding a 1.35x speedup in the FMHA kernel.
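To make the chunking idea concrete, here is a minimal sketch of how a long prompt might be split into chunks for pipelined prefill. The function name, the chunk-count heuristic, and the 4096-token floor are illustrative assumptions, not the SGLang implementation:

```python
def chunk_prompt(num_tokens: int, num_stages: int, min_chunk: int = 4096) -> list[int]:
    """Split a long prompt into chunk sizes for pipelined prefill.

    Heuristic (illustrative): aim for about 2 chunks per pipeline stage so
    every stage stays busy, but never shrink a chunk below min_chunk tokens.
    """
    target_chunks = max(1, num_stages * 2)
    chunk = max(min_chunk, -(-num_tokens // target_chunks))  # ceiling division
    sizes = []
    remaining = num_tokens
    while remaining > 0:
        step = min(chunk, remaining)
        sizes.append(step)
        remaining -= step
    return sizes

# A 128K-token prompt on a 4-stage pipeline yields eight 16K-token chunks,
# so successive chunks can flow through the stages and overlap execution.
sizes = chunk_prompt(131072, num_stages=4)
```

Each chunk's attention still attends to all previously processed chunks' KV cache, which is why the FP8 KV-cache support above matters for keeping long prefills memory-efficient.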
Decode optimization: Expert parallelism (Wide-EP) distributes MoE weights and KV cache across up to 32 GPUs, reducing per-GPU memory pressure. GB300's 288 GB of HBM3e (versus GB200's 192 GB) directly enables a 1.6x higher effective decode batch size: 40 vs. 24 requests per GPU at DEP8.
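The batch-size gain follows from simple capacity arithmetic. The sketch below uses illustrative numbers (48 GB reserved per GPU for weights and activations, 6 GB of KV cache per long-context request) chosen only so the result matches the reported 24 and 40; the real footprints depend on the model shard and context length:

```python
def max_decode_batch(hbm_gb: float, reserved_gb: float, kv_gb_per_request: float) -> int:
    """Requests that fit in HBM after reserving space for weights/activations.

    All three inputs are assumptions for illustration, not measured values.
    """
    return int((hbm_gb - reserved_gb) // kv_gb_per_request)

# With the same reserved footprint on both GPUs, the extra 96 GB of HBM3e
# on GB300 goes entirely to KV cache, lifting the decode batch size.
batch_gb200 = max_decode_batch(192, reserved_gb=48, kv_gb_per_request=6)
batch_gb300 = max_decode_batch(288, reserved_gb=48, kv_gb_per_request=6)
```

A larger decode batch amortizes the cost of reading the MoE weights each step, which is where the per-GPU throughput advantage comes from.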
Orchestration: The deployment uses NVIDIA Dynamo, a control plane for cluster-scale disaggregated inference that handles KV-cache-aware routing and worker coordination with near-zero scheduling overhead.
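KV-cache-aware routing can be sketched as prefix matching: send a request to the worker that already holds the longest cached prefix of its prompt, so prefill recomputes as little as possible. This is an illustrative toy, not the Dynamo routing code or API:

```python
def common_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the shared leading token run between two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(prompt: list[int], worker_caches: dict[str, list[int]]) -> str:
    """Pick the worker whose cached prefix overlaps the prompt the most.

    worker_caches maps a worker id to the token prefix it has in KV cache
    (a simplification: real routers track many cached prefixes per worker).
    """
    return max(worker_caches, key=lambda w: common_prefix_len(prompt, worker_caches[w]))

best = route([1, 2, 3, 4, 5], {"w0": [1, 2, 9], "w1": [1, 2, 3, 4], "w2": []})
```

In a disaggregated deployment the same signal also guides which prefill worker's KV cache a decode worker should pull from.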
Reproduction and Production Deployment
Complete reproduction instructions are available in the SGLang GitHub repository (issue #18703). For production deployments, the Dynamo Kubernetes stack offers GB200/GB300 support with inference-aware autoscaling and cluster topology-aware scheduling for disaggregated inference workloads.