NVIDIA and SGLang optimize DeepSeek for GB300 NVL72, achieving 226 tokens per second in 128K-token inference
release · feature · performance · integration · api · lmsys.org

Long-Context Inference Optimization for GB300

NVIDIA and the SGLang team have released production-ready optimizations for deploying DeepSeek R1 on the GB300 NVL72 GPU cluster. The work focuses on long-context inference scenarios (128K input / 8K output tokens) where throughput and latency are critical performance metrics.

Key Performance Gains

  • Peak throughput: 226.2 tokens per second per GPU on GB300, representing a 1.53x advantage over GB200 configurations
  • Under matched latency: GB300 delivers 1.38x–1.58x the per-GPU token throughput of GB200 across representative workloads
  • Per-user throughput: Multi-token prediction enables a 1.87x increase in tokens per second per user
  • TTFT improvement: 8.6 seconds for 128K token prefill, 1.07x–1.23x faster than GB200, driven by optimized attention kernels
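The headline figures above imply a few numbers the post does not state directly. A back-of-envelope check (the GB200 baseline and aggregate prefill rate below are derived, not published figures):

```python
# Back-of-envelope check of the reported speedups. Only the GB300 figures
# and ratios come from the post; the GB200 baseline is implied.

GB300_PEAK_TOKS_PER_GPU = 226.2   # reported peak throughput per GPU
GB300_VS_GB200_PEAK = 1.53        # reported advantage over GB200

# Implied GB200 peak under the same workload.
gb200_peak = GB300_PEAK_TOKS_PER_GPU / GB300_VS_GB200_PEAK

# 128K-token prefill finishing in 8.6 s TTFT implies this aggregate
# prefill rate (per serving instance, not necessarily per GPU).
PREFILL_TOKENS = 128 * 1024
TTFT_SECONDS = 8.6
prefill_rate = PREFILL_TOKENS / TTFT_SECONDS

print(f"implied GB200 peak: {gb200_peak:.1f} tok/s/GPU")
print(f"implied prefill rate: {prefill_rate:,.0f} tok/s")
```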

Technical Approaches

The optimization employs several complementary techniques:

Prefill optimizations: Chunked pipeline parallelism distributes long-context prompt processing across pipeline stages with dynamic chunking. The team enabled FP8 attention and FP8 KV-cache support to reduce memory traffic and double KV-cache capacity within a fixed memory footprint. GB300's upgraded Special Function Unit (SFU) provides 2x throughput for softmax operations, yielding a 1.35x speedup in the FMHA kernel.
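The chunking idea can be sketched in a few lines. This is an illustrative scheduler, not the actual SGLang implementation; the chunk sizes and the taper-toward-the-tail policy are hypothetical:

```python
# Illustrative sketch of dynamic chunking for chunked-pipeline prefill.
# A 128K prompt is split into token spans so pipeline stages can overlap
# work on successive chunks; later chunks shrink to reduce drain bubbles.
# All names and sizes here are hypothetical, not SGLang internals.

def dynamic_chunks(prompt_len: int, base_chunk: int = 8192,
                   min_chunk: int = 2048) -> list[tuple[int, int]]:
    """Return contiguous (start, end) token ranges covering the prompt."""
    spans, start, chunk = [], 0, base_chunk
    while start < prompt_len:
        end = min(start + chunk, prompt_len)
        spans.append((start, end))
        start = end
        chunk = max(min_chunk, chunk // 2)  # taper toward the tail
    return spans

spans = dynamic_chunks(128 * 1024)
assert spans[0] == (0, 8192)          # first chunk is full-sized
assert spans[-1][1] == 128 * 1024     # spans cover the whole prompt
```

Real schedulers balance pipeline-bubble reduction against per-chunk kernel efficiency, since very small chunks underutilize the attention kernels.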

Decode optimizations: Expert parallelism (Wide-EP) distributes MoE weights and KV cache across up to 32 GPUs, reducing per-GPU memory pressure. GB300's 288 GB of HBM3e (versus GB200's 192 GB) directly enables a 1.6x higher effective decode batch size: 40 vs. 24 requests per GPU at DEP8.
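The capacity arithmetic is worth making explicit. Only the HBM sizes and request counts below come from the post; note that the batch ratio (about 1.67x) exceeds the raw HBM ratio (1.5x), which is consistent with per-GPU fixed costs such as sharded weights not growing with memory, so the extra 96 GB goes disproportionately to KV cache:

```python
# Capacity arithmetic behind the decode batch-size claim.
# HBM sizes and per-GPU request counts are from the post.

GB300_HBM_GB, GB200_HBM_GB = 288, 192
gb300_batch, gb200_batch = 40, 24      # requests per GPU at DEP8

hbm_ratio = GB300_HBM_GB / GB200_HBM_GB      # raw memory advantage
batch_ratio = gb300_batch / gb200_batch      # observed batch advantage

print(f"HBM ratio:   {hbm_ratio:.2f}x")      # 1.50x
print(f"batch ratio: {batch_ratio:.2f}x")    # 1.67x
```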

Orchestration: The deployment uses NVIDIA Dynamo, a control plane for cluster-scale disaggregated inference that handles KV-cache-aware routing and worker coordination with near-zero scheduling overhead.
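KV-cache-aware routing can be sketched generically. This is not the Dynamo API; it is a minimal illustration of the idea of sending each request to the worker whose cached prefix overlaps it most, so prefill can reuse existing KV blocks:

```python
# Generic sketch of KV-cache-aware routing (hypothetical names, not the
# Dynamo API): route a request to the worker with the longest cached
# prefix match, maximizing KV-cache reuse during prefill.

def shared_prefix_len(a: list[int], b: list[int]) -> int:
    """Length of the common token prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def route(request_tokens: list[int],
          worker_prefixes: dict[str, list[int]]) -> str:
    """Pick the worker id whose cached prefix best matches the request."""
    return max(worker_prefixes,
               key=lambda w: shared_prefix_len(request_tokens,
                                               worker_prefixes[w]))

workers = {"w0": [1, 2, 3], "w1": [1, 2, 9], "w2": []}
assert route([1, 2, 3, 4], workers) == "w0"  # longest shared prefix
```

A production router additionally weighs worker load and queue depth against cache overlap; this sketch shows only the cache-affinity term.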

Reproduction and Production Deployment

Complete reproduction instructions are available in the SGLang GitHub repository (issue #18703). For production deployments, the Dynamo Kubernetes stack offers GB200/GB300 support with inference-aware autoscaling and cluster topology-aware scheduling for disaggregated inference workloads.