NVIDIA and SGLang optimize DeepSeek R1 for GB300 NVL72, reaching 226 tokens per second per GPU in 128K-token inference

NVIDIA and the SGLang team have published optimizations for running DeepSeek R1 on the GB300 NVL72 rack-scale system. The work combines prefill-decode disaggregation, pipeline parallelism, and expert parallelism to reach 226 tokens per second per GPU on long-context (128K-token) workloads. That is a 1.53x throughput advantage over GB200 under identical conditions, with further gains possible through multi-token prediction.
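To put the two headline numbers side by side, the short sketch below works out the GB200 throughput they imply. It assumes the 1.53x ratio applies to the same per-GPU, long-context measurement as the 226 tokens per second figure, which the summary suggests but does not state outright. The helper name `implied_baseline` is hypothetical and used only for illustration.

```python
def implied_baseline(throughput: float, speedup: float) -> float:
    """Return the baseline throughput implied by a measured value and a speedup ratio."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return throughput / speedup


# Figures quoted in the summary (tokens per second per GPU, and GB300/GB200 ratio).
GB300_TOKENS_PER_SEC_PER_GPU = 226.0
SPEEDUP_OVER_GB200 = 1.53

# Assumption: the ratio and the throughput refer to the same workload.
gb200_estimate = implied_baseline(GB300_TOKENS_PER_SEC_PER_GPU, SPEEDUP_OVER_GB200)
print(f"Implied GB200 throughput: {gb200_estimate:.1f} tokens/s per GPU")
```

Under that assumption the GB200 baseline comes out at a little under 148 tokens per second per GPU. This is a derived estimate, not a figure reported in the announcement.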
release, feature, performance, integration, api