LMSYS Changelogs | StandardDB

NVIDIA and SGLang optimize DeepSeek for GB300 NVL72, reaching 226 tokens per second per GPU in 128K-token inference
NVIDIA and the SGLang team have published optimizations for serving DeepSeek R1 on the GB300 NVL72 rack-scale system. The optimizations combine prefill-decode disaggregation, pipeline parallelism, and expert parallelism to reach 226 tokens per second per GPU on long-context workloads. Under identical conditions, this is 1.53x the throughput of GB200, and multi-token prediction could raise it further.

release, feature, performance, integration, api
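Pipeline parallelism is one of the techniques named in the entry above. A common way to apply it to a long prompt is to split the prefill into chunks so that different pipeline stages work on different chunks at the same time. The sketch below is a generic illustration, not the published SGLang configuration. It assumes a simple forward-only schedule, 16K-token chunks, and 4 stages. `chunk_prompt`, `pipeline_schedule`, and `bubble_fraction` are hypothetical helpers, not SGLang APIs.

```python
from typing import List, Tuple


def chunk_prompt(num_tokens: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split a prompt of num_tokens tokens into [start, end) chunks of at most chunk_size tokens."""
    if num_tokens <= 0 or chunk_size <= 0:
        raise ValueError("num_tokens and chunk_size must be positive")
    return [(start, min(start + chunk_size, num_tokens))
            for start in range(0, num_tokens, chunk_size)]


def pipeline_schedule(num_chunks: int, num_stages: int) -> List[List[Tuple[int, int]]]:
    """For each time step, list the (stage, chunk) pairs that are busy.

    Forward-only schedule (an assumption for illustration): chunk c reaches
    stage s at step c + s, so once the pipeline fills, every stage works on
    a different chunk in the same step.
    """
    schedule: List[List[Tuple[int, int]]] = [[] for _ in range(num_chunks + num_stages - 1)]
    for chunk in range(num_chunks):
        for stage in range(num_stages):
            schedule[chunk + stage].append((stage, chunk))
    return schedule


def bubble_fraction(num_chunks: int, num_stages: int) -> float:
    """Fraction of stage-steps left idle while the pipeline fills and drains."""
    total_steps = num_chunks + num_stages - 1
    return (num_stages - 1) / total_steps


if __name__ == "__main__":
    # Hypothetical example: a 128K (131,072-token) prompt, 16K-token chunks, 4 stages.
    chunks = chunk_prompt(131_072, 16_384)
    steps = pipeline_schedule(len(chunks), num_stages=4)
    print(len(chunks), len(steps))              # 8 chunks, 11 steps
    print(steps[3])                             # all four stages busy at step 3
    print(round(bubble_fraction(len(chunks), 4), 3))
```

More chunks per prompt shrink the idle "bubble" at the start and end of the pipeline. The trade-off is that each chunk does less work per step, which is why the chunk size is a tuning knob rather than a fixed value.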

SGLang-Diffusion v2 Adds Token-Level Sharding and Parallel VAE for Video Generation
SGLang-Diffusion v2 brings optimizations for production video generation: token-level sequence sharding, distributed VAE encoding and decoding, and fixes that stabilize Cache-DiT serving. The updates aim to remove memory bottlenecks and cut communication overhead in multi-GPU setups.

feature, performance, sdk, open-source
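Token-level sequence sharding, as the entry describes it, splits a single generation's token sequence across GPUs so each one holds only part of it. The sketch below is a generic illustration under stated assumptions, not SGLang-Diffusion's implementation. `shard_tokens` and `gather_tokens` are hypothetical helpers, and plain Python lists stand in for per-GPU tensors. The concatenation in `gather_tokens` stands in for a collective operation such as an all-gather.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_tokens(tokens: Sequence[T], world_size: int) -> List[List[T]]:
    """Split tokens into world_size contiguous shards whose lengths differ by at most one."""
    if world_size <= 0:
        raise ValueError("world_size must be positive")
    base, extra = divmod(len(tokens), world_size)
    shards: List[List[T]] = []
    start = 0
    for rank in range(world_size):
        # The first `extra` ranks take one additional token so shard sizes stay balanced.
        length = base + (1 if rank < extra else 0)
        shards.append(list(tokens[start:start + length]))
        start += length
    return shards


def gather_tokens(shards: List[List[T]]) -> List[T]:
    """Concatenate per-rank shards back into the full sequence (a stand-in for an all-gather)."""
    return [token for shard in shards for token in shard]


if __name__ == "__main__":
    # Hypothetical video latent: 2 frames x 3 x 5 patches = 30 tokens, flattened.
    tokens = list(range(2 * 3 * 5))
    shards = shard_tokens(tokens, world_size=4)
    print([len(s) for s in shards])   # [8, 8, 7, 7]
    # A token-wise op (e.g. an MLP) can run on each shard independently.
    processed = [[t * 2 for t in shard] for shard in shards]
    assert gather_tokens(processed) == [t * 2 for t in tokens]
```

Token-wise layers can operate on local shards without any communication. Attention layers, however, mix information across all tokens, so they still need to exchange data between ranks, for example via all-to-all or ring-style attention. This exchange is where the communication overhead mentioned in the entry comes from.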