SGLang-Diffusion v2 Adds Token-Level Sharding and Parallel VAE for Video Generation
feature · performance · sdk · open-source · lmsys.org

Token-Level Sequence Sharding

SGLang-Diffusion improves sequence-parallel sharding by flattening the temporal, height, and width dimensions (T × H × W) into a single sequence dimension before sharding across GPUs. This replaces the previous frame-level approach, which required padding frames and introduced significant computational overhead. For typical video resolutions, the new token-level strategy eliminates padding entirely while reducing all-to-all communication volume by up to 12.5%.
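A minimal sketch of the difference between the two strategies (helper names and the example shape are illustrative, not the SGLang-Diffusion API): token-level sharding splits the flattened T × H × W sequence almost perfectly evenly, while frame-level sharding must round the frame count up to a multiple of the GPU count and pad.

```python
def token_level_shards(t, h, w, world_size):
    """Flatten (T, H, W) into one token sequence and split it across
    `world_size` ranks as evenly as possible -- no padding needed."""
    total = t * h * w
    base, rem = divmod(total, world_size)
    # The first `rem` ranks take one extra token each.
    return [base + (1 if r < rem else 0) for r in range(world_size)]

def frame_level_shards(t, h, w, world_size):
    """Old frame-level approach: pad T up to a multiple of world_size,
    then give each rank an equal number of (possibly padded) frames.
    Returns (tokens per rank, wasted padding tokens)."""
    frames_per_rank = -(-t // world_size)            # ceil division
    padded_frames = frames_per_rank * world_size
    sizes = [frames_per_rank * h * w] * world_size
    return sizes, (padded_frames - t) * h * w

# Hypothetical example: 21 latent frames at 60 x 104 on 8 GPUs.
token_sizes = token_level_shards(21, 60, 104, 8)
frame_sizes, pad_tokens = frame_level_shards(21, 60, 104, 8)
```

With these numbers the frame-level scheme wastes three full frames of padding tokens per step, while the token-level split has none and every rank differs by at most one token.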

Distributed VAE and Parallel Folding

The framework now supports Parallel VAE with height-wise sharding for high-resolution video encoding/decoding, using halo_exchange to share boundary pixels between neighboring shards and all_gather for attention operations. Additionally, Parallel Folding decouples the Text Encoder and DiT parallelism strategies, allowing the Text Encoder to reuse the DiT's sequence-parallel group as tensor parallelism, improving memory efficiency without sacrificing throughput.
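To make the halo idea concrete, here is a single-process numpy stand-in for height-wise sharding (a sketch, not the SGLang-Diffusion API): each rank's slice is extended by a few boundary rows so that convolutions straddling a shard edge see the same neighborhood they would on the full tensor. In a real run, each rank holds only its slice and receives the halo rows from its neighbors via a collective.

```python
import numpy as np

def height_shard_with_halo(x, rank, world_size, halo=1):
    """Return rank's height slice of x (shape (C, H, W)), extended by
    `halo` rows on each side and clamped at the image borders.
    Assumes H is divisible by world_size for brevity."""
    c, h, w = x.shape
    rows = h // world_size
    lo, hi = rank * rows, (rank + 1) * rows
    lo_h, hi_h = max(lo - halo, 0), min(hi + halo, h)
    return x[:, lo_h:hi_h, :]

x = np.arange(2 * 8 * 4, dtype=np.float32).reshape(2, 8, 4)
mid = height_shard_with_halo(x, rank=1, world_size=4)   # interior rank
top = height_shard_with_halo(x, rank=0, world_size=4)   # clamped at border
```

An interior rank ends up with its two rows plus one halo row on each side; a border rank only gets a halo on its inner edge.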

Cache-DiT Multi-Request Stability

Critical bugs in Cache-DiT serving have been fixed:

  • Each transformer in the dual-transformer architecture (transformer and transformer_2) now manages its own cache context with an independent step count
  • Cache contexts are refreshed for each new request, preventing buffer contamination and shape mismatch crashes when handling variable video shapes
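The fixes above can be sketched as follows (class and method names are hypothetical, chosen only to illustrate the ownership and lifecycle; the real Cache-DiT internals differ): each transformer owns an independent context with its own step counter, and all contexts are rebuilt at the start of every request so buffers cached for one video shape can never leak into the next.

```python
class CacheContext:
    """Minimal stand-in for a Cache-DiT cache context: a per-transformer
    step counter plus shape-dependent cached buffers."""
    def __init__(self):
        self.step = 0
        self.buffers = {}

    def record(self, key, value):
        self.buffers[key] = value
        self.step += 1

class DualTransformerServer:
    """transformer and transformer_2 each own an independent context;
    begin_request() rebuilds both, discarding stale buffers."""
    def __init__(self):
        self.contexts = {}

    def begin_request(self):
        # Fresh contexts per request -> no buffer contamination,
        # no shape-mismatch crash on variable video shapes.
        self.contexts = {name: CacheContext()
                         for name in ("transformer", "transformer_2")}

srv = DualTransformerServer()
srv.begin_request()
srv.contexts["transformer"].record("hidden", (16, 64))
srv.contexts["transformer"].record("hidden", (16, 64))
srv.contexts["transformer_2"].record("hidden", (16, 64))
```

Because the two transformers typically run different numbers of denoising steps, sharing one counter (the old behavior) would desynchronize cache hits; separate counters keep each cache consistent with its own schedule.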

I/O and Kernel Optimizations

Video save operations now run directly on GPU workers, eliminating serialization/deserialization overhead and tensor copies. Additionally, custom JIT kernels for WanVideo LayerNorm variants (including LayerNormScaleShift and ScaleResidualLayerNormScaleShift) reduce GPU bubbles by fusing elementwise operations with normalization reductions.
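As an illustration of what the fused kernels compute, here is a numpy sketch of the LayerNormScaleShift and ScaleResidualLayerNormScaleShift math (the AdaLN-style `(1 + scale) * x + shift` modulation is an assumption based on common DiT blocks; the real versions are JIT GPU kernels that avoid materializing the intermediates):

```python
import numpy as np

def layernorm_scale_shift(x, scale, shift, eps=1e-6):
    """LayerNorm over the last axis fused with an AdaLN-style
    (1 + scale) * x_norm + shift modulation (assumed semantics)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    normed = (x - mean) / np.sqrt(var + eps)
    return normed * (1.0 + scale) + shift

def scale_residual_layernorm_scale_shift(x, residual, gate,
                                         scale, shift, eps=1e-6):
    """Fuses the gated residual add (residual + gate * x) with the
    following LayerNorm + scale/shift. Returns (normed output, new
    residual), matching the usual pre-norm block pattern."""
    h = residual + gate * x
    return layernorm_scale_shift(h, scale, shift, eps), h

x = np.array([[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 1.0, 2.0]])
scale = np.array([0.1, 0.2, 0.3, 0.4])
shift = np.array([0.0, 0.1, -0.1, 0.2])
fused = layernorm_scale_shift(x, scale, shift)
```

On GPU, fusing the elementwise scale/shift (and the residual add) into the normalization's reduction pass removes separate kernel launches and extra global-memory round trips, which is where the "GPU bubble" reduction comes from.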