Elastic EP: Partial Failure Tolerance for Large-Scale MoE Inference
SGLang's latest update introduces Elastic EP, a fault-tolerance mechanism that handles partial GPU failures in wide Expert Parallelism (EP) deployments without restarting the entire inference instance. Large-scale MoE models like DeepSeek need wide EP (often 32+ GPUs) to reach the batch sizes and latency they require, but the more GPUs an instance spans, the more likely it is that a single hardware or process failure brings down the entire system.
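To illustrate why width raises failure risk, here is a minimal sketch that assumes each GPU fails independently with the same probability over some time window. Both the independence assumption and the per-GPU probability `p` are hypothetical and are not taken from SGLang's measurements.

```python
def p_any_failure(p_single: float, n_gpus: int) -> float:
    """Probability that at least one of n_gpus fails in the window,
    assuming independent, identically likely failures (an assumption)."""
    return 1.0 - (1.0 - p_single) ** n_gpus


# Hypothetical per-GPU failure probability, for illustration only.
p = 0.001
for n in (8, 32, 64):
    print(f"{n} GPUs: {p_any_failure(p, n):.4f}")
```

Under this model the chance of at least one failure grows almost linearly with the number of GPUs while `p` is small, which is why a wide EP instance without fault tolerance is far more exposed than a single-node one.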
The Solution
Elastic EP decouples the rigid mapping between experts and specific GPUs by maintaining redundant experts across the cluster. When a failure is detected, the system automatically redistributes expert weights and reroutes tokens to the surviving GPUs, all without halting inference. Two layers work together to achieve this:
- Scheduler Layer: Monitors the health of Data Parallel (DP) ranks and excludes failed ranks from new batch assignments
- Expert Parallel Layer: Dynamically adjusts expert-to-GPU mappings in real time so that results remain mathematically correct
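The two layers can be sketched roughly as follows. This is a simplified illustration, not SGLang's implementation: the function names, the `placement` dictionary (expert id to the ranks holding a replica), and the round-robin routing rule are all hypothetical, and the sketch only drops dead replicas rather than redistributing weights.

```python
from typing import Dict, List, Set


def healthy_ranks(all_ranks: List[int], failed: Set[int]) -> List[int]:
    """Scheduler-layer idea: only surviving ranks receive new batches."""
    return [r for r in all_ranks if r not in failed]


def remap_experts(placement: Dict[int, List[int]],
                  failed: Set[int]) -> Dict[int, List[int]]:
    """EP-layer idea: keep only replicas that live on surviving ranks."""
    remapped = {}
    for expert, ranks in placement.items():
        alive = [r for r in ranks if r not in failed]
        if not alive:
            # Redundancy was insufficient for this failure pattern.
            raise RuntimeError(f"expert {expert} has no surviving replica")
        remapped[expert] = alive
    return remapped


def route(expert: int, token_id: int,
          placement: Dict[int, List[int]]) -> int:
    """Pick a replica for a token (simple round-robin, an assumption)."""
    replicas = placement[expert]
    return replicas[token_id % len(replicas)]


# Three experts, each replicated on two of four ranks; rank 1 fails.
placement = {0: [0, 1], 1: [1, 2], 2: [2, 3]}
failed = {1}
print(healthy_ranks([0, 1, 2, 3], failed))
print(remap_experts(placement, failed))
```

The sketch also shows why the number of redundant experts bounds the failures that can be tolerated: once every replica of some expert sits on a failed rank, no remapping can preserve correctness.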
Performance and Reliability Results
Testing on DeepSeek V3.2 with 4 nodes (32 GPUs) and 256 redundant experts shows:
- Service recovery in ~6-7 seconds across all failure scenarios (1-16 failed ranks), a reduction of more than 90% compared with the 2-3 minutes a full restart takes
- Graceful throughput degradation: Remaining GPUs continue inference at reduced but stable throughput proportional to available resources
- Zero static performance overhead: Elastic EP matches standard DeepEP's baseline metrics (3560+ tokens/sec throughput, ~54ms TPOT) under normal conditions
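As a quick check on the recovery figures, the arithmetic below compares a 6-7 second recovery against a 2-3 minute restart. The second function is an idealized linear model of "throughput proportional to available resources"; the linear form and its parameter names are assumptions, not measured behavior.

```python
def downtime_reduction(recovery_s: float, restart_s: float) -> float:
    """Fraction of downtime avoided by recovering in place instead of restarting."""
    return 1.0 - recovery_s / restart_s


# Bounds from the figures above: 6-7 s recovery vs. a 2-3 min restart.
worst = downtime_reduction(7, 2 * 60)  # slowest recovery, fastest restart
best = downtime_reduction(6, 3 * 60)   # fastest recovery, slowest restart
print(f"{worst:.1%} to {best:.1%}")


def ideal_throughput(baseline: float, total_ranks: int, failed_ranks: int) -> float:
    """Idealized model (assumption): throughput scales with surviving ranks."""
    return baseline * (total_ranks - failed_ranks) / total_ranks
```

Even in the least favorable pairing of the quoted numbers, in-place recovery avoids well over 90% of the restart downtime.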
Implementation Details
Elastic EP integrates with Mooncake, a fault-tolerant communication library that provides:
- Resilient collective operations (broadcast, allgather)
- Fault-tolerant MoE-specific primitives (dispatch, combine)
- High-performance GPU Direct RDMA with rapid timeout-based fault detection
Enabling Elastic EP
Enable the feature when starting SGLang with:
- --elastic-ep-backend mooncake
- --moe-a2a-backend mooncake
- --mooncake-ib-device <devices>
- --ep-num-redundant-experts <num> (controls the number of failures tolerated)
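The flags above could be assembled into a launch command along these lines. This is a sketch: the helper `build_launch_args` is hypothetical, and the model path and InfiniBand device name in the usage line are placeholders to replace with your own values. The value 256 mirrors the test configuration described earlier. Other options a real deployment needs, such as parallelism sizes and multi-node settings, are omitted.

```python
import shlex
from typing import List


def build_launch_args(model_path: str, ib_devices: str,
                      num_redundant_experts: int) -> List[str]:
    """Build an SGLang launch command with the Elastic EP flags enabled."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--elastic-ep-backend", "mooncake",
        "--moe-a2a-backend", "mooncake",
        "--mooncake-ib-device", ib_devices,
        "--ep-num-redundant-experts", str(num_redundant_experts),
    ]


# Placeholder values: substitute your own model path and IB device(s).
args = build_launch_args("path/to/model", "mlx5_0", 256)
print(shlex.join(args))
```

Building the command as a list keeps each flag paired with its value, which avoids the kind of run-together arguments that are easy to produce when pasting flags by hand.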