SGLang integrates Elastic EP for fault-tolerant MoE inference, tolerating up to 16 GPU failures and cutting recovery time by about 90%

Elastic EP: Partial Failure Tolerance for Large-Scale MoE Inference

SGLang's latest update introduces Elastic EP, a fault-tolerance mechanism that handles partial GPU failures in wide Expert Parallelism (EP) deployments without restarting the entire inference instance. Large-scale MoE models like DeepSeek require wide EP (often 32+ GPUs) to reach the batch sizes and latency they need, but at that scale the probability that a single hardware or process failure brings down the whole system rises sharply.

The Solution

Elastic EP decouples the rigid mapping between experts and specific GPUs by maintaining redundant experts across the cluster. When a failure is detected, the system automatically redistributes expert weights and reroutes tokens to surviving GPUs, all without halting inference. This is achieved through two layers:

Scheduler Layer: Monitors the health of Data Parallel (DP) ranks and excludes failed ranks from new batch assignments
Expert Parallel Layer: Dynamically adjusts expert-to-GPU mappings at runtime so the model's outputs remain mathematically correct
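
The two layers can be pictured with a small toy model. This is a minimal sketch, not SGLang's implementation: the function names, the replica table, and the "pick the first live replica" rule are all assumptions made for illustration, and the redistribution of expert weights is not modeled at all.

```python
def healthy_ranks(all_ranks, failed):
    """Scheduler layer (toy): keep only DP ranks that have not failed."""
    return [r for r in all_ranks if r not in failed]


def route_expert(expert_id, replicas, failed):
    """Expert parallel layer (toy): pick a surviving GPU holding the expert."""
    live = [gpu for gpu in replicas[expert_id] if gpu not in failed]
    if not live:
        # No redundant copy survived, so this expert cannot be served.
        raise RuntimeError(f"expert {expert_id} has no surviving replica")
    return live[0]


# Hypothetical placement: 4 GPUs, 4 experts, each expert stored on two GPUs.
replicas = {0: [0, 1], 1: [1, 2], 2: [2, 3], 3: [3, 0]}
failed = {1}  # GPU 1 has gone down

print(healthy_ranks(range(4), failed))                        # [0, 2, 3]
print([route_expert(e, replicas, failed) for e in replicas])  # [0, 2, 2, 3]
```

In this toy placement every expert survives the loss of any single GPU because each one has a second copy elsewhere, which is the role the redundant experts play in the description above.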

Performance and Reliability Results

Testing on DeepSeek V3.2 with 4 nodes (32 GPUs) and 256 redundant experts shows:

Service recovery in ~6-7 seconds across all failure scenarios (1-16 failed ranks), a roughly 90% reduction compared with the 2-3 minutes a full restart takes
Graceful throughput degradation: Remaining GPUs continue inference at reduced but stable throughput proportional to available resources
Zero static performance overhead: Elastic EP matches standard DeepEP's baseline metrics (3560+ tokens/sec throughput, ~54ms TPOT) under normal conditions
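
As a quick sanity check on the recovery figure, the reduction can be computed from the endpoints quoted above. The sketch below treats the approximate ranges (6-7 seconds, 2-3 minutes) as exact bounds, which is an assumption; the helper name is illustrative.

```python
def reduction(recovery_s, restart_s):
    """Fraction of restart time saved by recovering in place."""
    return 1.0 - recovery_s / restart_s


# Least favorable pairing: slowest recovery against the fastest full restart.
worst = reduction(7, 120)
# Most favorable pairing: fastest recovery against the slowest full restart.
best = reduction(6, 180)

print(f"{worst:.1%} to {best:.1%}")  # 94.2% to 96.7%
```

Under these bounds the saving lands a little above the rounded 90% figure, so the headline number reads as a conservative summary.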

Implementation Details

Elastic EP integrates with Mooncake, a fault-tolerant communication library that provides:

Resilient collective operations (broadcast, allgather)
Fault-tolerant MoE-specific primitives (dispatch, combine)
High-performance GPU Direct RDMA with rapid timeout-based fault detection
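
The general idea behind timeout-based fault detection can be sketched as a heartbeat monitor. This is a generic illustration only: the source does not describe how Mooncake detects failures internally, and the class, its methods, and the timeout value are hypothetical.

```python
import time


class HeartbeatMonitor:
    """Marks a rank as failed once it has been silent longer than timeout_s."""

    def __init__(self, ranks, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the example can use a fake clock
        now = clock()
        self.last_seen = {rank: now for rank in ranks}

    def beat(self, rank):
        """Record that a rank responded just now."""
        self.last_seen[rank] = self.clock()

    def failed_ranks(self):
        """Return ranks whose last response is older than the timeout."""
        now = self.clock()
        return sorted(
            rank for rank, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


# Drive the monitor with a fake clock so the outcome is deterministic.
fake_now = [0.0]
monitor = HeartbeatMonitor(range(3), timeout_s=1.0, clock=lambda: fake_now[0])
fake_now[0] = 0.8
monitor.beat(0)
monitor.beat(2)
fake_now[0] = 1.5
print(monitor.failed_ranks())  # [1]
```

A shorter timeout detects failures sooner but risks flagging a slow rank as dead, which is the trade-off any timeout-based scheme has to tune.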

Enabling Elastic EP

Enable the feature when starting SGLang with:

--elastic-ep-backend mooncake
--moe-a2a-backend mooncake
--mooncake-ib-device <devices>
--ep-num-redundant-experts <num> (controls the number of failures tolerated)
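
The flags above can be assembled into a launch command programmatically. This is a sketch under stated assumptions: the helper function is hypothetical, the model path and device list are placeholders, the value 256 is taken from the test configuration described earlier, and a real wide-EP deployment would need further parallelism and multi-node arguments that are not shown here.

```python
import shlex


def build_launch_command(model_path, ib_devices, num_redundant_experts):
    """Return the argv list for starting SGLang with Elastic EP enabled."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--elastic-ep-backend", "mooncake",
        "--moe-a2a-backend", "mooncake",
        "--mooncake-ib-device", ib_devices,
        "--ep-num-redundant-experts", str(num_redundant_experts),
    ]


# Placeholders stay as placeholders; fill them in for a real cluster.
cmd = build_launch_command("<model-path>", "<devices>", 256)
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and its value together, and shlex.join quotes the placeholders safely if the result is pasted into a shell.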
