vLLM
vLLM v0.16.0 delivers 30% throughput gains with async scheduling and pipeline parallelism
release · feature · performance · api · platform · github.com ↗

Performance Improvements

vLLM v0.16.0 delivers a 30.8% end-to-end throughput improvement and a 31.8% TPOT (time per output token) improvement through full support for async scheduling combined with pipeline parallelism. The engine has also been optimized for speculative decoding with async scheduling, delivering an additional 1.5% throughput gain.
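For readers unfamiliar with the TPOT metric cited above, the sketch below computes it from synthetic timestamps. The helper name and the toy numbers are illustrative only; the 31.8% figure is vLLM's measured improvement, not something derived from this data.

```python
# TPOT (time per output token): average gap between successive output
# tokens after the first, i.e. steady-state decode latency per token.

def time_per_output_token(ttft_s: float, total_latency_s: float,
                          num_output_tokens: int) -> float:
    """TPOT excludes the time to first token (TTFT)."""
    if num_output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (total_latency_s - ttft_s) / (num_output_tokens - 1)

# Synthetic run: 0.25 s to first token, 5.25 s total, 101 tokens out.
baseline = time_per_output_token(ttft_s=0.25, total_latency_s=5.25,
                                 num_output_tokens=101)
improved = baseline * (1 - 0.318)  # applying a 31.8% TPOT improvement
print(f"baseline TPOT: {baseline * 1000:.1f} ms/token, "
      f"improved: {improved * 1000:.2f} ms/token")
```

With the toy inputs above, the baseline TPOT works out to 50 ms/token and the improved figure to roughly 34 ms/token.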

New Realtime API and Features

A new WebSocket-based Realtime API enables streaming audio interactions, building on the Voxtral realtime infrastructure and expanding vLLM's capabilities for real-time multimodal applications. The release also improves RLHF workflows: native NCCL-based weight syncing, layerwise weight reloading for QeRL, and engine pause/resume functionality that preserves in-flight requests.
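As a rough sketch of what talking to a WebSocket realtime endpoint looks like, the snippet below builds an audio-append event and streams chunks over a socket. The release notes do not document the wire protocol, so the event shape here assumes an OpenAI-style realtime schema; the event name, the `stream_audio` helper, and the URL are assumptions, not vLLM's published API.

```python
# Hypothetical realtime client sketch. The event schema mimics
# OpenAI-style realtime messages; it is NOT taken from vLLM docs.
import base64
import json

def build_audio_append_event(pcm_bytes: bytes) -> str:
    """Wrap a raw PCM audio chunk in a JSON event for the session."""
    return json.dumps({
        "type": "input_audio_buffer.append",          # assumed event name
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })

async def stream_audio(ws_url: str, chunks) -> None:
    # Requires the third-party `websockets` package and a running server.
    import websockets  # type: ignore
    async with websockets.connect(ws_url) as ws:
        for chunk in chunks:
            await ws.send(build_audio_append_event(chunk))
        async for message in ws:                      # server responses
            print(json.loads(message).get("type"))

# The event builder works standalone, without a server:
event = json.loads(build_audio_append_event(b"\x00\x01"))
print(event["type"])
```

The JSON-over-WebSocket framing is the common pattern for realtime speech APIs; consult vLLM's own documentation for the actual endpoint path and event names.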

Speculative Decoding Enhancements

Unified Parallel Drafting has been implemented for speculative decoding, and the feature now works seamlessly with structured outputs. Penalty application improvements in Model Runner V2 further enhance compatibility and performance.
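An illustrative launch command combining the two features discussed above follows. The flag names reflect recent vLLM CLI conventions, but the model name, draft settings, and backend choice are placeholders, not recommendations from the release notes.

```shell
# Illustrative only: n-gram speculative decoding alongside structured
# outputs. Values are placeholders; check the vLLM docs for your version.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "ngram", "num_speculative_tokens": 4, "prompt_lookup_max": 4}' \
  --guided-decoding-backend xgrammar
```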

Model and Architecture Expansion

The release adds support for 12 new model architectures including GLM-OCR with MTP, Qwen3-ASR, DeepSeek-OCR-2, Intern-S1-Pro, MiniCPM-o 4.5, and others. LoRA support has been expanded to include Gemma3 vision components and Nemotron-H MTP models, with optimized fused MoE-LoRA kernel indexing that reduces overhead when serving multiple concurrent LoRAs.
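Serving multiple concurrent LoRAs, the scenario the kernel-indexing optimization targets, might look like the hedged example below. The flags follow vLLM's existing LoRA CLI; the base model, adapter names, and paths are placeholders.

```shell
# Illustrative only: one base model with two LoRA adapters loaded
# concurrently. Adapter names and paths are placeholders.
vllm serve google/gemma-3-4b-it \
  --enable-lora \
  --lora-modules adapter-a=/path/to/adapter_a adapter-b=/path/to/adapter_b \
  --max-loras 2
```

Requests then select an adapter by passing its name as the `model` field, which is where per-request LoRA indexing overhead arises.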

Intel XPU Platform Overhaul

The XPU platform receives a major overhaul, deprecating IPEX in favor of vllm-xpu-kernels. New support has been added for MoE operations, MXFP4 MoE, WNA16, scaled matrix multiplication, and FP8 MoE.

Additional Improvements

The release includes fixes for MRoPE positioning in multimodal models and GLM-4.7-GPTQ decode regressions, along with tokenizer and fast-detokenization optimizations for DeepSeek V3.2. This substantial release represents contributions from 203 developers, including 7 first-time contributors.