Core Performance Improvements
vLLM v0.16.0 is a major release comprising 440 commits from 203 contributors. Its headline change is full support for async scheduling combined with pipeline parallelism, with a documented 30.8% end-to-end throughput improvement and a 31.8% improvement in time per output token. This represents a significant step forward for serving large language models at scale.
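As a minimal sketch of combining the two features, the snippet below builds the engine arguments and translates them into CLI-style flags. The exact flag names (`--async-scheduling`, `--pipeline-parallel-size`) and the model name are assumptions, not taken from the release notes; check `vllm serve --help` for the authoritative spelling.

```python
# Hypothetical launch configuration pairing async scheduling with
# pipeline parallelism (flag names assumed; verify against your vLLM
# version). Equivalent CLI:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct \
#       --async-scheduling --pipeline-parallel-size 2

engine_args = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    "async_scheduling": True,      # overlap CPU scheduling with GPU execution
    "pipeline_parallel_size": 2,   # split model layers across 2 GPUs
}

def to_cli_flags(args: dict) -> list[str]:
    """Turn the kwargs above into `vllm serve`-style CLI flags."""
    flags = []
    for key, value in args.items():
        if key == "model":
            continue  # the model is a positional argument, not a flag
        flag = "--" + key.replace("_", "-")
        if value is True:
            flags.append(flag)
        else:
            flags.extend([flag, str(value)])
    return flags

print(to_cli_flags(engine_args))
# → ['--async-scheduling', '--pipeline-parallel-size', '2']
```

Async scheduling overlaps the scheduler's CPU work with GPU execution, which is why it composes well with pipeline parallelism: the pipeline bubbles leave CPU headroom the scheduler can now use.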
New Capabilities
Realtime API: A new WebSocket-based Realtime API enables streaming audio interactions, building on the Voxtral realtime infrastructure for conversational AI applications.
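To make the streaming model concrete, here is a toy client-side sketch. The endpoint path and the event schema below are assumptions loosely modeled on OpenAI-style realtime events, not confirmed by the release notes; only the JSON plumbing is shown.

```python
import json

# Hypothetical endpoint and event shape (NOT confirmed by the release
# notes); the actual WebSocket URL and message schema may differ.
REALTIME_URL = "ws://localhost:8000/v1/realtime"

def audio_append_event(b64_chunk: str) -> str:
    """Serialize one base64-encoded streaming audio chunk as a JSON event."""
    return json.dumps({
        "type": "input_audio_buffer.append",  # assumed event type
        "audio": b64_chunk,
    })

# With a WebSocket client such as `websockets` (illustrative, not executed):
#   async with websockets.connect(REALTIME_URL) as ws:
#       await ws.send(audio_append_event(chunk))
#       reply = await ws.recv()

event = json.loads(audio_append_event("UklGRg=="))
print(event["type"])
```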
Speculative Decoding Enhancements:
- Unified Parallel Drafting framework simplifies deployment
- Speculative decoding now works with structured outputs
- Penalty application integrated into Model Runner V2
- Additional drafting model support including EAGLE3, AFMoE, and Mistral3
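The items above can be sketched as a configuration plus the usual back-of-envelope payoff calculation. The config key names below are assumptions (consult the vLLM docs for the exact `speculative_config` schema), and the speedup formula is the standard independence approximation, not a vLLM-specific formula.

```python
# Hypothetical speculative-decoding configuration (key names assumed).
speculative_config = {
    "method": "eagle3",               # one of the supported drafters
    "model": "path/to/eagle3-draft",  # placeholder draft-model path
    "num_speculative_tokens": 4,      # draft tokens proposed per step
}

def expected_tokens_per_step(acceptance_rate: float, k: int) -> float:
    """Expected tokens emitted per target-model step with k draft tokens,
    assuming independent per-token acceptance (a common back-of-envelope
    model): 1 bonus token plus the run of accepted drafts."""
    return sum(acceptance_rate ** i for i in range(k + 1))

# With an 80% acceptance rate and 4 drafts per step:
print(round(expected_tokens_per_step(0.8, 4), 3))
# → 3.362
```

Combining speculative decoding with structured outputs is notable because the draft tokens must also satisfy the grammar constraint, which previously forced users to choose one feature or the other.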
Model Support Expansion
This release adds support for 12+ new model architectures including:
- Vision models: GLM-OCR with MTP, DeepSeek-OCR-2, Intern-S1-Pro, MiniCPM-o 4.5
- Speech/Audio: Qwen3-ASR, FunAudioChat
- Other specialized models: ColBERT, voyage-4-nano, GLM-5
Additionally, LoRA support has expanded to Gemma3 vision components and Nemotron-H MTP models, with optimized fused MoE-LoRA kernel indexing.
Infrastructure & RLHF Improvements
- Native NCCL-based weight syncing API for distributed training
- Layerwise weight reloading for QeRL
- Engine pause/resume with request preservation for flexible training workflows
- Major XPU platform overhaul: IPEX is deprecated in favor of vllm-xpu-kernels, with new support for MoE, MXFP4 MoE, WNA16, scaled_mm, and FP8 MoE
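The pause/resume behavior can be illustrated with a toy engine. The real vLLM API names are not given in the notes, so this is a conceptual sketch of the semantics only: pausing stops scheduling without dropping queued requests, so an RLHF trainer can swap weights in and resume where it left off.

```python
# Toy illustration of engine pause/resume with request preservation.
# This mimics the described semantics; it is NOT the vLLM API.
class ToyEngine:
    def __init__(self):
        self.queue: list[str] = []
        self.paused = False

    def add_request(self, req: str) -> None:
        self.queue.append(req)

    def pause(self) -> None:
        # Stop scheduling new work but keep queued requests intact,
        # e.g. while an RLHF trainer reloads updated weights.
        self.paused = True

    def resume(self) -> None:
        self.paused = False

    def step(self):
        # Serve one queued request, but only while running.
        if self.paused or not self.queue:
            return None
        return self.queue.pop(0)

engine = ToyEngine()
engine.add_request("prompt-1")
engine.pause()
assert engine.step() is None        # nothing is served while paused
engine.resume()
assert engine.step() == "prompt-1"  # the preserved request is served
```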
Action Items
Users should upgrade to take advantage of the async-scheduling and pipeline-parallelism performance gains. Teams running RLHF workflows or Intel XPU hardware should review the updated APIs and deprecation notices.