Core Performance Improvements
vLLM v0.16.0 is a major release comprising 440 commits from 203 contributors. Its headline change is full support for async scheduling combined with pipeline parallelism, with a documented 30.8% end-to-end throughput improvement and a 31.8% improvement in time per output token. This represents a significant step forward for serving large language models at scale.
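As a minimal sketch of combining the two features, the snippet below builds the engine arguments and translates them into CLI-style flags. The exact flag names (`--async-scheduling`, `--pipeline-parallel-size`) and the model name are assumptions, not taken from the release notes; check `vllm serve --help` for the authoritative spelling.

```python
# Hypothetical launch configuration pairing async scheduling with
# pipeline parallelism (flag names assumed; verify against your vLLM
# version). Equivalent CLI:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct \
#       --async-scheduling --pipeline-parallel-size 2

engine_args = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    "async_scheduling": True,      # overlap CPU scheduling with GPU execution
    "pipeline_parallel_size": 2,   # split model layers across 2 GPUs
}

def to_cli_flags(args: dict) -> list[str]:
    """Turn the kwargs above into `vllm serve`-style CLI flags."""
    flags = []
    for key, value in args.items():
        if key == "model":
            continue  # the model is a positional argument, not a flag
        flag = "--" + key.replace("_", "-")
        if value is True:
            flags.append(flag)
        else:
            flags.extend([flag, str(value)])
    return flags

print(to_cli_flags(engine_args))
# → ['--async-scheduling', '--pipeline-parallel-size', '2']
```

Async scheduling overlaps the scheduler's CPU work with GPU execution, which is why it composes well with pipeline parallelism: the pipeline bubbles leave CPU headroom the scheduler can now use.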
New Capabilities
Realtime API: A new WebSocket-based Realtime API enables streaming audio interactions, building on the Voxtral realtime infrastructure for conversational AI applications.
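To make the streaming model concrete, here is a toy client-side sketch. The endpoint path and the event schema below are assumptions loosely modeled on OpenAI-style realtime events, not confirmed by the release notes; only the JSON plumbing is shown.

```python
import json

# Hypothetical endpoint and event shape (NOT confirmed by the release
# notes); the actual WebSocket URL and message schema may differ.
REALTIME_URL = "ws://localhost:8000/v1/realtime"

def audio_append_event(b64_chunk: str) -> str:
    """Serialize one base64-encoded streaming audio chunk as a JSON event."""
    return json.dumps({
        "type": "input_audio_buffer.append",  # assumed event type
        "audio": b64_chunk,
    })

# With a WebSocket client such as `websockets` (illustrative, not executed):
#   async with websockets.connect(REALTIME_URL) as ws:
#       await ws.send(audio_append_event(chunk))
#       reply = await ws.recv()

event = json.loads(audio_append_event("UklGRg=="))
print(event["type"])
```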
Speculative Decoding Enhancements:
- Unified Parallel Drafting framework simplifies deployment
- Speculative decoding now works with structured outputs
- Penalty application integrated into Model Runner V2
- Additional drafting model support including EAGLE3, AFMoE, and Mistral3
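The items above can be sketched as a configuration plus the usual back-of-envelope payoff calculation. The config key names below are assumptions (consult the vLLM docs for the exact `speculative_config` schema), and the speedup formula is the standard independence approximation, not a vLLM-specific formula.

```python
# Hypothetical speculative-decoding configuration (key names assumed).
speculative_config = {
    "method": "eagle3",               # one of the supported drafters
    "model": "path/to/eagle3-draft",  # placeholder draft-model path
    "num_speculative_tokens": 4,      # draft tokens proposed per step
}

def expected_tokens_per_step(acceptance_rate: float, k: int) -> float:
    """Expected tokens emitted per target-model step with k draft tokens,
    assuming independent per-token acceptance (a common back-of-envelope
    model): 1 bonus token plus the run of accepted drafts."""
    return sum(acceptance_rate ** i for i in range(k + 1))

# With an 80% acceptance rate and 4 drafts per step:
print(round(expected_tokens_per_step(0.8, 4), 3))
# → 3.362
```

Combining speculative decoding with structured outputs is notable because the draft tokens must also satisfy the grammar constraint, which previously forced users to choose one feature or the other.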
Model Support Expansion
This release adds support for 12+ new model architectures including:
- Vision models: GLM-OCR with MTP, DeepSeek-OCR-2, Intern-S1-Pro, MiniCPM-o 4.5
- Speech/Audio: Qwen3-ASR, FunAudioChat
- Other specialized models: ColBERT, voyage-4-nano, GLM-5
Additionally, LoRA support has expanded to Gemma3 vision components and Nemotron-H MTP models, with optimized fused MoE-LoRA kernel indexing.
Infrastructure & RLHF Improvements
- Native NCCL-based weight syncing API for distributed training
- Layerwise weight reloading for QeRL
- Engine pause/resume with request preservation for flexible training workflows
- Major XPU platform overhaul: IPEX is deprecated in favor of vllm-xpu-kernels, with new support for MoE, MXFP4 MoE, WNA16, scaled_mm, and FP8 MoE
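The pause/resume behavior can be illustrated with a toy engine. The real vLLM API names are not given in the notes, so this is a conceptual sketch of the semantics only: pausing stops scheduling without dropping queued requests, so an RLHF trainer can swap weights in and resume where it left off.

```python
# Toy illustration of engine pause/resume with request preservation.
# This mimics the described semantics; it is NOT the vLLM API.
class ToyEngine:
    def __init__(self):
        self.queue: list[str] = []
        self.paused = False

    def add_request(self, req: str) -> None:
        self.queue.append(req)

    def pause(self) -> None:
        # Stop scheduling new work but keep queued requests intact,
        # e.g. while an RLHF trainer reloads updated weights.
        self.paused = True

    def resume(self) -> None:
        self.paused = False

    def step(self):
        # Serve one queued request, but only while running.
        if self.paused or not self.queue:
            return None
        return self.queue.pop(0)

engine = ToyEngine()
engine.add_request("prompt-1")
engine.pause()
assert engine.step() is None        # nothing is served while paused
engine.resume()
assert engine.step() == "prompt-1"  # the preserved request is served
```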
Action Items
Users should upgrade to take advantage of the async-scheduling and pipeline-parallelism performance gains. Teams running RLHF workflows or Intel XPU hardware should review the updated APIs and deprecation notices.