Core Performance Improvements
vLLM v0.16.0 marks a major milestone: async scheduling combined with pipeline parallelism is now fully supported. This combination delivers the following performance gains:
- 30.8% end-to-end throughput improvement
- 31.8% TPOT (time per output token) improvement
- Optimized spec decode + async scheduling adds another 1.5% throughput gain
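To try the combination, the two features are enabled together at launch. A minimal sketch, assuming the current CLI flag names (`--async-scheduling` and `--pipeline-parallel-size` exist in recent vLLM releases; verify the exact spelling against the v0.16.0 docs, and the model name here is just a placeholder):

```shell
# Serve with 2 pipeline stages and the async scheduler enabled.
# Flag names assumed from recent vLLM CLI conventions.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --pipeline-parallel-size 2 \
  --async-scheduling
```

Async scheduling overlaps CPU-side scheduling work with GPU execution, which is why it compounds well with the inter-stage overlap that pipeline parallelism already provides.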
New APIs and Features
Realtime API: A new WebSocket-based Realtime API enables streaming audio interactions, building on the Voxtral realtime infrastructure for low-latency conversational experiences.
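As a sketch of what a client sends over such a WebSocket, the snippet below builds the two core message shapes: a session configuration frame and a base64-encoded audio chunk. The event names follow OpenAI Realtime API conventions (`session.update`, `input_audio_buffer.append`); whether vLLM's Realtime API uses the same schema is an assumption, so check the v0.16.0 docs before relying on it.

```python
import base64
import json

def session_update(voice: str, sample_rate: int) -> str:
    """JSON text frame configuring the session before streaming audio.
    Field names are assumptions modeled on the OpenAI Realtime schema."""
    return json.dumps({
        "type": "session.update",
        "session": {"voice": voice, "input_audio_format": f"pcm16@{sample_rate}"},
    })

def audio_append(pcm_chunk: bytes) -> str:
    """JSON text frame carrying one base64-encoded chunk of PCM16 audio."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })

# Build one config frame and one 10 ms audio frame (320 bytes of PCM16 at 16 kHz).
config_frame = session_update(voice="default", sample_rate=16000)
audio_frame = audio_append(b"\x00\x01" * 160)
```

Each frame would then be sent as a text message over the open WebSocket connection.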
RLHF Workflow: Significant improvements for reinforcement learning workflows include:
- Native NCCL-based weight syncing API
- Layerwise weight reloading for QeRL
- Engine pause/resume with request preservation
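The pause/resume workflow above follows a common RLHF training-loop pattern: halt scheduling, sync fresh policy weights into the inference engine, then resume with queued requests intact. The toy `Engine` below illustrates that pattern only; its names are hypothetical, not vLLM's actual API (in vLLM the equivalents are the engine pause/resume and NCCL weight-sync APIs mentioned above).

```python
from collections import deque

class Engine:
    """Hypothetical stand-in illustrating pause/resume with request preservation."""

    def __init__(self):
        self.paused = False
        self.pending = deque()   # in-flight requests survive a pause
        self.weights_version = 0

    def submit(self, request_id: str):
        self.pending.append(request_id)

    def pause(self):
        # Stop scheduling new decode steps; queued requests stay intact.
        self.paused = True

    def load_weights(self, version: int):
        assert self.paused, "sync weights only while the engine is paused"
        self.weights_version = version  # stand-in for an NCCL weight broadcast

    def resume(self):
        self.paused = False

engine = Engine()
engine.submit("req-1")
engine.pause()
engine.load_weights(version=1)
engine.resume()
# req-1 is still queued and will now decode with the updated weights
```

Pausing before the weight sync matters: updating parameters mid-step would let a single request mix logits from two different policy versions.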
Speculative Decoding: Unified Parallel Drafting now supports structured outputs and penalty application in Model Runner V2, expanding use cases for faster inference.
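For context, the draft-and-verify step at the heart of speculative decoding can be sketched as follows: the target model accepts each draft token with probability min(1, p_target / p_draft), and the first rejection discards the rest of the draft. Model calls are stubbed out as probability dicts here; this shows the acceptance algorithm, not vLLM's Model Runner V2 internals (where penalties would be applied to the target distribution before verification).

```python
import random

def verify(draft_tokens, p_draft, p_target, rng):
    """Return the prefix of draft_tokens accepted by the target model."""
    accepted = []
    for tok in draft_tokens:
        # Standard speculative-sampling acceptance rule.
        ratio = min(1.0, p_target[tok] / p_draft[tok])
        if rng.random() < ratio:
            accepted.append(tok)
        else:
            break  # first rejection: drop the remaining draft tokens
    return accepted

rng = random.Random(0)
p_draft = {"a": 0.5, "b": 0.5}
p_target = {"a": 0.9, "b": 0.9}  # target agrees strongly, so all are accepted
out = verify(["a", "b"], p_draft, p_target, rng)  # → ["a", "b"]
```

When the two distributions agree, most draft tokens are accepted and each verification pass emits several tokens for one target-model forward, which is where the speedup comes from.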
Model Support Expansion
This release adds support for 12 new model architectures, including GLM-OCR with MTP, Qwen3-ASR, DeepSeek-OCR-2, Intern-S1-Pro, and MiniCPM-o 4.5. LoRA support expands to Gemma3 vision components, with optimizations for MoE-LoRA inference. Additional changes address performance regressions, add Qwen3-Omni transcription support, and fix MRoPE positioning issues.
XPU Platform Overhaul
A major platform upgrade deprecates IPEX in favor of vllm-xpu-kernels, adding support for MoE, MXFP4 MoE, WNA16, scaled_mm, and FP8 MoE kernels.
Release Details
This release features 440 commits from 203 contributors (including 7 new contributors). The branch cut occurred on February 8, so features added after that date are not included in this version.