Core Performance Improvements
vLLM v0.16.0 marks a major milestone: async scheduling combined with pipeline parallelism is now fully supported. This combination delivers the following performance gains:
- 30.8% end-to-end throughput improvement
- 31.8% TPOT (time per output token) improvement
- Optimized spec decode + async scheduling adds another 1.5% throughput gain
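To try the combination, the two features are enabled together at launch. A minimal sketch, assuming the current CLI flag names (`--async-scheduling` and `--pipeline-parallel-size` exist in recent vLLM releases; verify the exact spelling against the v0.16.0 docs, and the model name here is just a placeholder):

```shell
# Serve with 2 pipeline stages and the async scheduler enabled.
# Flag names assumed from recent vLLM CLI conventions.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --pipeline-parallel-size 2 \
  --async-scheduling
```

Async scheduling overlaps CPU-side scheduling work with GPU execution, which is why it compounds well with the inter-stage overlap that pipeline parallelism already provides.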
New APIs and Features
Realtime API: A new WebSocket-based Realtime API enables streaming audio interactions, building on the Voxtral realtime infrastructure for low-latency conversational experiences.
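As a sketch of what a client sends over such a WebSocket, the snippet below builds the two core message shapes: a session configuration frame and a base64-encoded audio chunk. The event names follow OpenAI Realtime API conventions (`session.update`, `input_audio_buffer.append`); whether vLLM's Realtime API uses the same schema is an assumption, so check the v0.16.0 docs before relying on it.

```python
import base64
import json

def session_update(voice: str, sample_rate: int) -> str:
    """JSON text frame configuring the session before streaming audio.
    Field names are assumptions modeled on the OpenAI Realtime schema."""
    return json.dumps({
        "type": "session.update",
        "session": {"voice": voice, "input_audio_format": f"pcm16@{sample_rate}"},
    })

def audio_append(pcm_chunk: bytes) -> str:
    """JSON text frame carrying one base64-encoded chunk of PCM16 audio."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })

# Build one config frame and one 10 ms audio frame (320 bytes of PCM16 at 16 kHz).
config_frame = session_update(voice="default", sample_rate=16000)
audio_frame = audio_append(b"\x00\x01" * 160)
```

Each frame would then be sent as a text message over the open WebSocket connection.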
RLHF Workflow: Significant improvements for reinforcement learning workflows include:
- Native NCCL-based weight syncing API
- Layerwise weight reloading for QeRL
- Engine pause/resume with request preservation
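The pause/resume workflow above follows a common RLHF training-loop pattern: halt scheduling, sync fresh policy weights into the inference engine, then resume with queued requests intact. The toy `Engine` below illustrates that pattern only; its names are hypothetical, not vLLM's actual API (in vLLM the equivalents are the engine pause/resume and NCCL weight-sync APIs mentioned above).

```python
from collections import deque

class Engine:
    """Hypothetical stand-in illustrating pause/resume with request preservation."""

    def __init__(self):
        self.paused = False
        self.pending = deque()   # in-flight requests survive a pause
        self.weights_version = 0

    def submit(self, request_id: str):
        self.pending.append(request_id)

    def pause(self):
        # Stop scheduling new decode steps; queued requests stay intact.
        self.paused = True

    def load_weights(self, version: int):
        assert self.paused, "sync weights only while the engine is paused"
        self.weights_version = version  # stand-in for an NCCL weight broadcast

    def resume(self):
        self.paused = False

engine = Engine()
engine.submit("req-1")
engine.pause()
engine.load_weights(version=1)
engine.resume()
# req-1 is still queued and will now decode with the updated weights
```

Pausing before the weight sync matters: updating parameters mid-step would let a single request mix logits from two different policy versions.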
Speculative Decoding: Unified Parallel Drafting now supports structured outputs and penalty application in Model Runner V2, expanding use cases for faster inference.
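For context, the draft-and-verify step at the heart of speculative decoding can be sketched as follows: the target model accepts each draft token with probability min(1, p_target / p_draft), and the first rejection discards the rest of the draft. Model calls are stubbed out as probability dicts here; this shows the acceptance algorithm, not vLLM's Model Runner V2 internals (where penalties would be applied to the target distribution before verification).

```python
import random

def verify(draft_tokens, p_draft, p_target, rng):
    """Return the prefix of draft_tokens accepted by the target model."""
    accepted = []
    for tok in draft_tokens:
        # Standard speculative-sampling acceptance rule.
        ratio = min(1.0, p_target[tok] / p_draft[tok])
        if rng.random() < ratio:
            accepted.append(tok)
        else:
            break  # first rejection: drop the remaining draft tokens
    return accepted

rng = random.Random(0)
p_draft = {"a": 0.5, "b": 0.5}
p_target = {"a": 0.9, "b": 0.9}  # target agrees strongly, so all are accepted
out = verify(["a", "b"], p_draft, p_target, rng)  # → ["a", "b"]
```

When the two distributions agree, most draft tokens are accepted and each verification pass emits several tokens for one target-model forward, which is where the speedup comes from.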
Model Support Expansion
This release adds support for 12 new model architectures, including GLM-OCR with MTP, Qwen3-ASR, DeepSeek-OCR-2, Intern-S1-Pro, and MiniCPM-o 4.5. LoRA support expands to Gemma3 vision components, with optimizations for MoE-LoRA inference. Additional changes address performance regressions, add Qwen3-Omni transcription support, and fix MRoPE positioning issues.
XPU Platform Overhaul
A major platform upgrade deprecates IPEX in favor of vllm-xpu-kernels, adding support for MoE, MXFP4 MoE, WNA16, scaled_mm, and FP8 MoE kernels.
Release Details
This release features 440 commits from 203 contributors (including 7 new contributors). The branch cut occurred on February 8, so features added after that date are not included in this version.