vLLM Changelogs
vLLM v0.16.0 brings async scheduling with pipeline parallelism, 30.8% throughput gains

vLLM's latest release introduces async scheduling combined with pipeline parallelism, delivering significant performance improvements, along with new WebSocket-based real-time audio streaming capabilities. The update adds support for 12+ new model architectures and brings major enhancements to speculative decoding, RLHF workflows, and the Intel XPU platform.
Tags: release, feature, performance, api, sdk
vLLM v0.16.0rc3 fixes MTP accuracy issue with GLM-5 model

vLLM publishes a release-candidate patch addressing an accuracy bug in Multi-Token Prediction (MTP) for the GLM-5 model. The fix ensures more reliable inference results for users running GLM-5 on vLLM.
Tags: release, bugfix