vLLM v0.15.1 ships security patches, RTX Blackwell GPU fixes, and 4x faster torch.compile startup
release · bugfix · security · performance · model · github.com

Security Updates

vLLM v0.15.1 patches two critical security vulnerabilities in upstream dependencies:

  • CVE-2025-69223: updated the aiohttp dependency to address a security vulnerability
  • CVE-2026-0994: updated the protobuf dependency to address a security vulnerability

Users running production deployments should upgrade to ensure these fixes are applied.
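A deployment can guard against accidentally running a vulnerable build with a version check at startup. The sketch below is illustrative, not part of vLLM; it uses a dependency-free tuple comparison (for pre-release tags like `rc1`, `packaging.version.Version` is more robust), and the commented-out usage assumes the standard `vllm.__version__` attribute.

```python
def parse_version(v: str) -> tuple:
    # Convert "0.15.1" into (0, 15, 1) for lexicographic tuple comparison.
    # Note: this simple parser ignores pre-release suffixes such as "rc1".
    return tuple(int(part) for part in v.split("."))

PATCHED = parse_version("0.15.1")

def is_patched(installed: str) -> bool:
    """Return True if the installed vLLM carries the v0.15.1 security fixes."""
    return parse_version(installed) >= PATCHED

# Hypothetical usage: refuse to start on a vulnerable build.
# import vllm
# assert is_patched(vllm.__version__), "upgrade vLLM to >= 0.15.1"
```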

Hardware Support & Bug Fixes

RTX Blackwell GPU Support: Fixed critical issues preventing NVFP4 MoE (Mixture of Experts) models from loading on RTX Blackwell (SM120) workstation GPUs. Additionally resolved FP8 CUTLASS group GEMM kernel selection, which now properly falls back to Triton kernels on SM120 hardware.

Model Support: Added support for Step-3.5-Flash model and fixed loading issues for Qwen3-VL-Reranker and Whisper with FlashAttention2.

Performance Improvements

torch.compile Startup: Fixed a regression that significantly increased cold-start compilation time. Llama3-70B cold-start compilation now completes in ~22 seconds, down from ~88 seconds, a 4x improvement.
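To check the improvement on your own hardware, you can time engine construction directly. This is a generic timing sketch, not a vLLM utility; the commented-out usage assumes vLLM's documented `LLM` constructor and requires a GPU host with the model weights available. The ~22 s / ~88 s figures above come from the release's own Llama3-70B measurement, not from this snippet.

```python
import time

def time_cold_start(build_fn, *args, **kwargs):
    """Time a single cold-start invocation of build_fn; return (result, seconds)."""
    start = time.perf_counter()
    result = build_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Hypothetical usage against vLLM:
# from vllm import LLM
# _, seconds = time_cold_start(LLM, model="meta-llama/Meta-Llama-3-70B")
# print(f"cold start: {seconds:.1f}s")
```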

MoE Optimization: Optimized the Mixture of Experts forward pass by caching layer name computation, reducing redundant operations.
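The caching pattern described, deriving a layer's name once and reusing it on every forward pass, can be sketched in plain Python. The class and attribute names below are hypothetical stand-ins, not vLLM's actual MoE code; the point is that `cached_property` moves the string construction out of the hot path.

```python
from functools import cached_property

class MoELayer:
    """Toy stand-in for an MoE layer; names here are illustrative, not vLLM's."""

    name_computations = 0  # counts how often the name derivation actually runs

    def __init__(self, prefix: str, index: int):
        self._prefix = prefix
        self._index = index

    @cached_property
    def layer_name(self) -> str:
        # Computed once on first access, then served from the instance dict,
        # so repeated forward passes skip the string formatting entirely.
        MoELayer.name_computations += 1
        return f"{self._prefix}.layers.{self._index}.mlp.experts"

    def forward(self, x):
        _ = self.layer_name  # cache hit on every call after the first
        return x  # real routing and expert dispatch elided

layer = MoELayer("model", 3)
for _ in range(1000):
    layer.forward(0)  # the name is computed only on the first iteration
```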

Additional Fixes

  • Resolved prefix cache hit rate issue with GPT-OSS style hybrid attention models (previously hitting 0%)
  • Enabled Triton MoE backend for FP8 per-tensor dynamic quantization
  • Fixed speculative decoding metrics crash when no tokens are generated
  • Fixed ROCm skinny GEMM dispatch logic
  • Pinned the LMCache dependency to >= v0.3.9 for API compatibility

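For context on the Triton MoE fix above: per-tensor dynamic quantization computes a single scale for the whole tensor from the current activation's magnitude at runtime, rather than using a precomputed static scale. A minimal sketch, with illustrative function names (448.0 is the largest finite value in the standard FP8 E4M3 format; real kernels also round to the nearest representable FP8 value, which is elided here):

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def dynamic_per_tensor_scale(values):
    """Derive one scale for the whole tensor from its current max magnitude."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0

def quantize(values, scale):
    # Divide by the scale and clamp into the FP8 range; dequantizing with
    # q * scale then recovers the input up to FP8 rounding error.
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]

acts = [0.5, -3.2, 7.1, -0.01]
scale = dynamic_per_tensor_scale(acts)  # recomputed on every forward pass
quantized = quantize(acts, scale)
```

Because the scale tracks each batch's actual range, dynamic quantization avoids the calibration step that static per-tensor scales require, at the cost of an extra max-reduction per forward pass.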
Developers using vLLM with Blackwell GPUs or large model inference should prioritize this upgrade for both security and performance improvements.