vLLM v0.15.1 ships security patches, RTX Blackwell GPU fixes, and 4x faster torch.compile startup
release · bugfix · security · performance · model · github.com

Security Updates

vLLM v0.15.1 patches two critical security vulnerabilities in upstream dependencies:

  • CVE-2025-69223: updated the aiohttp dependency to address a security vulnerability
  • CVE-2026-0994: updated the protobuf dependency to address a security vulnerability

Users running production deployments should upgrade to ensure these fixes are applied.
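A deployment can guard against accidentally running a vulnerable build with a version check at startup. The sketch below is illustrative, not part of vLLM; it uses a dependency-free tuple comparison (for pre-release tags like `rc1`, `packaging.version.Version` is more robust), and the commented-out usage assumes the standard `vllm.__version__` attribute.

```python
def parse_version(v: str) -> tuple:
    # Convert "0.15.1" into (0, 15, 1) for lexicographic tuple comparison.
    # Note: this simple parser ignores pre-release suffixes such as "rc1".
    return tuple(int(part) for part in v.split("."))

PATCHED = parse_version("0.15.1")

def is_patched(installed: str) -> bool:
    """Return True if the installed vLLM carries the v0.15.1 security fixes."""
    return parse_version(installed) >= PATCHED

# Hypothetical usage: refuse to start on a vulnerable build.
# import vllm
# assert is_patched(vllm.__version__), "upgrade vLLM to >= 0.15.1"
```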

Hardware Support & Bug Fixes

RTX Blackwell GPU Support: Fixed critical issues preventing NVFP4 MoE (Mixture of Experts) models from loading on RTX Blackwell (SM120) workstation GPUs. Additionally resolved FP8 CUTLASS group GEMM kernel selection, which now properly falls back to Triton kernels on SM120 hardware.

Model Support: Added support for Step-3.5-Flash model and fixed loading issues for Qwen3-VL-Reranker and Whisper with FlashAttention2.

Performance Improvements

torch.compile Startup: Fixed a regression that significantly increased cold-start compilation time. Llama3-70B cold-start compilation now completes in ~22 seconds, down from ~88 seconds, a 4x improvement.
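To check the improvement on your own hardware, you can time engine construction directly. This is a generic timing sketch, not a vLLM utility; the commented-out usage assumes vLLM's documented `LLM` constructor and requires a GPU host with the model weights available. The ~22 s / ~88 s figures above come from the release's own Llama3-70B measurement, not from this snippet.

```python
import time

def time_cold_start(build_fn, *args, **kwargs):
    """Time a single cold-start invocation of build_fn; return (result, seconds)."""
    start = time.perf_counter()
    result = build_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Hypothetical usage against vLLM:
# from vllm import LLM
# _, seconds = time_cold_start(LLM, model="meta-llama/Meta-Llama-3-70B")
# print(f"cold start: {seconds:.1f}s")
```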

MoE Optimization: Optimized the Mixture of Experts forward pass by caching layer name computation, reducing redundant operations.
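The caching pattern described, deriving a layer's name once and reusing it on every forward pass, can be sketched in plain Python. The class and attribute names below are hypothetical stand-ins, not vLLM's actual MoE code; the point is that `cached_property` moves the string construction out of the hot path.

```python
from functools import cached_property

class MoELayer:
    """Toy stand-in for an MoE layer; names here are illustrative, not vLLM's."""

    name_computations = 0  # counts how often the name derivation actually runs

    def __init__(self, prefix: str, index: int):
        self._prefix = prefix
        self._index = index

    @cached_property
    def layer_name(self) -> str:
        # Computed once on first access, then served from the instance dict,
        # so repeated forward passes skip the string formatting entirely.
        MoELayer.name_computations += 1
        return f"{self._prefix}.layers.{self._index}.mlp.experts"

    def forward(self, x):
        _ = self.layer_name  # cache hit on every call after the first
        return x  # real routing and expert dispatch elided

layer = MoELayer("model", 3)
for _ in range(1000):
    layer.forward(0)  # the name is computed only on the first iteration
```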

Additional Fixes

  • Resolved prefix cache hit rate issue with GPT-OSS style hybrid attention models (previously hitting 0%)
  • Enabled Triton MoE backend for FP8 per-tensor dynamic quantization
  • Fixed speculative decoding metrics crash when no tokens are generated
  • Fixed ROCm skinny GEMM dispatch logic
  • Pinned the LMCache dependency to >= v0.3.9 for API compatibility

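For context on the Triton MoE fix above: per-tensor dynamic quantization computes a single scale for the whole tensor from the current activation's magnitude at runtime, rather than using a precomputed static scale. A minimal sketch, with illustrative function names (448.0 is the largest finite value in the standard FP8 E4M3 format; real kernels also round to the nearest representable FP8 value, which is elided here):

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def dynamic_per_tensor_scale(values):
    """Derive one scale for the whole tensor from its current max magnitude."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0

def quantize(values, scale):
    # Divide by the scale and clamp into the FP8 range; dequantizing with
    # q * scale then recovers the input up to FP8 rounding error.
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in values]

acts = [0.5, -3.2, 7.1, -0.01]
scale = dynamic_per_tensor_scale(acts)  # recomputed on every forward pass
quantized = quantize(acts, scale)
```

Because the scale tracks each batch's actual range, dynamic quantization avoids the calibration step that static per-tensor scales require, at the cost of an extra max-reduction per forward pass.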
Developers using vLLM with Blackwell GPUs or large model inference should prioritize this upgrade for both security and performance improvements.