vLLM v0.16.0rc3 fixes MTP accuracy issue with GLM-5 model

vLLM v0.16.0rc3 Release

vLLM has released v0.16.0rc3, a release candidate that includes a critical bugfix for Multi-Token Prediction (MTP) accuracy when using the GLM-5 model.

What Changed

This release corrects an accuracy regression in the MTP implementation for GLM-5. The bugfix ensures that Multi-Token Prediction produces correct results when proposing and verifying multiple tokens per decoding step, so the latency benefit of speculative decoding no longer comes at the cost of output quality.
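For orientation, MTP in vLLM is enabled through a speculative-decoding configuration passed to the engine. The sketch below is illustrative only: the `"method"` value, the token count, and the model placeholder are assumptions for this example, not details taken from the release notes; consult the vLLM documentation for the exact GLM-5 settings.

```python
# Illustrative sketch: a speculative-decoding config of the kind vLLM
# accepts for MTP-style drafting. Values here are assumptions, not
# settings confirmed by the v0.16.0rc3 release notes.
speculative_config = {
    "method": "mtp",              # assumed identifier for the MTP proposer
    "num_speculative_tokens": 1,  # draft tokens proposed per decoding step
}

# In a real deployment this dict would be passed to the engine, e.g.:
#   from vllm import LLM
#   llm = LLM(model="<path-to-glm-5-checkpoint>",
#             speculative_config=speculative_config)
```

The accuracy fix in this release concerns the verification side of exactly this mode: the engine must accept only speculated tokens that match what plain one-token-at-a-time decoding would have produced.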

Who This Affects

  • Users running GLM-5 models on vLLM with MTP enabled
  • Developers optimizing inference latency through parallel token generation
  • Production deployments requiring high accuracy guarantees

Next Steps

Users should update to v0.16.0rc3 or wait for the stable v0.16.0 release. If you currently run GLM-5 with MTP enabled, re-verify output accuracy on your own workloads after upgrading to the patched version.
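After upgrading, a quick sanity check is to confirm the running build is at least the patched version. A minimal sketch, assuming vLLM is installed as the `vllm` distribution and using the `packaging` library (shipped alongside pip in most environments) for PEP 440-aware version comparison, where `0.16.0rc2 < 0.16.0rc3 < 0.16.0`:

```python
from importlib.metadata import PackageNotFoundError, version

from packaging.version import Version  # PEP 440-aware comparison


def is_at_least(installed: str, minimum: str) -> bool:
    """True if `installed` is the same release as `minimum` or newer."""
    return Version(installed) >= Version(minimum)


def vllm_has_mtp_fix(minimum: str = "0.16.0rc3") -> bool:
    """Check whether the installed vLLM carries the GLM-5 MTP fix.

    Illustrative helper: only the version string comes from the
    release notes; the function name and check are this example's own.
    """
    try:
        return is_at_least(version("vllm"), minimum)
    except PackageNotFoundError:
        return False  # vLLM is not installed in this environment
```

Note that under PEP 440 ordering the eventual stable `0.16.0` sorts after `0.16.0rc3`, so this check remains valid once the final release lands.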