vLLM v0.16.0rc3 fixes MTP accuracy issue with GLM-5 model
vLLM v0.16.0rc3 Release
vLLM has released v0.16.0rc3, a release candidate that includes a critical bugfix for Multi-Token Prediction (MTP) accuracy when serving the GLM-5 model.
What Changed
This release corrects an accuracy regression in the MTP implementation for GLM-5. MTP accelerates decoding by drafting several tokens per step and verifying them against the base model, so it should reduce latency without degrading output quality. The regression broke that guarantee when multiple tokens were generated in parallel; the fix restores correct results while preserving the speedup.
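As a sketch of how MTP is typically enabled, the launch command below follows vLLM's speculative-decoding interface. The checkpoint path is a placeholder, and the exact `--speculative-config` keys and accepted values can vary by version, so check the release notes for your build:

```shell
# Illustrative launch command: enables MTP speculative decoding for a GLM-5
# checkpoint. Replace the placeholder path with your model; the config keys
# ("method", "num_speculative_tokens") follow vLLM's speculative decoding
# interface and may differ between versions.
vllm serve <path-to-glm-5-checkpoint> \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```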
Who This Affects
- Users running GLM-5 models on vLLM with MTP enabled
- Developers optimizing inference latency through parallel token generation
- Production deployments requiring high accuracy guarantees
Next Steps
Users should upgrade to v0.16.0rc3 or wait for the stable v0.16.0 release. If you currently serve GLM-5 with MTP enabled, verify accuracy on your own workloads against the patched version.