vLLM v0.16.0rc3 fixes MTP accuracy issue with GLM-5 model
vLLM v0.16.0rc3 Release
vLLM has released v0.16.0rc3, a release candidate that includes a critical bugfix for Multi-Token Prediction (MTP) accuracy when serving the GLM-5 model.
What Changed
This release corrects an accuracy regression in the MTP implementation for GLM-5. MTP accelerates decoding by drafting several tokens per step and verifying them against the base model, so it should reduce latency without degrading output quality. The regression broke that guarantee when multiple tokens were generated in parallel; the fix restores correct results while preserving the speedup.
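As a sketch of how MTP is typically enabled, the launch command below follows vLLM's speculative-decoding interface. The checkpoint path is a placeholder, and the exact `--speculative-config` keys and accepted values can vary by version, so check the release notes for your build:

```shell
# Illustrative launch command: enables MTP speculative decoding for a GLM-5
# checkpoint. Replace the placeholder path with your model; the config keys
# ("method", "num_speculative_tokens") follow vLLM's speculative decoding
# interface and may differ between versions.
vllm serve <path-to-glm-5-checkpoint> \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```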
Who This Affects
- Users running GLM-5 models on vLLM with MTP enabled
- Developers optimizing inference latency through parallel token generation
- Production deployments requiring high accuracy guarantees
Next Steps
Users should upgrade to v0.16.0rc3 or wait for the stable v0.16.0 release. If you currently serve GLM-5 with MTP enabled, verify accuracy on your own workloads against the patched version.