Ollama v0.17.5 adds Qwen3.5 small models, fixes GPU/CPU split crashes
New Models
Ollama now supports the Qwen3.5 small model series with four size options: 0.8B, 2B, 4B, and 9B parameters. These smaller variants offer more efficient options for resource-constrained environments while retaining much of the capability of the larger Qwen3.5 lineup.
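Assuming the new sizes follow Ollama's usual size-suffix tag convention (the exact tags may differ; check the model library before use), pulling and running one of the small variants would look like:

```shell
# Fetch one of the new small Qwen3.5 variants.
# The tag name (qwen3.5:2b) is an assumption based on Ollama's
# usual size-suffix convention; verify it in the model library.
ollama pull qwen3.5:2b

# Start an interactive session with the 2B model.
ollama run qwen3.5:2b
```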
Bug Fixes and Improvements
This patch release focuses on stability and performance improvements:
- GPU/CPU split crash: Fixed a critical crash that occurred when Qwen3.5 models were split between GPU and CPU memory
- Token repetition: Resolved an issue where Qwen3.5 models would repeat themselves due to a missing presence penalty. Note that users may need to redownload Qwen3.5 models (e.g., `ollama pull qwen3.5:35b`) to apply this fix
- Memory monitoring: The `ollama run --verbose` command now displays peak memory usage when using Ollama's MLX engine
- MLX stability: Fixed memory issues and crashes affecting the MLX runner
- GGUF compatibility: Resolved an issue preventing Ollama from running models imported from Qwen3.5 GGUF files
Action Items
Users experiencing Qwen3.5 model issues should pull the latest versions to receive the presence penalty fix. Developers testing Ollama's MLX engine will benefit from improved stability and better visibility into memory consumption via the verbose flag.
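Both action items can be sketched as follows (the model tag is illustrative; substitute whichever Qwen3.5 variant you actually use):

```shell
# Re-pull the model to pick up the presence-penalty fix;
# the 4b tag here is an illustrative placeholder.
ollama pull qwen3.5:4b

# Run with --verbose so the MLX engine reports peak memory
# usage alongside the usual timing statistics.
ollama run --verbose qwen3.5:4b "Summarize this release."
```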