Qwen3.5 Model Family
Alibaba has launched Qwen3.5, a comprehensive model family designed to serve diverse deployment scenarios. The lineup includes:
- Large models: 35B-A3B, 27B, 122B-A10B, and 397B-A17B (the "-A" suffix gives the number of active parameters in the mixture-of-experts variants)
- Small models: 0.8B, 2B, 4B, and 9B parameters
- Multimodal capabilities: Hybrid reasoning LLMs supporting vision, text, and agentic coding tasks
Key Features
Context & Language Support
- 256K context window (extendable to 1M via YaRN)
- Multilingual support across 201 languages
- Supports up to 32,768 output tokens
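The 1M extension via YaRN is a RoPE-scaling override applied at load time. The sketch below is illustrative only: the field names follow the Transformers `rope_scaling` convention used by earlier Qwen releases, and the values are assumptions derived from the 256K → 1M figures above, so check the Qwen3.5 model card before using them.

```python
# Hypothetical YaRN override; field names follow the Transformers
# rope_scaling convention, values derived from the 256K -> 1M figures.
NATIVE_CTX = 262_144    # 256K native window
TARGET_CTX = 1_048_576  # 1M extended window

yarn_rope_scaling = {
    "rope_type": "yarn",
    "factor": TARGET_CTX / NATIVE_CTX,  # 4.0
    "original_max_position_embeddings": NATIVE_CTX,
}

# Typically passed when loading the model, e.g.:
# AutoModelForCausalLM.from_pretrained(model_id, rope_scaling=yarn_rope_scaling)
```

Note that YaRN scaling is usually applied statically, so it is best enabled only when long contexts are actually needed.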
Reasoning Capabilities
- Hybrid thinking and non-thinking modes for flexible inference
- Thinking mode optimized for complex reasoning tasks
- Non-thinking (Instruct) mode for faster, direct responses
- Reasoning disabled by default on Small models (0.8B-9B)
Hardware Requirements
The models support multiple quantization levels with varying memory footprints:
- 35B-A3B: 22GB (4-bit) on compatible devices like high-end Macs
- 27B: 17GB (4-bit)
- Small models (0.8B-9B): As low as 3GB (3-bit) to 19GB (BF16)
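As a sanity check on these footprints, a back-of-the-envelope estimate is parameters × bits ÷ 8 for the weights, plus some overhead for quantization metadata and runtime buffers. The ~20% overhead factor below is an assumption for illustration, not a published figure.

```python
def est_memory_gb(params_billions: float, bits: float, overhead: float = 1.2) -> float:
    """Rough memory footprint: weight bytes (params * bits / 8) scaled by
    an assumed ~20% overhead for quant metadata and runtime buffers."""
    weight_gb = params_billions * bits / 8  # 1B params at 8-bit = 1 GB
    return weight_gb * overhead

# 27B at 4-bit: ~16 GB, close to the 17 GB figure above
print(round(est_memory_gb(27, 4), 1))  # → 16.2
```

The estimate excludes the KV cache, which grows with context length and can dominate memory at the 256K-token window.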
Deployment & Optimization
All model uploads use Unsloth Dynamic 2.0 quantization, which selectively upcasts important layers to 8- or 16-bit precision within otherwise 4-bit quants for superior performance. GGUF variants are available for llama.cpp-compatible backends (currently not compatible with Ollama).
Fine-tuning support is available through Unsloth, and comprehensive inference tutorials are provided for each model size. Developers can control reasoning behavior via chat template parameters (enable_thinking flag).
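The enable_thinking switch can be sketched as a chat-template branch: when thinking is disabled, an empty think block is pre-filled so the model skips its reasoning trace and answers directly. The fragment below mirrors the ChatML-style template used by earlier Qwen releases and is a simplified assumption, not the exact Qwen3.5 template.

```python
def render_prompt(user_msg: str, enable_thinking: bool = True) -> str:
    """Simplified ChatML-style prompt builder. With thinking disabled,
    an empty <think></think> block is pre-filled (assumed behavior,
    mirroring earlier Qwen hybrid-reasoning templates)."""
    prompt = (
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    if not enable_thinking:
        prompt += "<think>\n\n</think>\n\n"
    return prompt
```

In practice the flag is passed to the tokenizer rather than built by hand, e.g. `tokenizer.apply_chat_template(messages, add_generation_prompt=True, enable_thinking=False)`.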
Recent Updates
A March 2 update delivered tool-calling improvements via chat template fixes, with the benefits applying across all Qwen3.5 formats and uploads. MXFP4 layers have been retired from select quantization variants (Q2_K_XL, Q3_K_XL, Q4_K_XL) based on quantization sensitivity analysis.