MoE Training Performance Breakthrough
Unsloth released significant optimizations for training Mixture of Experts language models, achieving ~12x faster training compared to Transformers v4, with >35% lower VRAM usage and support for ~6x longer context windows. The improvements come through a combination of custom Triton kernels and a novel "Split LoRA" approach that maintains full accuracy.
Supported Models and Hardware
The platform now supports fast MoE training for major models including:
- Qwen3 (30B, 235B, VL, Coder variants)
- DeepSeek R1 and V3
- GPT-OSS (20B, 120B, 500K context versions)
- GLM (4.6, 4.7, Flash)
Notably, gpt-oss-20b now fine-tunes on just 12.8 GB VRAM, and the kernels work across data-center GPUs (B200, H100), consumer hardware (RTX 3090), and older generations, with support for FFT, LoRA, and QLoRA training methods.
Technical Innovations
Automatic Backend Selection
Unsloth automatically selects the optimal backend for your hardware:
- grouped_mm: PyTorch's native torch._grouped_mm function (optimized for H100s and newer, works on T4 through B200)
- unsloth_triton: Custom Triton kernels achieving a 2.5× speedup over grouped_mm on A100s, with auto-tuning that adds ~2 minutes of overhead but yields 35% faster training runs
- native_torch: Fallback for maximum compatibility, though up to 12x slower
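To make the grouped_mm idea concrete: a grouped matmul multiplies each contiguous slice of routed tokens by its own expert's weight matrix in a single fused operation, rather than looping over experts in Python. The sketch below is a NumPy reference of that computation only; the function name, argument layout, and offsets convention are illustrative assumptions, not the real torch._grouped_mm API.

```python
import numpy as np

def grouped_mm_reference(tokens, expert_weights, offsets):
    """Reference semantics of a grouped matmul (illustrative, not the real API).

    tokens:         (T, d_in) token activations, pre-sorted by assigned expert
    expert_weights: (E, d_in, d_out) one weight matrix per expert
    offsets:        offsets[e] is the end index of expert e's token slice
    """
    out = np.empty((tokens.shape[0], expert_weights.shape[2]))
    start = 0
    for e, end in enumerate(offsets):
        # A fused kernel would do all of these GEMMs in one launch.
        out[start:end] = tokens[start:end] @ expert_weights[e]
        start = end
    return out

rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 4))
weights = rng.standard_normal((2, 4, 3))
offsets = [4, 6]  # expert 0 handles tokens 0..3, expert 1 handles tokens 4..5
y = grouped_mm_reference(tokens, weights, offsets)
assert y.shape == (6, 3)
```

The fused kernel avoids per-expert Python overhead and small-GEMM launch costs, which is where the speedups over the naive loop come from.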
Split LoRA Approach
The key innovation is the Split LoRA method for efficient MoE training, which reduces memory consumption by ~35% and delivers 2x faster training compared to Transformers v5 + torch._grouped_mm.
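Unsloth has not published the Split LoRA internals in this summary, but the general shape of per-expert low-rank adaptation can be sketched: the fused expert weights stay frozen, and each expert gets a small trainable low-rank update. Everything below (names, shapes, initialization) is a conceptual assumption for illustration, not Unsloth's actual implementation.

```python
import numpy as np

# Conceptual sketch only -- NOT Unsloth's Split LoRA code.
# Frozen fused expert weights plus a per-expert low-rank (LoRA) update;
# only the small A/B factors would be trained, so gradient and optimizer
# memory scales with rank r instead of d_in * d_out per expert.

E, d_in, d_out, r = 2, 4, 3, 2
rng = np.random.default_rng(1)
W = rng.standard_normal((E, d_in, d_out))       # frozen fused expert weights
A = rng.standard_normal((E, d_in, r)) * 0.01    # trainable "down" projection
B = np.zeros((E, r, d_out))                     # trainable "up" projection (zero init)

def expert_forward(x, e):
    # Base expert path plus low-rank correction; zero-initialized B means
    # the adapter starts as a no-op, the standard LoRA convention.
    return x @ W[e] + (x @ A[e]) @ B[e]
```

With B initialized to zero the adapted forward pass exactly matches the frozen model at the start of training, so fine-tuning begins from the pretrained behavior.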
Integration with Transformers
This work was done in collaboration with Hugging Face, standardizing MoE training with PyTorch's torch._grouped_mm. Transformers v5 reorganized expert weights from a ModuleList of individual experts to a single nn.Parameter, enabling grouped matrix multiplication for dramatically faster computation. Unsloth's custom kernels push performance even further on compatible hardware.
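The storage change described above can be illustrated in NumPy: the old layout keeps one weight matrix per expert in a list (the ModuleList style), while the new layout stacks them into a single 3-D array that a batched or grouped matmul can consume in one call. Shapes here are arbitrary toy values.

```python
import numpy as np

# Old layout: a Python list of per-expert matrices (ModuleList-style).
# New layout: one fused 3-D array, analogous to a single nn.Parameter,
# which batched/grouped matmul kernels can process in one operation.

rng = np.random.default_rng(2)
E, d_in, d_out = 3, 4, 5
per_expert = [rng.standard_normal((d_in, d_out)) for _ in range(E)]  # old
fused = np.stack(per_expert)                                         # new: (E, d_in, d_out)

x = rng.standard_normal((E, 7, d_in))  # toy: 7 tokens routed to each expert
old = np.stack([x[e] @ per_expert[e] for e in range(E)])  # loop per expert
new = np.einsum('etd,edo->eto', x, fused)                 # one batched op
assert np.allclose(old, new)
```

The two layouts compute identical results; the fused form simply exposes all expert GEMMs to a single kernel, which is what makes the grouped-matmul speedups possible.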
Getting Started
Users can upgrade via pip install --upgrade unsloth unsloth_zoo and access pre-built Colab notebooks for various models and configurations. Backend selection can be manually overridden via environment variables (UNSLOTH_MOE_BACKEND). Note: 4-bit QLoRA training for MoE models is not currently recommended due to BitsandBytes limitations; use bf16 for now.
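Since backend selection happens at import time, the override has to be set before unsloth is imported. The variable name comes from the text above; the value strings are assumed to match the backend names listed earlier.

```python
import os

# Set the backend override BEFORE importing unsloth, since the library
# reads it during import-time backend selection. Valid values are assumed
# to be the backend names listed above.
os.environ["UNSLOTH_MOE_BACKEND"] = "unsloth_triton"  # or "grouped_mm" / "native_torch"

# import unsloth  # import only after the environment variable is set
```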