Unsloth ships MoE training kernels with 12x speedup and 35% lower VRAM usage
feature · performance · sdk · open-source · unsloth.ai

MoE Training Performance Breakthrough

Unsloth released significant optimizations for training Mixture of Experts (MoE) language models, achieving ~12x faster training than Transformers v4, with >35% lower VRAM usage and support for ~6x longer context windows. The improvements come from a combination of custom Triton kernels and a novel "Split LoRA" approach that maintains full accuracy.

Supported Models and Hardware

The platform now supports fast MoE training for major models including:

  • Qwen3 (30B, 235B, VL, Coder variants)
  • DeepSeek R1 and V3
  • GPT-OSS (20B, 120B, 500K context versions)
  • GLM (4.6, 4.7, Flash)

Notably, gpt-oss-20b now fine-tunes on just 12.8 GB of VRAM, and the kernels work across data-center GPUs (B200, H100), consumer hardware (RTX 3090), and older generations, with support for full fine-tuning (FFT), LoRA, and QLoRA.
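To see why a 20B-parameter model can fit in ~12.8 GB, a back-of-envelope sketch of weight memory at different precisions helps (these figures are illustrative; real usage adds optimizer state, activations, and framework overhead on top of the bare weight size):

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed to hold n_params weights at the given precision, in GB."""
    return n_params * bits_per_param / 8 / 1e9

# bf16 weights for a 20B model: far too large for a single consumer GPU
print(weight_memory_gb(20e9, 16))  # → 40.0

# 4-bit quantized weights: the bulk of the ~12.8 GB figure, with the
# remainder going to activations, LoRA adapters, and overhead
print(weight_memory_gb(20e9, 4))   # → 10.0
```

This is why quantized training methods like QLoRA are what make consumer-GPU fine-tuning of models this size feasible at all.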

Technical Innovations

Automatic Backend Selection

Unsloth automatically selects the optimal backend for your hardware:

  • grouped_mm: PyTorch's native torch._grouped_mm function (optimized for H100s+, works on T4 through B200)
  • unsloth_triton: Custom Triton kernels achieving 2.5× speedup over grouped_mm on A100s, with auto-tuning that adds ~2 minutes overhead but yields 35% faster training runs
  • native_torch: Fallback for compatibility, though 12x slower
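torch._grouped_mm is a private PyTorch function whose exact signature may change between releases, but its semantics are easy to state: tokens are sorted by the expert they were routed to, and each contiguous slice is multiplied by that expert's weight matrix in one fused call. A NumPy reference of those semantics (names, shapes, and the offsets convention here are illustrative, not the real API):

```python
import numpy as np

def grouped_mm_reference(tokens, expert_weights, offsets):
    """Reference semantics of a grouped matmul for MoE layers.

    tokens:         (T, d_in) token activations, pre-sorted by expert
    expert_weights: (E, d_in, d_out), one weight matrix per expert
    offsets:        offsets[e] is the end index of expert e's token slice
    """
    out = np.empty((tokens.shape[0], expert_weights.shape[2]))
    start = 0
    for e, end in enumerate(offsets):
        # Each expert multiplies only its own slice of the token batch
        out[start:end] = tokens[start:end] @ expert_weights[e]
        start = end
    return out

rng = np.random.default_rng(0)
tokens = rng.standard_normal((10, 8))
weights = rng.standard_normal((3, 8, 4))
offsets = [4, 7, 10]  # expert 0 gets tokens 0:4, expert 1 gets 4:7, expert 2 gets 7:10
y = grouped_mm_reference(tokens, weights, offsets)
```

The fast backends implement this same computation as a single fused kernel instead of a Python loop, which is where the speedup comes from.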

Split LoRA Approach

The key innovation is the Split LoRA method for efficient MoE training, which reduces memory consumption by ~35% and delivers 2x faster training compared to Transformers v5 + torch._grouped_mm.
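The post does not spell out Split LoRA's internals, but as background, a minimal sketch of the standard LoRA math it builds on, with hypothetical dimensions: the frozen weight W is augmented by a trainable low-rank product B·A, so only a small fraction of parameters needs gradients and optimizer state.

```python
import numpy as np

d_in, d_out, rank = 2048, 768, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d_in, d_out))        # frozen base weight (no gradients)
A = rng.standard_normal((d_in, rank)) * 0.01  # trainable down-projection
B = np.zeros((rank, d_out))                   # trainable up-projection, zero-init
                                              # so the delta starts at exactly 0

x = rng.standard_normal((4, d_in))
y = x @ W + (x @ A) @ B  # LoRA forward: base output + low-rank correction

full_params = d_in * d_out           # 1,572,864 trainable params for full FT
lora_params = rank * (d_in + d_out)  # 45,056 trainable params (~2.9%)
```

With B initialized to zero, the adapted model starts out identical to the base model, and only the small A/B factors accumulate optimizer state during training, which is where the bulk of LoRA's memory savings comes from.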

Integration with Transformers

This work was done in collaboration with Hugging Face, standardizing MoE training with PyTorch's torch._grouped_mm. Transformers v5 reorganized expert weights from a ModuleList of individual experts to a single nn.Parameter, enabling grouped matrix multiplication for dramatically faster computation. Unsloth's custom kernels push performance even further on compatible hardware.
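The weight reorganization described above can be sketched in NumPy: a per-expert list of matrices (the ModuleList layout) is stacked into one contiguous tensor (the single nn.Parameter layout), after which all experts can be applied in a single batched matmul. Shapes and the toy routing here are illustrative:

```python
import numpy as np

n_experts, d_in, d_out = 8, 64, 32
rng = np.random.default_rng(0)

# Old layout (Transformers v4 style): one weight matrix per expert,
# analogous to an nn.ModuleList of separate Linear modules
expert_list = [rng.standard_normal((d_in, d_out)) for _ in range(n_experts)]

# New layout (Transformers v5 style): all experts fused into one tensor,
# analogous to a single nn.Parameter of shape (E, d_in, d_out)
fused = np.stack(expert_list)

# Toy routing: 5 tokens assigned to each expert
x = rng.standard_normal((n_experts, 5, d_in))

out_fused = x @ fused  # one batched matmul over every expert at once
out_loop = np.stack([x[e] @ expert_list[e] for e in range(n_experts)])
```

Both layouts compute the same result; the fused layout simply exposes the whole computation to one kernel launch instead of E separate ones, which is the prerequisite for grouped matrix multiplication.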

Getting Started

Users can upgrade via pip install --upgrade unsloth unsloth_zoo and access pre-built Colab notebooks for various models and configurations. Backend selection can be manually overridden via the UNSLOTH_MOE_BACKEND environment variable. Note: 4-bit QLoRA training for MoE models is not currently recommended due to BitsandBytes limitations; use bf16 for now.
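To pin a backend explicitly, the UNSLOTH_MOE_BACKEND variable from the post can be set before Unsloth is imported; a minimal sketch, assuming the backend names listed above are the accepted values (the import itself is commented out since it requires a GPU environment):

```python
import os

# Backend names from the post: "grouped_mm", "unsloth_triton", "native_torch".
# Set the override before importing unsloth so backend selection sees it.
os.environ["UNSLOTH_MOE_BACKEND"] = "unsloth_triton"

# import unsloth  # must come after the environment variable is set

print(os.environ["UNSLOTH_MOE_BACKEND"])  # → unsloth_triton
```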