Ollama
Ollama v0.17.8-rc2 optimizes MLX inference with layer norm and attention improvements
release · performance · bugfix · github.com ↗

MLX Backend Performance Optimizations

Ollama v0.17.8-rc2 focuses on improving the efficiency of its MLX (Apple Silicon) backend with two key optimizations:

Layer Normalization Improvements

The nn.go module now calls the native mlx_fast_layer_norm function instead of implementing layer normalization manually as a chain of individual operations (mean, subtract, variance, rsqrt, multiply, add). This consolidates six separate tensor operations into a single fused call, cutting per-operation dispatch overhead and intermediate allocations while leveraging MLX's optimized kernel.

Attention Mechanism Refinement

Updates to llama.go and gemma3.go remove the RepeatKV tensor-tiling step. Because MLX's scaled_dot_product_attention handles Grouped Query Attention (GQA) natively, manually tiling the key and value heads to match the query head count is redundant whenever n_q_heads % n_kv_heads == 0 (each KV head simply serves an equal-sized group of query heads). Dropping the extra tiling avoids materializing duplicated key/value tensors and improves inference efficiency.
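The equivalence is easy to check with a toy sketch. Below, one scalar stands in for each KV head, and the head counts are made up for illustration; the point is only that tiling K to the query head count (what RepeatKV did) and indexing the grouped K directly (what GQA-aware attention does internally) yield identical per-head scores.

```go
package main

import "fmt"

func main() {
	// Hypothetical head counts: 8 query heads sharing 2 KV heads,
	// so n_q_heads % n_kv_heads == 0 and each group has 4 query heads.
	nQ, nKV := 8, 2
	group := nQ / nKV

	// One scalar "key" per KV head keeps the example minimal.
	k := []float64{0.5, -1.25}

	// RepeatKV-style tiling: duplicate each KV head `group` times so
	// query and key heads line up one-to-one.
	tiled := make([]float64, 0, nQ)
	for _, v := range k {
		for i := 0; i < group; i++ {
			tiled = append(tiled, v)
		}
	}

	// One scalar "query" per query head.
	q := []float64{1, 2, 3, 4, 5, 6, 7, 8}

	// Per-head dot-product scores, computed both ways.
	for h := 0; h < nQ; h++ {
		withTiling := q[h] * tiled[h]      // after RepeatKV
		withGrouping := q[h] * k[h/group]  // grouped indexing, no tiling
		if withTiling != withGrouping {
			panic("mismatch")
		}
	}
	fmt.Println("grouped indexing matches RepeatKV tiling for all", nQ, "heads")
}
```

Since the grouped lookup `k[h/group]` reads the same value the tiled copy would hold, the duplicated K and V tensors were pure overhead in both memory and copy bandwidth.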

Action Items

Users running Ollama on Apple Silicon hardware should benefit from improved inference speed once this release candidate is promoted to stable. No configuration changes are required; the performance improvements apply automatically.