MLX Performance Optimizations
This release candidate includes targeted performance improvements to Ollama's MLX backend (built on Apple's MLX array framework for Apple Silicon):
Layer Normalization Optimization
The nn.go module now calls the native mlx_fast_layer_norm function instead of composing layer normalization from six separate primitives (mean, subtract, variance, reciprocal square root, multiply, add). Consolidating these into a single call reduces per-operation overhead and lets the MLX framework optimize the whole computation at a lower level.
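The six-step decomposition the backend previously performed can be sketched in numpy (a hedged illustration of the math only; the names `layer_norm_composed`, `weight`, and `bias` are assumptions for this sketch, not Ollama identifiers):

```python
import numpy as np

def layer_norm_composed(x, weight, bias, eps=1e-5):
    # The six primitive steps previously chained together:
    mu = x.mean(axis=-1, keepdims=True)                  # 1. mean
    centered = x - mu                                    # 2. subtract
    var = (centered ** 2).mean(axis=-1, keepdims=True)   # 3. variance
    inv_std = 1.0 / np.sqrt(var + eps)                   # 4. reciprocal square root
    scaled = centered * inv_std * weight                 # 5. multiply (scale)
    return scaled + bias                                 # 6. add (shift)

x = np.random.default_rng(0).normal(size=(2, 8))
out = layer_norm_composed(x, np.ones(8), np.zeros(8))
```

A fused implementation such as mlx_fast_layer_norm computes the same result in one kernel, avoiding the intermediate tensors that each of these six steps would otherwise materialize.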
Grouped Query Attention (GQA) Simplification
Both llama.go and gemma3.go have been updated to remove redundant RepeatKV operations that previously tiled Key and Value tensors to match the Query head count. Since the native scaled_dot_product_attention function already handles Grouped Query Attention (requiring only that n_q_heads % n_kv_heads == 0), the explicit tiling added unnecessary tensor operations to the inference path.
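A numpy sketch of why the tiling is redundant: computing attention with explicitly repeated K/V heads gives the same result as grouping the Query heads over the original K/V heads via broadcasting. This assumes the conventional contiguous grouping of query heads; the helper names here are illustrative, not Ollama code.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attn(q, k, v):
    # q, k, v: (heads, seq, head_dim) with matching head counts
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n_q_heads, n_kv_heads, seq, d = 8, 2, 4, 16
q = rng.normal(size=(n_q_heads, seq, d))
k = rng.normal(size=(n_kv_heads, seq, d))
v = rng.normal(size=(n_kv_heads, seq, d))

# Old path: tile K/V so their head count matches Q (the RepeatKV step).
rep = n_q_heads // n_kv_heads  # valid because n_q_heads % n_kv_heads == 0
out_repeated = attn(q, np.repeat(k, rep, axis=0), np.repeat(v, rep, axis=0))

# GQA-aware path: reshape Q into groups and broadcast K/V; no tiling, no copies.
qg = q.reshape(n_kv_heads, rep, seq, d)
scores = qg @ k[:, None].transpose(0, 1, 3, 2) / np.sqrt(d)
out_grouped = (softmax(scores) @ v[:, None]).reshape(n_q_heads, seq, d)

assert np.allclose(out_repeated, out_grouped)
```

The broadcast path never materializes the repeated K/V tensors, which is the saving a GQA-aware scaled_dot_product_attention provides internally.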
What Developers Need to Know
These changes are implementation-level optimizations and should be transparent to users running Ollama. The release candidate is tagged RC2, indicating it is nearing a stable release. Users testing the MLX backend should see faster inference, particularly on Apple Silicon hardware, where MLX is optimized.