Olmix: Solving Data Mixing at Scale
Modern language models are trained on heterogeneous data sources, but determining the optimal mix of these sources has traditionally relied on manual experimentation and guesswork. AI2's new Olmix framework addresses two persistent challenges in LM development:
- Lack of configuration guidance: Existing mixing approaches make inconsistent choices about model sizes, experiment counts, and regression techniques—with little concrete justification in the literature.
- Expensive recomputation: As training datasets evolve through development iterations, practitioners typically must recompute mixtures from scratch, which becomes a computational bottleneck.
OlmixBase: Empirically Grounded Configuration
The framework's first component, OlmixBase, provides research-validated defaults through systematic experimentation:
- Proxy model size: A 15M-parameter proxy strikes the best balance—large enough to achieve >0.89 rank correlation with target models, yet small enough to remain computationally cheap. Substantially smaller proxies (1M parameters) produce unreliable rankings (ρ = 0.73).
- Mixing cost scaling: Required proxy runs scale linearly with domain count (O(m) runs for m domains), giving practitioners a concrete compute allocation formula.
- Regression model: Log-linear regression provides consistent performance across settings while remaining competitive on downstream validation, replacing the ad-hoc model selection common in prior work.
- Data feasibility constraints: The method incorporates explicit repetition limits to prevent over-allocating scarce data types.
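The fit-then-optimize loop implied by these defaults can be sketched as follows. This is a minimal illustration, not Olmix's actual implementation: the proxy-run results, the three-domain setup, and the repetition cap of 0.25 on the scarce domain are all assumed for the example. A log-linear model (log of loss as a linear function of the mixture weights) is fit to proxy runs, then the predicted-best mixture is found on a grid of the simplex, subject to the feasibility cap:

```python
import numpy as np

# Hypothetical proxy-run results: each row is a mixture (weights over
# 3 domains, summing to 1) paired with the proxy model's validation loss.
mixtures = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.4, 0.4, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.6, 0.1],
])
losses = np.array([3.10, 2.95, 2.98, 3.25, 2.90])

# Log-linear fit: log(loss) ~ w . p, solved by least squares.
# (Weights sum to 1, so a separate intercept is unnecessary.)
w, *_ = np.linalg.lstsq(mixtures, np.log(losses), rcond=None)

def predicted_loss(p):
    return np.exp(w @ p)

# Search a coarse grid of the simplex for the lowest predicted loss,
# subject to a repetition cap: domain 2 is assumed scarce, so its
# weight may not exceed 0.25 (more would repeat its data too often).
cap = np.array([1.0, 1.0, 0.25])
best_p, best_loss = None, np.inf
grid = np.linspace(0, 1, 21)
for a in grid:
    for c in grid:
        p = np.array([a, c, 1 - a - c])
        if p[2] < 0 or np.any(p > cap):
            continue
        loss = predicted_loss(p)
        if loss < best_loss:
            best_p, best_loss = p, loss

print("best mixture:", best_p, "predicted loss:", best_loss)
```

In practice the grid search would be replaced by a constrained optimizer, but the structure is the same: a handful of proxy runs, a log-linear fit, and a feasibility-constrained argmin.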
Mixture Reuse for Iterative Development
The second component introduces mixture reuse—a technique for efficiently updating mixtures when datasets change. Rather than recomputing from scratch, unchanged domains are bundled into a "virtual domain" while only changed domains are recomputed. This approach reduces compute costs significantly during the iterative dataset refinement typical of real LM development.
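The bookkeeping behind mixture reuse can be sketched in a few lines. Everything here is illustrative: the five domain names, the previous weights, and the re-mixed two-domain split are assumed values, not results from the paper. Unchanged domains are collapsed into one virtual domain, the small two-domain problem is (notionally) re-mixed with proxy runs, and the virtual weight is then expanded back to its constituents in their previous proportions:

```python
# Hypothetical previous mixture over five domains; only "code" changed.
prev = {"web": 0.50, "books": 0.20, "wiki": 0.10, "math": 0.05, "code": 0.15}
changed = {"code"}

# Bundle unchanged domains into one "virtual" domain. The re-mixing
# problem now has 2 domains instead of 5, so it needs far fewer proxy
# runs (runs scale linearly with domain count).
unchanged = {d: wt for d, wt in prev.items() if d not in changed}
virtual_weight = sum(unchanged.values())  # 0.85

# Suppose re-running proxies on the reduced problem yields this split
# (an assumed result, for illustration):
new_small = {"virtual": 0.80, "code": 0.20}

# Expand the virtual domain back to its constituents, preserving their
# relative proportions from the previous mixture.
new_mix = {
    d: new_small["virtual"] * wt / virtual_weight for d, wt in unchanged.items()
}
new_mix["code"] = new_small["code"]

print(new_mix)
```

The key invariant is that the unchanged domains keep their relative proportions; only their collective share and the changed domain's share are re-optimized.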
Empirical Results
In experiments on DCLM data spanning 52 downstream tasks (math, code, commonsense QA), Olmix improved downstream performance by 12% and was 3x more data-efficient than a no-mixing baseline.
The framework is available as open-source code with an accompanying technical report.