Hugging Face launches community-driven model evaluations with decentralized benchmarking
· feature · platform · api · open-source · huggingface.co ↗

Decentralized Model Evaluation Now Live

Hugging Face is rolling out a new evaluation framework that addresses the fragmentation and lack of transparency in model benchmarking. Rather than leaving users to reconcile siloed leaderboards and competing claims across papers and platforms, the Hub now lets community members submit evaluation results transparently in one place.

How It Works

  • For Benchmarks: Dataset repositories can register as benchmarks (MMLU-Pro, GPQA, HLE are initial partners) and automatically aggregate reported results into leaderboards. Each benchmark defines its evaluation spec via eval.yaml using the Inspect AI format, making results reproducible.

  • For Models: Evaluation results are stored in .eval_results/*.yaml within model repositories, visible both on model cards and aggregated into benchmark leaderboards. Both official model author results and community pull requests are included.

  • For the Community: Any user can submit evaluation results via pull request for any model. Results appear immediately as "community" contributions while awaiting review, and users can link to supporting sources like papers, logs, or third-party platforms.
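The announcement names the `.eval_results/` directory but does not show the result-file schema. As an illustration only, a single submitted result might look like the following; every field name here is an assumption, not the Hub's actual format:

```yaml
# .eval_results/mmlu-pro.yaml -- illustrative sketch; field names are
# assumptions, not the Hub's documented schema.
benchmark: TIGER-Lab/MMLU-Pro        # the benchmark's dataset repository
metric: accuracy
score: 0.713
status: community                    # vs. an official model-author result
source: https://example.com/eval-log # paper, eval log, or third-party platform
```

Because the file lives in the model repository, the score carries full git history and attribution alongside the model itself.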

Why This Matters

The current benchmarking landscape has two critical problems: established benchmarks like MMLU (with top scores above 91%, effectively saturated) no longer differentiate models, and reported scores for the same model vary from source to source. Together these create a gap between benchmark numbers and real-world performance.

By exposing evaluation results through a transparent, git-based system with full history and attribution, the Hub addresses score fragmentation and enables the community to build curated leaderboards and dashboards via public APIs. The framework emphasizes reproducibility through open eval specs while acknowledging that it won't solve benchmark saturation or prevent test-set overfitting; what it does is make evaluation practices visible.
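The aggregation step described above (collecting per-repository results into per-benchmark leaderboards) can be sketched in a few lines. This is a rough illustration, not the Hub's implementation; the field names (`model`, `benchmark`, `score`, `status`) are assumptions:

```python
from collections import defaultdict

def build_leaderboards(results):
    """Group reported eval results by benchmark and rank by score.

    `results` is a list of dicts with keys: model, benchmark, score,
    and status ("author" or "community"). These field names are
    assumptions for illustration, not the Hub's actual schema.
    """
    boards = defaultdict(list)
    for r in results:
        boards[r["benchmark"]].append(r)
    # Rank each benchmark's entries from highest to lowest score.
    return {
        bench: sorted(entries, key=lambda r: r["score"], reverse=True)
        for bench, entries in boards.items()
    }

results = [
    {"model": "org/model-a", "benchmark": "MMLU-Pro", "score": 0.71, "status": "author"},
    {"model": "org/model-b", "benchmark": "MMLU-Pro", "score": 0.74, "status": "community"},
    {"model": "org/model-a", "benchmark": "GPQA", "score": 0.52, "status": "author"},
]
boards = build_leaderboards(results)
print([r["model"] for r in boards["MMLU-Pro"]])  # → ['org/model-b', 'org/model-a']
```

Because every entry keeps its `status` and source attribution, a downstream dashboard could filter to author-only results or weight community submissions differently.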

Getting Started

Users can publish evals as YAML files in any model repository's .eval_results/ directory, browse existing scores on benchmark dataset pages, or register new benchmarks by adding eval.yaml and contacting the team. The feature is currently in beta.
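A minimal sketch of the publishing step: write a result file into the repository's `.eval_results/` directory (the directory name comes from the announcement; the YAML fields, score, and repo path are illustrative assumptions):

```python
from pathlib import Path

# Hypothetical local checkout of a model repository.
repo = Path("my-model")
eval_dir = repo / ".eval_results"
eval_dir.mkdir(parents=True, exist_ok=True)

# Illustrative result file; field names are assumptions, not the
# Hub's documented schema.
result = "\n".join([
    "benchmark: MMLU-Pro",
    "metric: accuracy",
    "score: 0.713",
    "source: https://example.com/eval-log",
])
(eval_dir / "mmlu-pro.yaml").write_text(result + "\n")
```

From there you would commit and push the file, or, for a repository you don't own, open a pull request against it (for example with `huggingface_hub.upload_file(..., create_pr=True)`), after which the score appears as a "community" contribution pending review.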