Introduction
Hugging Face has introduced Storage Buckets, a new storage service purpose-built for the constant stream of intermediate ML artifacts that traditional Git-based version control handles poorly. Unlike Models and Datasets repositories, Buckets are designed for mutable, frequently changing files that require fast writes, overwrites, and efficient syncing, making them ideal for training checkpoints, optimizer states, processed data shards, and agent traces.
Key Features
Non-versioned S3-like Storage: Buckets provide familiar S3-style object storage with standard Hugging Face permissions (private or public), browsable through the web interface, and accessible via Python scripting or the hf CLI using handles like hf://buckets/username/my-training-bucket.
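As a rough sketch of how these handles compose, the helper below splits one into its parts. It is purely illustrative and not part of any official Hugging Face API:

```python
def parse_bucket_handle(handle: str) -> tuple[str, str, str]:
    """Split an hf://buckets/<namespace>/<bucket>[/<path>] handle
    into (namespace, bucket, path). Illustrative only — not an
    official API."""
    prefix = "hf://buckets/"
    if not handle.startswith(prefix):
        raise ValueError(f"not a bucket handle: {handle}")
    namespace, _, rest = handle[len(prefix):].partition("/")
    bucket, _, path = rest.partition("/")
    if not namespace or not bucket:
        raise ValueError(f"incomplete bucket handle: {handle}")
    return namespace, bucket, path

# The handle from the text above
print(parse_bucket_handle("hf://buckets/username/my-training-bucket"))
# → ('username', 'my-training-bucket', '')
```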
Content Deduplication via Xet: Built on Hugging Face's chunk-based storage backend, Buckets automatically deduplicate content across files. When uploading processed datasets similar to raw versions, or successive model checkpoints with frozen layers, Xet skips redundant chunks. This reduces bandwidth, accelerates transfers, and lowers storage costs—with Enterprise billing based on deduplicated storage footprint.
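The effect of chunk-level deduplication can be illustrated with a toy model. This sketch uses fixed-size chunks and SHA-256 hashes; Xet's real chunking is content-defined and considerably more sophisticated, but the principle — skip any chunk whose hash the store already knows — is the same:

```python
import hashlib

CHUNK_SIZE = 64 * 1024  # toy chunk size; illustrative only

def chunk_hashes(data: bytes) -> list[str]:
    """Hash each fixed-size chunk of a byte string."""
    return [
        hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
        for i in range(0, len(data), CHUNK_SIZE)
    ]

def bytes_to_upload(new: bytes, already_stored: set[str]) -> int:
    """Count the bytes that still need uploading: chunks whose
    hash is already stored are skipped."""
    needed = 0
    for i in range(0, len(new), CHUNK_SIZE):
        chunk = new[i:i + CHUNK_SIZE]
        if hashlib.sha256(chunk).hexdigest() not in already_stored:
            needed += len(chunk)
    return needed

# Two successive checkpoints that share most content (e.g. frozen layers)
ckpt_v1 = b"A" * CHUNK_SIZE * 10
ckpt_v2 = b"A" * CHUNK_SIZE * 9 + b"B" * CHUNK_SIZE  # one chunk changed

stored = set(chunk_hashes(ckpt_v1))
print(bytes_to_upload(ckpt_v2, stored))  # → 65536 (only the changed chunk)
```

Ten chunks of checkpoint data reduce to a single chunk of actual transfer, which is the mechanism behind the bandwidth and storage savings described above.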
Pre-warming for Distributed Workloads: The service includes pre-warming capabilities that bring hot data closer to your compute infrastructure. Partnerships with AWS and GCP allow teams to declare which regions need data access, ensuring datasets and checkpoints are co-located with training clusters before jobs start. This directly improves throughput for multi-region pipelines and large-scale training.
Getting Started
Users can set up a bucket in under two minutes using the CLI:
```shell
# Install the hf CLI
curl -LsSf https://hf.co/cli/install.sh | bash
# Authenticate with your Hugging Face account
hf auth login
# Create a new bucket
hf buckets create
```
The service is immediately available through the Hugging Face Hub, with integration into Python workflows and direct management from the CLI.
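From Python, a training loop could push each checkpoint to a bucket as it is written. The sketch below keeps the upload step abstract as a callable, since the exact Python upload API for buckets is not shown above; in practice that callable would wrap whatever the hf tooling provides for hf://buckets/... handles, and the username/bucket names are placeholders:

```python
from pathlib import Path
from typing import Callable

def save_and_sync(step: int, state: bytes, local_dir: Path,
                  upload: Callable[[Path, str], None]) -> str:
    """Write a checkpoint locally, then hand it to an uploader
    targeting a bucket handle. The uploader is left abstract;
    the handle format mirrors the one used in the text above."""
    local_dir.mkdir(parents=True, exist_ok=True)
    path = local_dir / f"checkpoint-{step}.bin"
    path.write_bytes(state)
    remote = f"hf://buckets/username/my-training-bucket/checkpoint-{step}.bin"
    upload(path, remote)  # real impl: hf CLI call or Python upload
    return remote

# Usage with a stand-in uploader that just records calls
calls = []
remote = save_and_sync(100, b"fake-optimizer-state", Path("/tmp/ckpts"),
                       lambda p, r: calls.append((p, r)))
print(remote)  # hf://buckets/username/my-training-bucket/checkpoint-100.bin
```

Because the destination is a mutable bucket rather than a versioned repo, overwriting checkpoint-100.bin on a retry is cheap, and Xet deduplication keeps repeated uploads of mostly unchanged state inexpensive.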