Hugging Face
Hugging Face launches Storage Buckets, S3-like object storage optimized for ML workloads
feature · platform · api · integration · huggingface.co

What's New

Hugging Face has launched Storage Buckets, a purpose-built storage solution for the constant stream of intermediate ML artifacts—checkpoints, optimizer states, processed data shards, logs, and traces—that production pipelines generate. Unlike Models and Datasets repos, which are optimized for versioned final artifacts, Buckets are mutable, S3-like object storage designed for rapid iteration and frequent overwrites.

Key Capabilities

Non-versioned, mutable storage: Buckets live under user or organization namespaces with standard Hugging Face permissions, support public/private access, and can be accessed programmatically via handles like hf://buckets/username/my-training-bucket. They come with a browsable Hub interface and full CLI support.
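The handle format above is the only addressing scheme the announcement shows. As a minimal illustration of how such a handle decomposes into namespace, bucket name, and object key, here is a small parser; the function itself is hypothetical, not an official API:

```python
# Illustrative parser for the hf://buckets/<namespace>/<name>[/key] handle
# format described above. parse_bucket_handle is a hypothetical helper,
# not part of any Hugging Face library.

from urllib.parse import urlparse

def parse_bucket_handle(handle: str) -> tuple[str, str, str]:
    """Split an hf:// bucket handle into (namespace, bucket, key)."""
    parsed = urlparse(handle)
    if parsed.scheme != "hf":
        raise ValueError(f"expected hf:// scheme, got {handle!r}")
    parts = [p for p in (parsed.netloc + parsed.path).split("/") if p]
    if len(parts) < 3 or parts[0] != "buckets":
        raise ValueError(f"expected hf://buckets/<namespace>/<name>, got {handle!r}")
    namespace, bucket, key = parts[1], parts[2], "/".join(parts[3:])
    return namespace, bucket, key

print(parse_bucket_handle("hf://buckets/username/my-training-bucket/ckpt/step_1000.pt"))
```

The key component is empty when the handle points at the bucket root, which matches how S3-style URIs are conventionally split.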

Content deduplication via Xet: Buckets leverage Hugging Face's Xet backend, which breaks files into chunks and deduplicates identical chunks across files. When uploading a processed dataset similar to the raw version, or storing successive checkpoints with frozen model layers, only new chunks are transferred. This reduces bandwidth, accelerates transfers, and lowers storage costs—especially valuable for Enterprise customers billed on deduplicated storage.
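The dedup mechanism can be sketched in a few lines. This is a simplified model, not the Xet implementation: it uses fixed-size chunks and an in-memory dict as a stand-in backend, whereas Xet uses content-defined chunking against remote storage.

```python
# Simplified model of chunk-level deduplication in the spirit of Xet.
# Fixed-size 64 KiB chunks and a local dict stand in for the real backend;
# all names here are illustrative.

import hashlib
import os

CHUNK_SIZE = 64 * 1024

def chunk_hashes(data: bytes) -> list[str]:
    return [hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]

def upload(store: dict, data: bytes) -> int:
    """Store only chunks not already present; return bytes transferred."""
    transferred = 0
    for i, h in enumerate(chunk_hashes(data)):
        if h not in store:
            store[h] = data[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE]
            transferred += len(store[h])
    return transferred

store = {}
ckpt_v1 = os.urandom(1024 * 1024)                        # 1 MiB checkpoint
sent_v1 = upload(store, ckpt_v1)                         # full upload
ckpt_v2 = ckpt_v1[:-CHUNK_SIZE] + os.urandom(CHUNK_SIZE) # last chunk changed
sent_v2 = upload(store, ckpt_v2)                         # only that chunk moves
print(sent_v1, sent_v2)
```

The second upload transfers a single chunk instead of the whole file, which is the effect described above for successive checkpoints with mostly unchanged weights.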

Pre-warming for data locality: The system supports pre-warming to bring frequently accessed data closer to compute. You can declare which cloud provider and region your workload needs, and Buckets ensure data is cached locally before jobs start. This is critical for distributed training and multi-region pipelines where data locality directly impacts throughput. AWS and GCP are supported initially, with more providers coming.

Getting Started

Users can create a bucket in seconds using the hf CLI:

hf auth login
hf buckets create my-training-bucket --private

The entry point documentation shows quick integration with training pipelines and data workflows.

Use Cases

Buckets address pain points across ML production: training clusters writing checkpoints mid-run, data pipelines iteratively processing raw datasets, and agents storing traces and knowledge graphs. Their mutable nature and deduplication make them ideal in scenarios where Git versioning overhead becomes a bottleneck.
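The mid-run checkpoint case boils down to an overwrite-in-place pattern: the same key is rewritten on every save, so the bucket holds only the latest state instead of an ever-growing version history. A minimal sketch, with a dict standing in for a bucket and all key names invented:

```python
# Sketch of the mutable-overwrite pattern for mid-run checkpoints: one
# stable key is rewritten each save, so storage stays flat regardless of
# run length. The dict is a stand-in for a bucket; key names are invented.

bucket = {}

def save_checkpoint(bucket: dict, step: int, state: bytes) -> None:
    # Overwrite in place -- no per-save version is retained.
    bucket["checkpoints/latest.pt"] = state
    bucket["checkpoints/latest.step"] = str(step).encode()

for step in range(0, 3000, 1000):
    save_checkpoint(bucket, step, f"weights-at-{step}".encode())

print(sorted(bucket))                      # only two keys after three saves
print(bucket["checkpoints/latest.step"])   # b'2000'
```

In a Git-backed repo each save would append a commit; here the deduplication described earlier additionally means each overwrite transfers only the chunks that actually changed.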