NVIDIA CMX: A Purpose-Built Storage Tier for Agentic AI
NVIDIA has introduced the CMX (Context Memory Storage) platform as a new tier in its Vera Rubin AI factory architecture. Built on NVIDIA BlueField-4 data processing units, the storage system is designed to address the scalability challenges posed by agentic AI workloads with million-token context windows and trillion-parameter models.
The Problem: Pressure on Memory Hierarchies
As AI models scale and context windows expand to millions of tokens, the Key-Value (KV) cache—the mechanism that preserves inference context—grows proportionally with context length. Organizations currently face a difficult choice: store the KV cache in scarce, expensive GPU high-bandwidth memory (HBM), or relegate it to general-purpose storage optimized for durability rather than low-latency AI workloads. Either path drives up power consumption, inflates cost per token, and leaves GPUs underutilized.
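The scale of the problem follows from straightforward arithmetic: KV cache size is linear in context length, so a million-token context multiplies per-token cost a million times over. The sketch below estimates the footprint for an illustrative 70B-class model with grouped-query attention; the layer count, head count, and head dimension are assumptions for illustration, not figures from NVIDIA's announcement.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size: one key and one value vector per
    KV head, per layer, per token (fp16 = 2 bytes per element)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

# Illustrative 70B-class model: 80 layers, 8 KV heads, head_dim 128.
per_token = kv_cache_bytes(80, 8, 128, seq_len=1)
print(f"{per_token / 1024:.0f} KiB per token")  # 320 KiB per token

full_context = kv_cache_bytes(80, 8, 128, seq_len=1_000_000)
print(f"{full_context / 1e9:.1f} GB for a 1M-token context")  # 327.7 GB
```

At roughly 328 GB for a single million-token session, one context alone can exceed the HBM of several GPUs, which is why a dedicated capacity tier becomes attractive.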
CMX Architecture and Performance
The CMX platform bridges this gap by providing a dedicated middle tier optimized for ephemeral, latency-sensitive inference context. Key capabilities include:
- 5x higher tokens-per-second (TPS) throughput compared to traditional storage
- 5x greater power efficiency for serving KV cache at scale
- Petabyte-scale storage enabling scalable KV cache reuse across multiple inference sessions
- NVIDIA Spectrum-X Ethernet integration for predictable, low-latency RDMA connectivity at gigascale
- Seamless GPU memory extension within a pod-level architecture, alongside NVIDIA BlueField-4 networking
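The middle tier described above behaves like a second level in a cache hierarchy: hot KV blocks stay in HBM, cold blocks spill to the larger context tier instead of being discarded, and a miss in both tiers forces an expensive prefill recomputation. The following sketch illustrates that lookup-and-spill pattern; the class, its methods, and the eviction policy are hypothetical, not NVIDIA's actual interface.

```python
from typing import Optional


class TieredKVCache:
    """Two-tier KV block store: scarce fast tier (HBM) backed by a
    large capacity tier (CMX-style context memory). Illustrative only."""

    def __init__(self, hbm_capacity_blocks: int):
        self.hbm: dict[str, bytes] = {}           # fast, scarce
        self.context_tier: dict[str, bytes] = {}  # large, slower
        self.capacity = hbm_capacity_blocks

    def get(self, block_hash: str) -> Optional[bytes]:
        if block_hash in self.hbm:
            return self.hbm[block_hash]       # hot hit: no transfer needed
        if block_hash in self.context_tier:
            block = self.context_tier[block_hash]
            self._promote(block_hash, block)  # pull back over the network
            return block
        return None                           # miss: caller must re-prefill

    def put(self, block_hash: str, block: bytes) -> None:
        self._promote(block_hash, block)

    def _promote(self, block_hash: str, block: bytes) -> None:
        if len(self.hbm) >= self.capacity:
            # Spill the oldest resident block down to the context tier
            # instead of discarding it, so later sessions can reuse it.
            victim, data = next(iter(self.hbm.items()))
            del self.hbm[victim]
            self.context_tier[victim] = data
        self.hbm[block_hash] = block
```

The design point the sketch highlights is the spill-not-drop eviction: because evicted blocks survive in the capacity tier, a returning session pays a transfer cost rather than a full prefill recomputation.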
Integration with Vera Rubin Platform
CMX operates within NVIDIA's Vera Rubin platform, which organizes AI infrastructure into compute, networking, and storage racks as configurable building blocks. The platform is complemented by orchestration tools such as NVIDIA Dynamo and NIXL, which coordinate context placement, KV block management, and workload scheduling across the memory hierarchy, enabling stateless sharing of context across AI nodes.
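A common scheme for the kind of cross-session KV block reuse described above is prefix hashing: prompts are split into fixed-size token blocks, each keyed by a hash of the entire token prefix up to that block, so any node can detect how much of a new prompt's context already exists in shared storage. The sketch below is a generic illustration of that idea under assumed block size and hashing choices; it is not the scheme NVIDIA's announcement specifies.

```python
import hashlib

BLOCK = 4  # tokens per KV block (illustrative; production systems use larger blocks)


def prefix_block_keys(tokens: list[int]) -> list[str]:
    """Key each full block by a hash of the whole prefix ending at it,
    so a key matches only when every preceding token also matches."""
    keys, h = [], hashlib.sha256()
    usable = len(tokens) - len(tokens) % BLOCK  # ignore a trailing partial block
    for i in range(0, usable, BLOCK):
        for t in tokens[i:i + BLOCK]:
            h.update(t.to_bytes(4, "little"))
        keys.append(h.copy().hexdigest())
    return keys


def reusable_blocks(tokens: list[int], store: set[str]) -> int:
    """Count leading blocks whose KV data already exists in the shared store;
    only this contiguous prefix can skip prefill recomputation."""
    count = 0
    for key in prefix_block_keys(tokens):
        if key not in store:
            break
        count += 1
    return count
```

Because each key commits to the full prefix, a stored block is only reused when the new request's context is byte-for-byte identical up to that point, which is what makes the sharing safely stateless across nodes.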
Developer Impact
This infrastructure advancement lets organizations efficiently handle agentic reasoning workloads that require persistent context across multiple turns and sessions. By relieving pressure on both GPU memory and traditional storage, it allows developers to build more complex multi-agent systems without sacrificing cost efficiency or throughput.