New Memory Tier for Long-Context AI Workloads
NVIDIA's CMX context memory storage platform addresses a critical bottleneck in modern AI inference: efficiently storing and serving Key-Value (KV) cache as context windows grow to millions of tokens. The system is integrated into NVIDIA's Vera Rubin platform as a dedicated infrastructure tier, sitting between GPU high-bandwidth memory (HBM) and general-purpose storage.
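The tiering idea above can be made concrete with a small sketch: serve a KV block from GPU HBM if it is resident, otherwise pull it back from the context tier, and fall back to recomputing the prefill on a full miss. All class and method names here are illustrative assumptions, not NVIDIA APIs, and the eviction policy is deliberately the simplest possible one.

```python
# Hypothetical sketch of the three-tier hierarchy the article describes:
# GPU HBM (fast, scarce) -> CMX context tier (large, shared) -> recompute.
# Names and policies are illustrative, not NVIDIA APIs.

class TieredKVCache:
    def __init__(self, hbm_capacity: int):
        self.hbm = {}            # hot KV blocks resident in GPU HBM
        self.cmx = {}            # warm KV blocks offloaded to the context tier
        self.hbm_capacity = hbm_capacity

    def get(self, block_id: str):
        """Serve a KV block from the fastest tier that holds it."""
        if block_id in self.hbm:
            return self.hbm[block_id]
        if block_id in self.cmx:
            # Promote: fetch over the RDMA-class fabric back into HBM.
            self._evict_if_full()
            self.hbm[block_id] = self.cmx[block_id]
            return self.hbm[block_id]
        return None              # miss: caller must recompute the prefill

    def put(self, block_id: str, kv_block):
        self._evict_if_full()
        self.hbm[block_id] = kv_block

    def _evict_if_full(self):
        # Simplest possible policy: demote the oldest HBM block to CMX
        # rather than discarding it, so later turns can reuse it.
        while len(self.hbm) >= self.hbm_capacity:
            victim = next(iter(self.hbm))
            self.cmx[victim] = self.hbm.pop(victim)
```

The key design point the sketch captures is that eviction from HBM demotes to the context tier instead of deleting, so a later turn of the same conversation pays a fabric fetch rather than a full prefill recompute.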
Architecture and Key Benefits
CMX leverages several NVIDIA technologies:
- BlueField-4 processors provide intelligent data processing and acceleration
- Spectrum-X Ethernet delivers low-latency, high-bandwidth RDMA connectivity for consistent access to shared KV cache
- NVIDIA DOCA framework enables optimized context placement and KV block management
Compared with traditional storage, the platform delivers 5x higher tokens-per-second (TPS) and 5x greater power efficiency, reducing cost-per-token for long-context inference while maximizing GPU utilization.
Solving Agentic AI Scaling Challenges
As AI systems evolve from stateless chatbots to complex, multi-turn agentic workflows running trillion-parameter models, the KV cache has become a critical form of long-term memory. Growing context windows (now reaching millions of tokens) create a dilemma: GPU HBM is scarce and expensive, while general-purpose storage tiers are optimized for durability rather than for ephemeral, latency-sensitive AI workloads.
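A back-of-envelope calculation shows why million-token contexts overflow HBM. The model dimensions below are assumptions for illustration (roughly in line with a 70B-class transformer using grouped-query attention), not figures from NVIDIA:

```python
# Back-of-envelope KV-cache sizing. Model dimensions are assumed
# (70B-class, grouped-query attention), not NVIDIA figures.

def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Factor of 2 accounts for the separate Key and Value tensors per layer;
    # dtype_bytes=2 assumes FP16/BF16 cache entries.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

print(f"{kv_cache_bytes(1_000_000) / 2**30:.0f} GiB per 1M-token context")
# -> 305 GiB per 1M-token context
```

At roughly 305 GiB for a single million-token context under these assumptions, even one conversation exceeds the HBM of any single GPU, and a fleet serving many concurrent agent sessions quickly reaches the petabyte scale the article cites.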
CMX fills this gap by providing petabyte-scale context storage specifically optimized for latency-sensitive, reusable inference context. The system enables efficient KV cache sharing across AI nodes within a pod-level architecture, reducing inference stalls and improving responsiveness for agentic, long-context workloads.
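One way to picture cross-node KV cache sharing is content addressing: if every node derives a block's key deterministically from the token prefix it covers, any node that has already prefilled that prefix publishes blocks other nodes can find. The scheme and names below are hypothetical, sketched for illustration; they are not the CMX protocol.

```python
# Illustrative sketch of sharing KV cache across inference nodes by
# content-addressing blocks with a hash of the token prefix they cover.
# The scheme, granularity, and names are assumptions, not the CMX protocol.
import hashlib

BLOCK_TOKENS = 16  # tokens covered by one KV block (assumed granularity)

def block_key(prefix_tokens):
    """Deterministic key for the KV block ending a token prefix, so any
    node that computed the same prefix derives the same key."""
    return hashlib.sha256(str(tuple(prefix_tokens)).encode()).hexdigest()

def reusable_prefix(prompt_tokens, shared_store):
    """Count how many leading tokens are covered by blocks already in
    the shared store, i.e. how much prefill this node can skip."""
    hits = 0
    for i in range(0, len(prompt_tokens) - BLOCK_TOKENS + 1, BLOCK_TOKENS):
        if block_key(prompt_tokens[: i + BLOCK_TOKENS]) not in shared_store:
            break
        hits += 1
    return hits * BLOCK_TOKENS
```

Because keys are derived from prefix content rather than from node identity, a request landing on any node in the pod can discover and reuse context another node produced, which is the stall-reduction effect described above.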
Integration with Vera Rubin AI Factories
CMX is deployed within the NVIDIA Vera Rubin platform's pod-level architecture, which organizes AI infrastructure into compute, networking, and storage racks as configurable building blocks. The platform supports coordination across the memory hierarchy using orchestration tools like NVIDIA Dynamo and NIXL, ensuring optimal context placement and workload scheduling.