Overview
The NVIDIA CMX context memory storage platform addresses a critical bottleneck in modern AI inference: managing the massive Key-Value (KV) cache that accumulates as context windows expand to millions of tokens and models scale toward trillions of parameters. CMX fills the gap between GPU high-bandwidth memory (HBM) and general-purpose storage, providing a specialized tier optimized for ephemeral, latency-sensitive context data at gigascale.
The Problem
As agentic AI systems evolve from stateless chatbots to complex, multi-turn workflows with long-term memory, KV cache capacity requirements grow linearly with sequence length. This puts unprecedented pressure on existing memory hierarchies. Organizations face an unattractive trade-off: store KV cache in scarce, expensive GPU HBM, or fall back on general-purpose storage tiers optimized for durability and data protection rather than AI inference access patterns. Both approaches degrade performance and raise cost per token.
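The linear-growth claim is easy to make concrete with back-of-envelope arithmetic: per-sequence KV cache size is keys plus values, across every layer and KV head, for every token. The sketch below uses illustrative dimensions for a 70B-class model with grouped-query attention (an assumed configuration, not tied to CMX or any specific model):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache size: keys + values (factor of 2) for every layer,
    KV head, and token, at the given element width (default fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 70B-class GQA config (assumed): 80 layers, 8 KV heads, head_dim 128.
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=1_000_000)
print(f"{size / 1e9:.1f} GB")  # 327.7 GB for a single million-token sequence
```

At roughly 320 KiB per token, one million-token sequence alone exceeds the HBM of any single GPU, before accounting for model weights or concurrent requests, which is the capacity gap a dedicated context tier targets.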
CMX Architecture & Capabilities
Built on the NVIDIA BlueField-4 processor within the Vera Rubin platform's STX rack, CMX establishes a purpose-built context memory tier with the following features:
- 5x higher tokens-per-second and 5x greater power efficiency compared to traditional storage solutions
- RDMA-accelerated, petabyte-scale context storage with low-latency, high-bandwidth connectivity via NVIDIA Spectrum-X Ethernet
- Seamless memory hierarchy extension that bridges GPU HBM and networked storage
- KV cache reuse and sharing across AI nodes, enabling stateless cache distribution at pod-level scale
- Integration with NVIDIA DOCA framework, Dynamo orchestration, and NIXL for coordinated cache placement and workload scheduling
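The cache-reuse idea in the list above can be sketched as a two-tier lookup keyed by a hash of the token prefix: a hit in the capacity tier promotes the blocks toward the GPU instead of recomputing prefill. This is a minimal toy model; the class, the key scheme, and the dictionary tiers are all hypothetical stand-ins, not the CMX, DOCA, Dynamo, or NIXL interfaces:

```python
import hashlib

def prefix_key(token_ids):
    """Stable cache key derived from a token-id prefix (hypothetical scheme)."""
    raw = ",".join(str(t) for t in token_ids).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

class TieredKVCache:
    """Toy two-tier KV cache: a small 'hbm' tier in front of a large context tier."""
    def __init__(self):
        self.hbm = {}           # stand-in for on-GPU KV blocks
        self.context_tier = {}  # stand-in for the shared context memory tier

    def put(self, token_ids, kv_blocks):
        # New prefixes land in the capacity tier, where any node can find them.
        self.context_tier[prefix_key(token_ids)] = kv_blocks

    def get(self, token_ids):
        key = prefix_key(token_ids)
        if key in self.hbm:               # hot hit: blocks already on-GPU
            return self.hbm[key]
        if key in self.context_tier:      # warm hit: promote into HBM
            self.hbm[key] = self.context_tier[key]
            return self.hbm[key]
        return None                       # miss: prefill must recompute
```

Because the key depends only on the token prefix, any node in a pod that sees the same prefix can retrieve the same blocks, which is what makes the cache "stateless" from the perspective of individual GPU nodes.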
Key Benefits
CMX minimizes inference stalls for long-context and agentic workloads while maximizing GPU utilization and reducing total cost of ownership. By holding reusable inference context and pre-staging it to GPUs, CMX enables higher throughput and responsiveness in production AI factories without requiring additional GPU HBM or sacrificing performance.
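The pre-staging idea above amounts to overlapping the fetch of the next request's stored context with decoding of the current one, so the GPU never stalls waiting on the context tier. A minimal sketch, assuming a simple dictionary as the context store and a function-call stand-in for the decode step (all names are illustrative, not a CMX API):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_context(request_id, store):
    """Stand-in for pulling a reusable KV prefix from the context tier."""
    return store[request_id]

def serve(requests, store):
    """Process requests in order, pre-staging each next request's context
    while the current one is being decoded."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_context, requests[0], store)
        for i, req in enumerate(requests):
            ctx = pending.result()            # context already staged
            if i + 1 < len(requests):         # overlap next fetch with decode
                pending = pool.submit(fetch_context, requests[i + 1], store)
            results.append((req, ctx))        # stand-in for the decode step
    return results
```

In a real deployment the fetch would be an RDMA transfer and the decode a GPU kernel launch, but the scheduling pattern, issuing the next transfer before the current compute finishes, is the same.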
Integration with Vera Rubin
CMX is positioned as a core building block within the NVIDIA Vera Rubin platform, which organizes AI infrastructure into compute, networking, and storage racks serving as configurable components for large-scale AI factories. It complements Vera Rubin's support for the full AI lifecycle—from pretraining and post-training through test-time scaling and real-time agentic inference.