New Context Memory Storage Tier for AI Inference
NVIDIA announced the CMX Context Memory Storage platform, a purpose-built storage infrastructure for the growing demands of long-context agentic AI workloads. Integrated into the Vera Rubin AI platform, CMX serves as an optimized middle tier between GPU high-bandwidth memory (HBM) and general-purpose storage, specifically engineered to handle ephemeral, latency-sensitive KV cache at petabyte scale.
Key Performance Improvements
The platform delivers significant performance and efficiency gains:
- 5x higher tokens-per-second (TPS) compared to traditional storage
- 5x greater power efficiency than conventional storage solutions
- RDMA-accelerated context storage with predictable, low-latency connectivity via NVIDIA Spectrum-X Ethernet
- Seamless integration with the NVIDIA Vera Rubin architecture
Technical Architecture
CMX is powered by NVIDIA BlueField-4 data processing units and leverages the NVIDIA DOCA framework for coordination across the memory hierarchy. The platform enables:
- Stateless KV cache sharing across AI inference nodes
- Efficient context placement and block management through orchestration tools like NVIDIA Dynamo and NIXL
- Seamless extension of GPU memory across the POD infrastructure
- Optimized handling of ephemeral, AI-native workloads that traditional durability-focused storage is not designed for
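To make the tiering idea concrete, here is a minimal sketch of how a KV cache might be placed across a memory hierarchy like the one described above: GPU HBM as the hot tier, a CMX-like context tier in the middle, and general-purpose storage as the cold tier. All class and method names are hypothetical; NVIDIA has not published CMX APIs, and real placement is handled by orchestration layers such as Dynamo and NIXL.

```python
# Hypothetical three-tier KV-cache placement sketch. Names and policy
# (LRU demotion) are illustrative assumptions, not CMX's actual design.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_capacity, ctx_capacity):
        self.hbm = OrderedDict()   # hot tier: GPU HBM (fast, scarce)
        self.ctx = OrderedDict()   # middle tier: context storage (CMX-like)
        self.cold = {}             # cold tier: general-purpose storage
        self.hbm_capacity = hbm_capacity
        self.ctx_capacity = ctx_capacity

    def put(self, block_id, kv_block):
        """Place a KV block in HBM, demoting LRU blocks down the tiers."""
        self.hbm[block_id] = kv_block
        self.hbm.move_to_end(block_id)
        while len(self.hbm) > self.hbm_capacity:
            bid, blk = self.hbm.popitem(last=False)  # evict LRU from HBM
            self.ctx[bid] = blk                      # demote to context tier
        while len(self.ctx) > self.ctx_capacity:
            bid, blk = self.ctx.popitem(last=False)  # spill to cold storage
            self.cold[bid] = blk

    def get(self, block_id):
        """Fetch a block, promoting it back to HBM on a lower-tier hit."""
        for tier in (self.hbm, self.ctx, self.cold):
            if block_id in tier:
                blk = tier.pop(block_id)
                self.put(block_id, blk)  # promote on access
                return blk
        return None  # cache miss: the block must be recomputed
```

The design choice this illustrates is the one the announcement emphasizes: KV blocks are ephemeral, so eviction to a fast middle tier (rather than durable, general-purpose storage) keeps promotion latency low enough to avoid inference stalls.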
Use Case: Agentic AI and Long-Context Inference
As models scale toward trillions of parameters and context windows expand to millions of tokens, KV cache reuse becomes critical for both performance and cost efficiency. CMX enables agents to maintain long-term memory across multiple turns, sessions, and tools without exhausting scarce GPU HBM or falling back on inefficient general-purpose storage. This is particularly important for agentic scaling workflows where context persists and is continuously reused rather than discarded after each request.
Developers and enterprises can now implement truly scalable long-context inference infrastructure with improved GPU utilization, reduced inference stalls, and lower total cost of ownership.