New Storage Tier for AI Inference Context
NVIDIA has announced the CMX Context Memory Storage platform, a purpose-built storage tier for managing Key-Value (KV) cache in large-scale AI inference. Integrated into the Vera Rubin platform, CMX addresses a critical bottleneck in agentic AI systems where context windows reach millions of tokens and models scale to trillions of parameters.
The KV Cache Challenge
As AI models evolve from stateless chatbots to complex, multi-turn agentic workflows, managing inference context becomes increasingly critical. The KV cache—which stores the attention keys and values computed for previous tokens so they need not be recomputed at every decoding step—grows linearly with context length. Traditional memory hierarchies force operators to choose between scarce GPU HBM and general-purpose storage, neither of which is optimized for ephemeral, latency-sensitive AI workloads. This drives up power consumption, increases cost-per-token, and leaves expensive GPUs underutilized.
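To see why this matters at million-token scale, a back-of-envelope estimate helps: the KV cache holds one key and one value vector per token, per layer, per KV head. The sketch below uses hypothetical model dimensions (80 layers, 8 grouped-query KV heads, head dimension 128, FP16) chosen only for illustration, not taken from any specific model.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Approximate KV cache size in bytes for a single sequence.

    Each token stores one key and one value vector (the factor of 2)
    per layer per KV head, at dtype_bytes per element.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return seq_len * per_token

# Illustrative 70B-class configuration (hypothetical numbers):
gib = kv_cache_bytes(1_000_000, 80, 8, 128, 2) / 2**30
print(f"{gib:.1f} GiB for a 1M-token context")  # → 305.2 GiB
```

A single million-token session on this toy configuration already exceeds the HBM of any current GPU, which is why spilling reusable context to a dedicated capacity tier is attractive.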
CMX Architecture and Performance
Powered by the NVIDIA BlueField-4 processor, CMX establishes an intermediate storage tier optimized for reusable inference context. Key capabilities include:
- 5x higher tokens-per-second (TPS) compared to traditional storage
- 5x greater power efficiency for context serving
- RDMA-accelerated connectivity via NVIDIA Spectrum-X Ethernet for low-latency, predictable access to shared KV cache
- Petabyte-scale capacity for managing long-context, agentic workloads
- Seamless integration with the Vera Rubin platform's pod-level architecture
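The tiering idea behind an intermediate context store can be sketched in a few lines. The toy class below models a small, fast "HBM" tier that spills least-recently-used KV blocks to a large-capacity tier instead of discarding them, so a later request can fetch them back rather than recompute. All names and the eviction policy are illustrative assumptions, not NVIDIA's implementation.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV cache: a small LRU 'hot' tier (standing in for
    GPU HBM) spills evicted blocks to an unbounded 'capacity' tier
    (standing in for a context storage tier). Hypothetical sketch."""

    def __init__(self, hot_capacity):
        self.hot = OrderedDict()   # block_id -> kv_block, LRU-ordered
        self.cold = {}             # block_id -> kv_block, capacity tier
        self.hot_capacity = hot_capacity

    def put(self, block_id, kv_block):
        self.hot[block_id] = kv_block
        self.hot.move_to_end(block_id)
        while len(self.hot) > self.hot_capacity:
            # Spill the least-recently-used block instead of dropping it.
            evicted_id, evicted = self.hot.popitem(last=False)
            self.cold[evicted_id] = evicted

    def get(self, block_id):
        if block_id in self.hot:
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        if block_id in self.cold:
            # Reuse hit: promote the block back to the hot tier.
            kv_block = self.cold.pop(block_id)
            self.put(block_id, kv_block)
            return kv_block
        return None  # true miss: caller must recompute the block
```

The key design point is the difference between the two miss paths: a capacity-tier hit costs one network fetch, while a true miss costs a full prefill recomputation on the GPU.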
Integration and Use Cases
NVIDIA's orchestration software (DOCA, Dynamo, and NIXL) coordinates CMX, managing context placement, KV block allocation, and workload scheduling across the memory hierarchy. This enables:
- Efficient KV cache reuse across multiple inference requests and sessions
- Reduced inference latency for agentic systems with long-term memory requirements
- Stateless sharing of KV cache across AI nodes within a pod
- Improved GPU utilization and overall throughput in AI factories
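Cross-request KV reuse typically hinges on recognizing shared prompt prefixes: two sessions that begin with the same system prompt or conversation history can serve those tokens' KV blocks from cache. A common way to detect this, sketched below with hypothetical block size and hashing choices (not a description of Dynamo's actual scheme), is to key each fixed-size KV block by a hash of the entire token prefix it covers.

```python
import hashlib

BLOCK = 4  # tokens per KV block (illustrative; real systems use larger blocks)

def block_keys(tokens):
    """Content-addressed keys for complete KV blocks.

    Each key hashes the full token prefix up to the end of that block,
    so two requests sharing a prompt prefix produce identical keys for
    the shared blocks and can reuse the cached KV entries.
    """
    keys = []
    for end in range(BLOCK, len(tokens) + 1, BLOCK):
        prefix = ",".join(map(str, tokens[:end]))
        keys.append(hashlib.sha256(prefix.encode()).hexdigest()[:16])
    return keys

shared_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
a = block_keys(shared_prompt + [9, 10, 11, 12])
b = block_keys(shared_prompt + [13, 14, 15, 16])
# The two blocks covering the shared prefix match; only the last differs.
print(sum(x == y for x, y in zip(a, b)))  # → 2
```

Hashing the whole prefix (rather than each block's tokens in isolation) is deliberate: a KV block is only valid if everything before it matches too, since attention state depends on the full preceding context.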
The platform targets organizations running agentic AI workflows requiring persistent context across multi-turn interactions, tool invocations, and extended reasoning sessions.