New Storage Tier for AI Inference Context
NVIDIA has announced the CMX Context Memory Storage platform, a purpose-built storage tier for managing Key-Value (KV) cache in large-scale AI inference. Integrated into the Vera Rubin platform, CMX addresses a critical bottleneck in agentic AI systems where context windows reach millions of tokens and models scale to trillions of parameters.
The KV Cache Challenge
As AI models evolve from stateless chatbots to complex, multi-turn agentic workflows, managing inference context becomes increasingly critical. The KV cache—which stores the attention keys and values computed for previous tokens so they need not be recomputed at every decoding step—grows linearly with context length. Traditional memory hierarchies force operators to choose between scarce GPU HBM and general-purpose storage, neither of which is optimized for ephemeral, latency-sensitive AI workloads. This drives up power consumption, increases cost-per-token, and leaves expensive GPUs underutilized.
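To see why this matters at million-token scale, a back-of-envelope estimate helps: the KV cache holds one key and one value vector per token, per layer, per KV head. The sketch below uses hypothetical model dimensions (80 layers, 8 grouped-query KV heads, head dimension 128, FP16) chosen only for illustration, not taken from any specific model.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Approximate KV cache size in bytes for a single sequence.

    Each token stores one key and one value vector (the factor of 2)
    per layer per KV head, at dtype_bytes per element.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return seq_len * per_token

# Illustrative 70B-class configuration (hypothetical numbers):
gib = kv_cache_bytes(1_000_000, 80, 8, 128, 2) / 2**30
print(f"{gib:.1f} GiB for a 1M-token context")  # → 305.2 GiB
```

A single million-token session on this toy configuration already exceeds the HBM of any current GPU, which is why spilling reusable context to a dedicated capacity tier is attractive.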
CMX Architecture and Performance
Powered by the NVIDIA BlueField-4 processor, CMX establishes an intermediate storage tier optimized for reusable inference context. Key capabilities include:
- 5x higher tokens-per-second (TPS) compared to traditional storage
- 5x greater power efficiency for context serving
- RDMA-accelerated connectivity via NVIDIA Spectrum-X Ethernet for low-latency, predictable access to shared KV cache
- Petabyte-scale capacity for managing long-context, agentic workloads
- Seamless integration with the Vera Rubin platform's pod-level architecture
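The tiering idea behind an intermediate context store can be sketched in a few lines. The toy class below models a small, fast "HBM" tier that spills least-recently-used KV blocks to a large-capacity tier instead of discarding them, so a later request can fetch them back rather than recompute. All names and the eviction policy are illustrative assumptions, not NVIDIA's implementation.

```python
from collections import OrderedDict

class TieredKVCache:
    """Toy two-tier KV cache: a small LRU 'hot' tier (standing in for
    GPU HBM) spills evicted blocks to an unbounded 'capacity' tier
    (standing in for a context storage tier). Hypothetical sketch."""

    def __init__(self, hot_capacity):
        self.hot = OrderedDict()   # block_id -> kv_block, LRU-ordered
        self.cold = {}             # block_id -> kv_block, capacity tier
        self.hot_capacity = hot_capacity

    def put(self, block_id, kv_block):
        self.hot[block_id] = kv_block
        self.hot.move_to_end(block_id)
        while len(self.hot) > self.hot_capacity:
            # Spill the least-recently-used block instead of dropping it.
            evicted_id, evicted = self.hot.popitem(last=False)
            self.cold[evicted_id] = evicted

    def get(self, block_id):
        if block_id in self.hot:
            self.hot.move_to_end(block_id)
            return self.hot[block_id]
        if block_id in self.cold:
            # Reuse hit: promote the block back to the hot tier.
            kv_block = self.cold.pop(block_id)
            self.put(block_id, kv_block)
            return kv_block
        return None  # true miss: caller must recompute the block
```

The key design point is the difference between the two miss paths: a capacity-tier hit costs one network fetch, while a true miss costs a full prefill recomputation on the GPU.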
Integration and Use Cases
NVIDIA's orchestration software (DOCA, Dynamo, and NIXL) coordinates CMX, managing context placement, KV block allocation, and workload scheduling across the memory hierarchy. This enables:
- Efficient KV cache reuse across multiple inference requests and sessions
- Reduced inference latency for agentic systems with long-term memory requirements
- Stateless sharing of KV cache across AI nodes within a pod
- Improved GPU utilization and overall throughput in AI factories
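Cross-request KV reuse typically hinges on recognizing shared prompt prefixes: two sessions that begin with the same system prompt or conversation history can serve those tokens' KV blocks from cache. A common way to detect this, sketched below with hypothetical block size and hashing choices (not a description of Dynamo's actual scheme), is to key each fixed-size KV block by a hash of the entire token prefix it covers.

```python
import hashlib

BLOCK = 4  # tokens per KV block (illustrative; real systems use larger blocks)

def block_keys(tokens):
    """Content-addressed keys for complete KV blocks.

    Each key hashes the full token prefix up to the end of that block,
    so two requests sharing a prompt prefix produce identical keys for
    the shared blocks and can reuse the cached KV entries.
    """
    keys = []
    for end in range(BLOCK, len(tokens) + 1, BLOCK):
        prefix = ",".join(map(str, tokens[:end]))
        keys.append(hashlib.sha256(prefix.encode()).hexdigest()[:16])
    return keys

shared_prompt = [1, 2, 3, 4, 5, 6, 7, 8]
a = block_keys(shared_prompt + [9, 10, 11, 12])
b = block_keys(shared_prompt + [13, 14, 15, 16])
# The two blocks covering the shared prefix match; only the last differs.
print(sum(x == y for x, y in zip(a, b)))  # → 2
```

Hashing the whole prefix (rather than each block's tokens in isolation) is deliberate: a KV block is only valid if everything before it matches too, since attention state depends on the full preceding context.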
The platform targets organizations running agentic AI workflows requiring persistent context across multi-turn interactions, tool invocations, and extended reasoning sessions.