Overview
NVIDIA has introduced CMX, a new context memory storage platform designed to address the scaling challenges of modern agentic AI systems. Built on the NVIDIA BlueField-4 processor and integrated into the Vera Rubin platform, CMX creates a dedicated storage tier optimized for serving the Key-Value (KV) cache: the stored attention keys and values that preserve a transformer model's context across inference steps so it does not have to be recomputed.
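To make the KV cache concrete, here is a minimal single-head sketch (not NVIDIA's implementation, and much simplified): each decode step appends one key/value row to a per-sequence cache, so attention over the full prefix reuses stored context instead of recomputing it.

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for one query vector over cached K/V."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

class KVCache:
    """Toy per-sequence KV cache: each decode step appends one K/V row
    rather than recomputing keys and values for the whole prefix."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def append(self, k, v):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])

rng = np.random.default_rng(0)
d = 4
cache = KVCache(d)
for _ in range(3):                      # three decode steps
    k, v = rng.normal(size=(2, d))
    cache.append(k, v)
q = rng.normal(size=d)
out = attend(q, cache.K, cache.V)       # prefix context reused, not recomputed
```

The cache grows linearly with sequence length, which is exactly why long contexts put pressure on GPU memory.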
The Problem: Growing Context Requirements
As AI models scale toward trillions of parameters and context windows expand to millions of tokens, organizations face mounting pressure on memory hierarchies. Traditional approaches force a painful tradeoff: expensive GPU high-bandwidth memory (HBM) with limited capacity, or general-purpose storage tiers designed for durability rather than latency-sensitive inference workloads. This gap leaves GPUs underutilized and drives up both power consumption and cost-per-token.
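Rough arithmetic shows why this pressure mounts. Using an illustrative 70B-class configuration (the layer count, grouped-KV head count, head dimension, and FP16 precision below are assumptions, not a specific model), a single million-token context already exceeds the HBM capacity of any current GPU:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to hold K and V for one sequence (factor 2 = K plus V)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed configuration: 80 layers, 8 grouped KV heads, head_dim 128, FP16.
per_seq = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=1_000_000)
print(per_seq / 1e9)   # 327.68 GB for one million-token sequence
```

Multiply that by concurrent sessions and the case for a dedicated capacity tier below HBM becomes clear.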
CMX Architecture and Performance
CMX bridges this gap by providing:
- Petabyte-scale context storage: Purpose-built for ephemeral, latency-sensitive KV cache serving at scale
- RDMA acceleration: NVIDIA Spectrum-X Ethernet enables predictable, low-latency connectivity for consistent data access
- 5x performance improvement: Up to 5x higher tokens-per-second (TPS) and 5x better power efficiency versus traditional storage
- Seamless GPU memory extension: Integrates into the Vera Rubin pod-level architecture across compute, networking, and storage racks
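The memory-extension idea above can be sketched as a two-tier block store: a small, hot "HBM" tier spills its coldest KV blocks to a large capacity tier and fetches them back on demand. This is a toy model, not CMX's actual design; the block granularity, tier sizes, and LRU eviction policy are all assumptions.

```python
from collections import OrderedDict

class TieredKVStore:
    """Toy two-tier KV block store loosely mirroring a GPU-memory-extension
    scheme: a small hot tier backed by a large context tier."""
    def __init__(self, hbm_blocks):
        self.hbm = OrderedDict()   # hot tier, kept in LRU order
        self.context = {}          # capacity tier (CMX-like role)
        self.hbm_blocks = hbm_blocks

    def put(self, block_id, kv):
        self.hbm[block_id] = kv
        self.hbm.move_to_end(block_id)
        while len(self.hbm) > self.hbm_blocks:       # spill coldest block
            evicted, data = self.hbm.popitem(last=False)
            self.context[evicted] = data

    def get(self, block_id):
        if block_id in self.hbm:
            self.hbm.move_to_end(block_id)
            return self.hbm[block_id]
        kv = self.context.pop(block_id)              # stand-in for a network fetch
        self.put(block_id, kv)                       # promote back to hot tier
        return kv

store = TieredKVStore(hbm_blocks=2)
for i in range(4):
    store.put(i, f"kv-block-{i}")
# Blocks 0 and 1 have spilled to the context tier; fetching 0 promotes it back.
block = store.get(0)
```

In a real deployment the `context.pop` line would be an RDMA read over the fabric, which is why predictable network latency matters as much as capacity.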
Integration and Orchestration
The platform leverages existing NVIDIA ecosystem tools:
- NVIDIA DOCA framework: Manages data processing and context handling
- NVIDIA Dynamo and NIXL: Coordinate KV block management, context placement, and workload scheduling across the memory hierarchy
- Stateless KV cache sharing: Enables efficient reuse of cached context across AI nodes without duplication
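One way to picture stateless sharing is content addressing: any node that sees the same token prefix derives the same cache key, so the corresponding KV blocks can live once in the shared tier and be reused everywhere. The keying scheme and helper names below are hypothetical, included only to illustrate the reuse-without-duplication idea.

```python
import hashlib

def prefix_key(token_ids):
    """Content-addressed key for a token prefix: identical prefixes on
    different nodes map to the same key (keying scheme is an assumption)."""
    return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

shared_store = {}   # stands in for the shared context tier

def get_or_compute(token_ids, compute_kv):
    key = prefix_key(token_ids)
    if key not in shared_store:           # first node pays the prefill cost
        shared_store[key] = compute_kv(token_ids)
    return shared_store[key]              # later nodes reuse it, no copies

calls = []
def compute(toks):
    calls.append(toks)                    # track how often prefill runs
    return f"kv-for-{len(toks)}-tokens"

prompt = [101, 2023, 2003, 1037]          # same system prompt on two nodes
a = get_or_compute(prompt, compute)
b = get_or_compute(prompt, compute)
```

Here the prefill runs once and both lookups return the same cached entry, which is the economic win of cross-node cache reuse.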
Developer Impact
CMX enables AI providers to efficiently scale agentic, long-context workloads by decoupling memory-intensive KV storage from GPU compute. Developers and infrastructure teams can now build systems that maximize GPU utilization while reducing operational costs—critical for emerging agentic AI applications that maintain multi-turn conversations and persistent agent memory across extended sessions.