Overview
NVIDIA has introduced CMX, a new context memory storage platform designed to address the scaling challenges of modern agentic AI systems. Built on the NVIDIA BlueField-4 processor and integrated into the Vera Rubin platform, CMX creates a dedicated storage tier optimized for serving the Key-Value (KV) cache: the stored attention keys and values that preserve a transformer model's context across inference steps so it does not have to be recomputed.
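To make the KV cache concrete, here is a minimal single-head sketch (not NVIDIA's implementation, and much simplified): each decode step appends one key/value row to a per-sequence cache, so attention over the full prefix reuses stored context instead of recomputing it.

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for one query vector over cached K/V."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

class KVCache:
    """Toy per-sequence KV cache: each decode step appends one K/V row
    rather than recomputing keys and values for the whole prefix."""
    def __init__(self, d):
        self.K = np.empty((0, d))
        self.V = np.empty((0, d))

    def append(self, k, v):
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])

rng = np.random.default_rng(0)
d = 4
cache = KVCache(d)
for _ in range(3):                      # three decode steps
    k, v = rng.normal(size=(2, d))
    cache.append(k, v)
q = rng.normal(size=d)
out = attend(q, cache.K, cache.V)       # prefix context reused, not recomputed
```

The cache grows linearly with sequence length, which is exactly why long contexts put pressure on GPU memory.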
The Problem: Growing Context Requirements
As AI models scale toward trillions of parameters and context windows expand to millions of tokens, organizations face mounting pressure on memory hierarchies. Traditional approaches force a painful tradeoff: expensive GPU high-bandwidth memory (HBM) with limited capacity, or general-purpose storage tiers designed for durability rather than latency-sensitive inference workloads. This gap leaves GPUs underutilized and drives up both power consumption and cost-per-token.
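Rough arithmetic shows why this pressure mounts. Using an illustrative 70B-class configuration (the layer count, grouped-KV head count, head dimension, and FP16 precision below are assumptions, not a specific model), a single million-token context already exceeds the HBM capacity of any current GPU:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to hold K and V for one sequence (factor 2 = K plus V)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed configuration: 80 layers, 8 grouped KV heads, head_dim 128, FP16.
per_seq = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=1_000_000)
print(per_seq / 1e9)   # 327.68 GB for one million-token sequence
```

Multiply that by concurrent sessions and the case for a dedicated capacity tier below HBM becomes clear.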
CMX Architecture and Performance
CMX bridges this gap by providing:
- Petabyte-scale context storage: Purpose-built for ephemeral, latency-sensitive KV cache serving at scale
- RDMA acceleration: NVIDIA Spectrum-X Ethernet enables predictable, low-latency connectivity for consistent data access
- 5x performance improvement: Up to 5x higher tokens-per-second (TPS) and 5x better power efficiency versus traditional storage
- Seamless GPU memory extension: Integrates into the Vera Rubin pod-level architecture across compute, networking, and storage racks
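The memory-extension idea above can be sketched as a two-tier block store: a small, hot "HBM" tier spills its coldest KV blocks to a large capacity tier and fetches them back on demand. This is a toy model, not CMX's actual design; the block granularity, tier sizes, and LRU eviction policy are all assumptions.

```python
from collections import OrderedDict

class TieredKVStore:
    """Toy two-tier KV block store loosely mirroring a GPU-memory-extension
    scheme: a small hot tier backed by a large context tier."""
    def __init__(self, hbm_blocks):
        self.hbm = OrderedDict()   # hot tier, kept in LRU order
        self.context = {}          # capacity tier (CMX-like role)
        self.hbm_blocks = hbm_blocks

    def put(self, block_id, kv):
        self.hbm[block_id] = kv
        self.hbm.move_to_end(block_id)
        while len(self.hbm) > self.hbm_blocks:       # spill coldest block
            evicted, data = self.hbm.popitem(last=False)
            self.context[evicted] = data

    def get(self, block_id):
        if block_id in self.hbm:
            self.hbm.move_to_end(block_id)
            return self.hbm[block_id]
        kv = self.context.pop(block_id)              # stand-in for a network fetch
        self.put(block_id, kv)                       # promote back to hot tier
        return kv

store = TieredKVStore(hbm_blocks=2)
for i in range(4):
    store.put(i, f"kv-block-{i}")
# Blocks 0 and 1 have spilled to the context tier; fetching 0 promotes it back.
block = store.get(0)
```

In a real deployment the `context.pop` line would be an RDMA read over the fabric, which is why predictable network latency matters as much as capacity.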
Integration and Orchestration
The platform leverages existing NVIDIA ecosystem tools:
- NVIDIA DOCA framework: Manages data processing and context handling
- NVIDIA Dynamo and NIXL: Coordinate KV block management, context placement, and workload scheduling across the memory hierarchy
- Stateless KV cache sharing: Enables efficient reuse of cached context across AI nodes without duplication
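One way to picture stateless sharing is content addressing: any node that sees the same token prefix derives the same cache key, so the corresponding KV blocks can live once in the shared tier and be reused everywhere. The keying scheme and helper names below are hypothetical, included only to illustrate the reuse-without-duplication idea.

```python
import hashlib

def prefix_key(token_ids):
    """Content-addressed key for a token prefix: identical prefixes on
    different nodes map to the same key (keying scheme is an assumption)."""
    return hashlib.sha256(str(token_ids).encode("utf-8")).hexdigest()

shared_store = {}   # stands in for the shared context tier

def get_or_compute(token_ids, compute_kv):
    key = prefix_key(token_ids)
    if key not in shared_store:           # first node pays the prefill cost
        shared_store[key] = compute_kv(token_ids)
    return shared_store[key]              # later nodes reuse it, no copies

calls = []
def compute(toks):
    calls.append(toks)                    # track how often prefill runs
    return f"kv-for-{len(toks)}-tokens"

prompt = [101, 2023, 2003, 1037]          # same system prompt on two nodes
a = get_or_compute(prompt, compute)
b = get_or_compute(prompt, compute)
```

Here the prefill runs once and both lookups return the same cached entry, which is the economic win of cross-node cache reuse.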
Developer Impact
CMX enables AI providers to efficiently scale agentic, long-context workloads by decoupling memory-intensive KV storage from GPU compute. Developers and infrastructure teams can now build systems that maximize GPU utilization while reducing operational costs—critical for emerging agentic AI applications that maintain multi-turn conversations and persistent agent memory across extended sessions.