Overview
The NVIDIA CMX context memory storage platform addresses a critical bottleneck in modern AI inference: managing the massive Key-Value (KV) cache that accumulates as context windows expand to millions of tokens and models scale toward trillions of parameters. CMX fills the gap between GPU high-bandwidth memory (HBM) and general-purpose storage, providing a specialized tier optimized for ephemeral, latency-sensitive context data at gigascale.
The Problem
As agentic AI systems evolve from stateless chatbots to complex, multi-turn workflows with long-term memory, KV cache capacity requirements grow linearly with sequence length. This puts unprecedented pressure on existing memory hierarchies. Organizations face an unattractive trade-off: store KV cache in scarce, expensive GPU HBM, or fall back on general-purpose storage tiers optimized for durability and data protection rather than AI inference access patterns. Both approaches degrade performance and raise cost per token.
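The linear-growth claim is easy to make concrete with back-of-envelope arithmetic: per-sequence KV cache size is keys plus values, across every layer and KV head, for every token. The sketch below uses illustrative dimensions for a 70B-class model with grouped-query attention (an assumed configuration, not tied to CMX or any specific model):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Per-sequence KV cache size: keys + values (factor of 2) for every layer,
    KV head, and token, at the given element width (default fp16)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative 70B-class GQA config (assumed): 80 layers, 8 KV heads, head_dim 128.
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=1_000_000)
print(f"{size / 1e9:.1f} GB")  # 327.7 GB for a single million-token sequence
```

At roughly 320 KiB per token, one million-token sequence alone exceeds the HBM of any single GPU, before accounting for model weights or concurrent requests, which is the capacity gap a dedicated context tier targets.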
CMX Architecture & Capabilities
Built on the NVIDIA BlueField-4 processor within the Vera Rubin platform's STX rack, CMX establishes a purpose-built context memory tier with the following features:
- 5x higher tokens-per-second and 5x greater power efficiency compared to traditional storage solutions
- RDMA-accelerated, petabyte-scale context storage with low-latency, high-bandwidth connectivity via NVIDIA Spectrum-X Ethernet
- Seamless memory hierarchy extension that bridges GPU HBM and networked storage
- KV cache reuse and sharing across AI nodes, enabling stateless cache distribution at pod-level scale
- Integration with NVIDIA DOCA framework, Dynamo orchestration, and NIXL for coordinated cache placement and workload scheduling
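The cache-reuse idea in the list above can be sketched as a two-tier lookup keyed by a hash of the token prefix: a hit in the capacity tier promotes the blocks toward the GPU instead of recomputing prefill. This is a minimal toy model; the class, the key scheme, and the dictionary tiers are all hypothetical stand-ins, not the CMX, DOCA, Dynamo, or NIXL interfaces:

```python
import hashlib

def prefix_key(token_ids):
    """Stable cache key derived from a token-id prefix (hypothetical scheme)."""
    raw = ",".join(str(t) for t in token_ids).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()

class TieredKVCache:
    """Toy two-tier KV cache: a small 'hbm' tier in front of a large context tier."""
    def __init__(self):
        self.hbm = {}           # stand-in for on-GPU KV blocks
        self.context_tier = {}  # stand-in for the shared context memory tier

    def put(self, token_ids, kv_blocks):
        # New prefixes land in the capacity tier, where any node can find them.
        self.context_tier[prefix_key(token_ids)] = kv_blocks

    def get(self, token_ids):
        key = prefix_key(token_ids)
        if key in self.hbm:               # hot hit: blocks already on-GPU
            return self.hbm[key]
        if key in self.context_tier:      # warm hit: promote into HBM
            self.hbm[key] = self.context_tier[key]
            return self.hbm[key]
        return None                       # miss: prefill must recompute
```

Because the key depends only on the token prefix, any node in a pod that sees the same prefix can retrieve the same blocks, which is what makes the cache "stateless" from the perspective of individual GPU nodes.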
Key Benefits
CMX minimizes inference stalls for long-context and agentic workloads while maximizing GPU utilization and reducing total cost of ownership. By holding reusable inference context and pre-staging it to GPUs, CMX enables higher throughput and responsiveness in production AI factories without requiring additional GPU HBM or sacrificing performance.
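The pre-staging idea above amounts to overlapping the fetch of the next request's stored context with decoding of the current one, so the GPU never stalls waiting on the context tier. A minimal sketch, assuming a simple dictionary as the context store and a function-call stand-in for the decode step (all names are illustrative, not a CMX API):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_context(request_id, store):
    """Stand-in for pulling a reusable KV prefix from the context tier."""
    return store[request_id]

def serve(requests, store):
    """Process requests in order, pre-staging each next request's context
    while the current one is being decoded."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_context, requests[0], store)
        for i, req in enumerate(requests):
            ctx = pending.result()            # context already staged
            if i + 1 < len(requests):         # overlap next fetch with decode
                pending = pool.submit(fetch_context, requests[i + 1], store)
            results.append((req, ctx))        # stand-in for the decode step
    return results
```

In a real deployment the fetch would be an RDMA transfer and the decode a GPU kernel launch, but the scheduling pattern, issuing the next transfer before the current compute finishes, is the same.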
Integration with Vera Rubin
CMX is positioned as a core building block within the NVIDIA Vera Rubin platform, which organizes AI infrastructure into compute, networking, and storage racks serving as configurable components for large-scale AI factories. It complements Vera Rubin's support for the full AI lifecycle—from pretraining and post-training through test-time scaling and real-time agentic inference.