NVIDIA CMX Context Memory Platform Delivers 5x Throughput Gains for Long-Context AI Inference
· release · platform · feature · performance · developer.nvidia.com ↗

NVIDIA CMX Context Memory Storage Platform

NVIDIA has introduced CMX (Context Memory Storage), a new tier in its Vera Rubin AI infrastructure platform purpose-built for long-context, agentic AI inference. The platform addresses a critical bottleneck in modern AI systems: as context windows expand to millions of tokens and models scale to trillions of parameters, Key-Value (KV) cache management becomes increasingly challenging.

The Problem: Memory Hierarchy Mismatch

Traditional infrastructure forces organizations to choose between GPU high-bandwidth memory (HBM), which is scarce and expensive, and general-purpose storage tiers optimized for durability rather than latency-sensitive AI workloads. Because agentic AI systems need persistent context across multiple turns and sessions, KV cache capacity requirements grow linearly with context length, while the cost of recomputing discarded context grows far faster (attention cost scales roughly quadratically with sequence length), making efficient reuse essential.
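
To put rough numbers on that growth, the sketch below sizes the KV cache for a hypothetical 70B-class model with grouped-query attention. The layer count, head count, and precision are illustrative assumptions, not NVIDIA-published figures.

```python
# Back-of-the-envelope KV cache sizing for a hypothetical dense
# transformer. All model parameters here are illustrative assumptions.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens,
                   bytes_per_value=2):  # 2 bytes per value for fp16/bf16
    # Each token stores one key and one value vector per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens

# Example: a 70B-class model with grouped-query attention (8 KV heads).
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                      context_tokens=1_000_000)
print(f"{size / 1e9:.1f} GB per 1M-token context")  # prints ~327.7 GB
```

Under these assumptions a single million-token context already consumes hundreds of gigabytes of KV cache, and concurrent sessions multiply that figure, which is the capacity pressure an intermediate context-memory tier is meant to absorb.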

How CMX Works

CMX, powered by NVIDIA BlueField-4 data processing units, creates an optimized intermediate storage tier that bridges GPU memory and traditional storage infrastructure. Key features include (see the sketch after this list):

  • 5x higher tokens-per-second throughput compared to traditional storage systems
  • 5x greater power efficiency through purpose-built optimization for ephemeral KV cache
  • RDMA acceleration via NVIDIA Spectrum-X Ethernet for low-latency, predictable data access
  • Petabyte-scale capacity enabling scalable KV cache sharing across AI factory pods
  • Seamless GPU memory extension, allowing agentic workflows to operate on contexts of millions of tokens
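
The sketch below illustrates the tiering idea in miniature: hot KV blocks stay in GPU HBM, demoted blocks move to a larger intermediate tier, and cold blocks are evicted and recomputed on demand. The class and method names are hypothetical illustrations of the concept, not the CMX software interface.

```python
# Minimal sketch of a two-tier KV-block cache: hot blocks in GPU HBM,
# demoted blocks in a larger intermediate tier, everything else
# recomputed. Names are illustrative assumptions, not the CMX API.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_blocks, cmx_blocks):
        self.hbm = OrderedDict()   # block_id -> KV block (fastest, smallest tier)
        self.cmx = OrderedDict()   # block_id -> KV block (larger, network-attached tier)
        self.hbm_blocks = hbm_blocks
        self.cmx_blocks = cmx_blocks

    def get(self, block_id):
        if block_id in self.hbm:                 # hit in GPU memory
            self.hbm.move_to_end(block_id)
            return self.hbm[block_id]
        if block_id in self.cmx:                 # promote from the intermediate tier
            block = self.cmx.pop(block_id)
            self._put_hbm(block_id, block)
            return block
        return None                              # miss: caller recomputes the block

    def put(self, block_id, block):
        self._put_hbm(block_id, block)

    def _put_hbm(self, block_id, block):
        self.hbm[block_id] = block
        self.hbm.move_to_end(block_id)
        while len(self.hbm) > self.hbm_blocks:   # demote LRU blocks instead of dropping them
            old_id, old_block = self.hbm.popitem(last=False)
            self.cmx[old_id] = old_block
            while len(self.cmx) > self.cmx_blocks:
                self.cmx.popitem(last=False)     # evict from the intermediate tier last
```

The design point mirrors the article's claim: demotion to a fast intermediate tier is far cheaper than discarding the blocks and recomputing a multimillion-token prefix from scratch.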

Integration with Vera Rubin Platform

CMX integrates into NVIDIA's Vera Rubin platform—a modular infrastructure for the full AI lifecycle from pretraining through agentic inference. The platform uses NVIDIA orchestration tools (NVIDIA Dynamo and NIXL) to manage context placement, KV block scheduling, and workload distribution across the enhanced memory hierarchy. This enables stateless sharing of KV cache across multiple AI nodes while maximizing GPU utilization and reducing inference latency.
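
As a rough illustration of that kind of placement decision, the toy policy below routes KV blocks to GPU memory, the intermediate tier, or recomputation based on how actively and how widely a prefix is reused. It is a conceptual sketch under assumed inputs, not the NVIDIA Dynamo or NIXL interface.

```python
# Toy placement policy: decide where a KV block should live based on
# activity and cross-session reuse. Field names and thresholds are
# illustrative assumptions, not part of any NVIDIA tool's API.

def choose_tier(block, hbm_free_blocks):
    """Return 'hbm', 'cmx', or 'recompute' for a KV block."""
    if block["active_sessions"] > 0 and hbm_free_blocks > 0:
        return "hbm"         # needed by a running decode step right now
    if block["reuse_count"] >= 2 or block["is_shared_prefix"]:
        return "cmx"         # likely reused: keep it in the intermediate tier
    return "recompute"       # cold, unshared context: cheaper to regenerate

blocks = [
    {"id": "sys-prompt", "active_sessions": 3, "reuse_count": 40, "is_shared_prefix": True},
    {"id": "turn-17",    "active_sessions": 0, "reuse_count": 1,  "is_shared_prefix": False},
]
for b in blocks:
    print(b["id"], "->", choose_tier(b, hbm_free_blocks=128))
```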

Target Use Cases

The platform is designed for organizations deploying complex, multi-turn agentic AI systems that require long-term memory preservation, tool integration, and iterative reasoning. By efficiently managing context storage, CMX enables AI factories to scale inference operations without proportionally increasing capital and operational expenses.

NVIDIA CMX is available as part of the Vera Rubin platform for AI infrastructure deployments.