NVIDIA introduces CMX context memory storage for AI inference, claiming 5x higher throughput
· feature · platform · performance · api · developer.nvidia.com ↗

Overview

NVIDIA has introduced CMX, a new context memory storage platform designed to address the scaling challenges of modern agentic AI systems. Built on the NVIDIA BlueField-4 processor and integrated into the Vera Rubin platform, CMX creates a dedicated storage tier optimized for serving Key-Value (KV) cache, the stored attention keys and values that let a transformer model reuse earlier context across inference steps instead of recomputing it.
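To make the role of the KV cache concrete, here is a minimal NumPy sketch (not NVIDIA code) of autoregressive decoding: at each step only the new token's key and value are computed and appended, while attention reads from the accumulated cache rather than reprocessing the whole prefix.

```python
import numpy as np

def attention(q, K, V):
    # Scaled dot-product attention for a single query vector.
    scores = q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 8
Wk = rng.normal(size=(d, d))  # toy key projection
Wv = rng.normal(size=(d, d))  # toy value projection

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
outputs = []
for step in range(4):
    x = rng.normal(size=d)                      # new token embedding
    # Only the NEW token's K/V rows are computed; earlier rows come
    # from the cache instead of being recomputed every step.
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    outputs.append(attention(x, K_cache, V_cache))
```

The cache grows by one row per token per layer, which is why long contexts and many concurrent sessions make KV storage, rather than compute, the binding constraint.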

The Problem: Growing Context Requirements

As AI models scale toward trillions of parameters and context windows expand to millions of tokens, organizations face mounting pressure on memory hierarchies. Traditional approaches force a painful tradeoff: expensive GPU high-bandwidth memory (HBM) with limited capacity, or general-purpose storage tiers designed for durability rather than latency-sensitive inference workloads. This gap leaves GPUs underutilized and drives up both power consumption and cost-per-token.

CMX Architecture and Performance

CMX bridges this gap by providing:

  • Petabyte-scale context storage: Purpose-built for ephemeral, latency-sensitive KV cache serving at scale
  • RDMA acceleration: NVIDIA Spectrum-X Ethernet enables predictable, low-latency connectivity for consistent data access
  • 5x performance improvement: 5x higher tokens-per-second (TPS) and 5x greater power efficiency versus traditional storage
  • Seamless GPU memory extension: Integrates into the Vera Rubin pod-level architecture across compute, networking, and storage racks
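The tiering idea behind these points can be illustrated with a toy two-tier store (class and method names here are hypothetical, not the CMX API): a small, fast "HBM" tier spills least-recently-used KV blocks to a large context-storage tier instead of discarding them, and promotes blocks back on access.

```python
from collections import OrderedDict

class TieredKVStore:
    """Toy two-tier KV-block store: a capacity-limited fast tier that
    spills least-recently-used blocks to a large context-storage tier."""

    def __init__(self, hbm_capacity):
        self.hbm_capacity = hbm_capacity
        self.hbm = OrderedDict()   # fast tier, limited capacity, LRU order
        self.context_store = {}    # slow tier, effectively unbounded

    def put(self, block_id, kv_block):
        self.hbm[block_id] = kv_block
        self.hbm.move_to_end(block_id)
        while len(self.hbm) > self.hbm_capacity:
            # Spill the least-recently-used block rather than dropping it,
            # so its context never has to be recomputed.
            evicted_id, evicted = self.hbm.popitem(last=False)
            self.context_store[evicted_id] = evicted

    def get(self, block_id):
        if block_id in self.hbm:               # fast-tier hit
            self.hbm.move_to_end(block_id)
            return self.hbm[block_id]
        block = self.context_store.pop(block_id)  # fetch from context tier
        self.put(block_id, block)                 # promote back to fast tier
        return block
```

In the real system the slow tier is network-attached over RDMA, so a "fetch" is a low-latency Spectrum-X transfer rather than a KV recomputation on the GPU.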

Integration and Orchestration

The platform leverages existing NVIDIA ecosystem tools:

  • NVIDIA DOCA framework: Manages data processing and context handling
  • NVIDIA Dynamo and NIXL: Coordinate KV block management, context placement, and workload scheduling across the memory hierarchy
  • Stateless KV cache sharing: Enables efficient reuse of cached context across AI nodes without duplication
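One common way to realize cache sharing without duplication, sketched below as an assumption rather than a description of NVIDIA's implementation, is content-addressed lookup: identical token prefixes hash to the same key on every node, so a shared store holds exactly one copy and the first requester pays the compute cost.

```python
import hashlib

def prefix_key(tokens):
    """Content-addressed key for a token prefix: the same prefix on any
    node maps to the same key, so the shared tier stores one copy."""
    return hashlib.sha256(" ".join(map(str, tokens)).encode()).hexdigest()

shared_store = {}  # stands in for a network-attached KV-cache tier

def get_or_compute(tokens, compute_kv):
    key = prefix_key(tokens)
    if key not in shared_store:
        # First node to see this prefix computes and publishes the blocks.
        shared_store[key] = compute_kv(tokens)
    return shared_store[key]  # later requests reuse, no duplication
```

Because the lookup depends only on the token content, no node needs to track which peer produced a block, which is what makes the sharing "stateless".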

Developer Impact

CMX enables AI providers to efficiently scale agentic, long-context workloads by decoupling memory-intensive KV storage from GPU compute. Developers and infrastructure teams can now build systems that maximize GPU utilization while reducing operational costs—critical for emerging agentic AI applications that maintain multi-turn conversations and persistent agent memory across extended sessions.