NVIDIA launches CMX context memory storage platform for agentic AI inference, achieving 5x higher throughput
Tags: release, platform, performance, integration · Source: developer.nvidia.com

NVIDIA CMX: A New Memory Tier for AI Inference

As AI models scale to trillions of parameters with context windows extending to millions of tokens, traditional memory hierarchies are becoming bottlenecks. The Key-Value (KV) cache, the mechanism that preserves inference context in transformer models, now persists across longer sessions and must be shared across multiple inference services. NVIDIA's new CMX platform introduces a purpose-built storage tier to address this challenge.
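To make the problem concrete: the KV cache is the per-token key/value state that attention layers accumulate during decoding. A minimal pure-Python sketch (illustrative only: one head, no batching, stand-in projections rather than a real model) shows why the cache grows linearly with context length, and why recomputing it on every decode step would be wasteful:

```python
import math

def attention(q, keys, values):
    """Scaled dot-product attention for one query over cached keys/values."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max before exp for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(d)]

# Each decode step appends one (key, value) pair instead of re-projecting
# the entire prefix. This ever-growing cache is the state a tier like CMX
# would hold once it no longer fits in GPU HBM.
kv_cache = {"keys": [], "values": []}
d = 4
for step in range(16):
    k = [float(step + i) for i in range(d)]   # stand-in key projection
    v = [float(step - i) for i in range(d)]   # stand-in value projection
    kv_cache["keys"].append(k)
    kv_cache["values"].append(v)

q = [1.0, 0.0, 0.0, 0.0]
out = attention(q, kv_cache["keys"], kv_cache["values"])
print(len(kv_cache["keys"]), len(out))  # cache length tracks context length
```

The cache trades memory for compute: one append per token instead of an O(context) recomputation, which is exactly the state that grows too large for HBM at million-token contexts.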

What is CMX?

The NVIDIA CMX (Context Memory Storage) platform is a BlueField-4-powered storage tier integrated into the Vera Rubin AI infrastructure platform. It sits between GPU high-bandwidth memory (HBM) and general-purpose storage, optimized specifically for ephemeral, latency-sensitive KV cache at gigascale. CMX leverages NVIDIA Spectrum-X Ethernet for predictable, low-latency RDMA connectivity, enabling efficient sharing of KV cache across AI nodes with minimal jitter.

Key Performance Improvements

  • 5x higher throughput: CMX delivers up to 5x greater tokens-per-second (TPS) compared to traditional storage solutions
  • 5x better power efficiency: Optimized for AI-native workloads, reducing power consumption per token
  • Petabyte-scale capacity: Enables persistent KV cache reuse across agentic workflows and long-context inference sessions
  • Reduced GPU stalls: Minimizes computation interruptions by prestaging inference context efficiently
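A back-of-envelope sizing calculation (with illustrative model dimensions chosen here, not figures from the announcement) shows why persistent KV cache reaches petabyte scale once many long-context sessions are retained:

```python
def kv_cache_bytes(layers, hidden_dim, tokens, bytes_per_elem=2):
    """Approximate KV-cache footprint: 2 tensors (K and V) per layer,
    each tokens x hidden_dim, at fp16 (2 bytes per element)."""
    return 2 * layers * tokens * hidden_dim * bytes_per_elem

# Hypothetical large model: 80 layers, 8192 hidden dim, 1M-token context.
per_session = kv_cache_bytes(layers=80, hidden_dim=8192, tokens=1_000_000)
print(f"{per_session / 1e12:.2f} TB per 1M-token session")  # 2.62 TB
```

At roughly 2.6 TB per million-token session under these assumptions, caching context for a few hundred concurrent agentic sessions already crosses the petabyte threshold, which is the regime CMX targets.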

Architecture and Integration

CMX operates as part of NVIDIA's Vera Rubin platform, which organizes AI infrastructure into compute, networking, and storage racks as configurable building blocks. The DOCA framework and orchestration tools like NVIDIA Dynamo and NIXL coordinate context placement and KV block management across the memory hierarchy. This enables stateless sharing of KV cache and maximizes throughput and responsiveness for next-generation AI factories.
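The placement logic described above can be sketched as a two-tier KV-block store: a small, fast "HBM" tier backed by a larger overflow tier, with least-recently-used blocks offloaded rather than discarded and recomputed. This is a toy illustration of the tiering idea only; the actual placement and transfer APIs belong to NVIDIA Dynamo and NIXL and are not reproduced here.

```python
from collections import OrderedDict

class TieredKVStore:
    """Toy two-tier KV-block store: LRU eviction from a capacity-limited
    'HBM' tier into a larger 'CMX-like' tier (hypothetical sketch)."""

    def __init__(self, hbm_blocks):
        self.hbm = OrderedDict()   # block_id -> data, kept in LRU order
        self.cmx = {}              # overflow tier: offloaded KV blocks
        self.capacity = hbm_blocks

    def put(self, block_id, data):
        self.hbm[block_id] = data
        self.hbm.move_to_end(block_id)          # mark as most recently used
        while len(self.hbm) > self.capacity:
            evicted_id, evicted = self.hbm.popitem(last=False)
            self.cmx[evicted_id] = evicted      # offload instead of recompute

    def get(self, block_id):
        if block_id in self.hbm:
            self.hbm.move_to_end(block_id)
            return self.hbm[block_id]
        data = self.cmx.pop(block_id)           # prestage back into HBM
        self.put(block_id, data)
        return data

store = TieredKVStore(hbm_blocks=2)
for i in range(4):                  # blocks 0 and 1 overflow to the CMX tier
    store.put(i, f"kv-block-{i}")
restaged = store.get(0)             # hit in overflow tier, restaged into HBM
print(restaged)
```

The payoff in a real system is that a "miss" in GPU memory becomes a low-latency RDMA fetch over Spectrum-X rather than a full prefill recomputation, which is what keeps GPUs from stalling.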

Implications for AI Developers

Organizations scaling agentic AI systems can now reduce pressure on expensive GPU memory by offloading ephemeral KV cache to CMX. This approach maintains GPU utilization while supporting longer context windows and multi-turn reasoning workflows. Developers building long-context and agentic AI applications should evaluate CMX as part of their infrastructure planning for improved cost-per-token and overall system efficiency.