NVIDIA launches CMX context memory platform to reduce AI inference costs by 5x with BlueField-4 storage tier
· release · platform · integration · performance · developer.nvidia.com ↗

NVIDIA CMX: A Purpose-Built Storage Tier for Agentic AI

NVIDIA has introduced the CMX (Context Memory Storage) platform as a critical new tier in its Vera Rubin AI factory architecture. This storage system leverages NVIDIA BlueField-4 data processing units and is specifically designed to address growing scalability challenges posed by agentic AI workloads with million-token context windows and trillion-parameter models.

The Problem: Pressure on Memory Hierarchies

As AI models scale and context windows expand to millions of tokens, the Key-Value (KV) cache, the mechanism that preserves inference context between generation steps, grows proportionally with context length. Organizations currently face a difficult choice: keep the KV cache in scarce, expensive GPU high-bandwidth memory (HBM), or relegate it to general-purpose storage optimized for durability rather than low-latency access. Either path drives up power consumption, inflates cost per token, and leaves GPUs underutilized.
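To see why this pressure is real, the per-sequence KV cache size can be estimated directly from model shape: keys plus values, for every layer, for every token. The model dimensions below are a hypothetical 70B-class configuration with grouped-query attention, chosen only to illustrate the scale, not any specific NVIDIA model:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Per-sequence KV cache size: 2 tensors (K and V) per layer,
    each of shape (num_kv_heads, seq_len, head_dim), in fp16 by default."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical 70B-class model with grouped-query attention (GQA):
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                      seq_len=1_000_000)
print(f"{size / 1e9:.0f} GB")  # ~328 GB for one million-token session
```

A single million-token session at these dimensions needs hundreds of gigabytes of KV state, several times the HBM of any current GPU, which is exactly the gap a middle storage tier targets.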

CMX Architecture and Performance

The CMX platform bridges this gap by providing a dedicated middle tier optimized for ephemeral, latency-sensitive inference context. Key capabilities include:

  • 5x higher tokens-per-second (TPS) throughput compared to traditional storage
  • 5x greater power efficiency for serving KV cache at scale
  • Petabyte-scale storage enabling scalable KV cache reuse across multiple inference sessions
  • NVIDIA Spectrum-X Ethernet integration for predictable, low-latency RDMA connectivity at gigascale
  • Seamless GPU memory extension within pod-level architecture alongside NVIDIA BlueField-4 networking
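The reuse capability above amounts to treating KV blocks as addressable objects in a tier below HBM: hot blocks stay in GPU memory, cold ones spill to the larger store, and identical prompt prefixes can be served from either tier instead of being recomputed. A minimal sketch of that tiering logic, assuming nothing about NVIDIA's actual implementation (class and field names here are illustrative):

```python
from collections import OrderedDict

class TieredKVCache:
    """Two-tier KV cache sketch: a small LRU fast tier (standing in for GPU
    HBM) backed by a large slow tier (standing in for a CMX-like context
    store). Blocks are keyed by a hash of their token prefix so identical
    prefixes can be reused across sessions."""

    def __init__(self, hbm_capacity_blocks: int):
        self.hbm = OrderedDict()   # fast tier, least-recently-used first
        self.context_store = {}    # slow tier, effectively unbounded here
        self.capacity = hbm_capacity_blocks

    def put(self, prefix_hash: str, kv_block: bytes) -> None:
        self.hbm[prefix_hash] = kv_block
        self.hbm.move_to_end(prefix_hash)
        while len(self.hbm) > self.capacity:
            # Evict the LRU block to the context store rather than
            # discarding it, so a later session can reuse it.
            evicted_hash, evicted_block = self.hbm.popitem(last=False)
            self.context_store[evicted_hash] = evicted_block

    def get(self, prefix_hash: str):
        if prefix_hash in self.hbm:
            self.hbm.move_to_end(prefix_hash)
            return self.hbm[prefix_hash]
        if prefix_hash in self.context_store:
            # "Fetch" over the fabric and promote back into the fast tier.
            block = self.context_store.pop(prefix_hash)
            self.put(prefix_hash, block)
            return block
        return None  # true miss: prefill must recompute this prefix
```

The power and throughput claims follow from the same idea: fetching a cached block over a low-latency fabric is far cheaper than re-running prefill attention over the whole prefix.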

Integration with Vera Rubin Platform

CMX operates within NVIDIA's Vera Rubin platform, which organizes AI infrastructure into compute, networking, and storage racks as configurable building blocks. The platform is complemented by orchestration tools such as NVIDIA Dynamo and NIXL, which coordinate context placement, KV block management, and workload scheduling across the memory hierarchy, enabling stateless sharing of context across AI nodes.
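One concrete scheduling decision this kind of orchestration makes is cache-aware routing: send a request to the worker that already holds the longest cached prefix of its prompt, so only the uncached suffix needs prefill. The toy sketch below illustrates the idea only; the functions and data structures are assumptions, not the Dynamo or NIXL APIs:

```python
def longest_cached_prefix(prompt: tuple, cached_prefixes: set) -> int:
    """Length of the longest prefix of `prompt` present in a worker's cache.
    Prefixes are modeled as tuples of token ids for simplicity."""
    for n in range(len(prompt), 0, -1):
        if prompt[:n] in cached_prefixes:
            return n
    return 0

def route(prompt: tuple, workers: dict) -> str:
    """Pick the worker holding the most reusable context for `prompt`.
    `workers` maps a worker id to the set of prefixes it has cached."""
    return max(workers, key=lambda w: longest_cached_prefix(prompt, workers[w]))
```

A worker with a three-token cached prefix beats one with a two-token prefix, because only one token of prefill remains; at million-token contexts that difference dominates time-to-first-token.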

Developer Impact

This infrastructure advancement enables organizations to efficiently handle agentic reasoning workloads that require persistent context across multiple turns and sessions. By reducing the pressure on GPU memory and traditional storage, developers can build more complex multi-agent systems while maintaining cost efficiency and throughput performance.