NVIDIA launches CMX context memory storage platform powered by BlueField-4; claims 5x throughput and efficiency gains
· release · feature · platform · performance · developer.nvidia.com ↗

New Context Memory Tier for AI Inference

NVIDIA has announced CMX (Context Memory Storage), a purpose-built storage platform within the Vera Rubin AI factory infrastructure that addresses growing challenges in serving large-scale AI inference. As agentic AI systems evolve with context windows spanning millions of tokens and models approaching trillions of parameters, traditional memory hierarchies struggle to efficiently manage Key-Value (KV) cache—the critical data structure that preserves inference context and prevents recomputation of history.
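To make the KV-cache mechanics concrete, here is a minimal, framework-free sketch of why autoregressive decoding keeps keys and values around: with a cache, each step projects only the newly arrived tokens instead of the whole history. The shapes, weights, and function names are illustrative assumptions, not part of any NVIDIA API.

```python
# Toy illustration of KV caching in autoregressive decoding.
# All dimensions and names are illustrative, not tied to CMX.
import numpy as np

D = 8  # head dimension (illustrative)
rng = np.random.default_rng(0)
Wk, Wv = rng.standard_normal((D, D)), rng.standard_normal((D, D))

def decode(tokens, kv_cache=None):
    """Project K/V only for tokens not already covered by the cache."""
    if kv_cache is None:
        kv_cache = (np.empty((0, D)), np.empty((0, D)))
    K, V = kv_cache
    new = tokens[K.shape[0]:]          # tokens whose K/V are still missing
    K = np.vstack([K, new @ Wk])       # append projections for new tokens only
    V = np.vstack([V, new @ Wv])
    return (K, V), len(new)            # cache plus count of projections done

seq = rng.standard_normal((5, D))      # a 5-token "prompt"
cache, work = decode(seq)              # first pass projects all 5 tokens
longer = np.vstack([seq, rng.standard_normal((1, D))])
cache, work2 = decode(longer, cache)   # next step projects only the 1 new token
print(work, work2)                     # 5, then only 1
```

Dropping the cache would force `work2` back to the full sequence length every step, which is exactly the recomputation the article says CMX is built to avoid at scale.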

Architecture and Technical Capabilities

CMX is powered by NVIDIA BlueField-4 processors and organized as a new tier in the Vera Rubin pod-level architecture. Key features include:

  • 5x higher tokens-per-second throughput compared to traditional storage solutions
  • 5x greater power efficiency for serving ephemeral KV cache
  • Petabyte-scale capacity with RDMA-accelerated access via Spectrum-X Ethernet
  • Ultra-low latency connectivity ensuring consistent, predictable performance at scale
  • Seamless GPU memory extension across compute pods

The platform bridges the gap between GPU high-bandwidth memory (HBM)—which has limited capacity—and general-purpose storage, which is optimized for durability rather than the latency-sensitive, ephemeral nature of inference context.
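The hierarchy described above can be sketched as a two-tier store: a small fast tier standing in for GPU HBM, backed by a large-capacity tier standing in for CMX, with a miss in both forcing full recomputation. The class, eviction policy, and promotion logic here are illustrative assumptions, not CMX's actual protocol.

```python
# Conceptual sketch of the tiering described in the article:
# tiny fast tier (HBM stand-in) -> large context tier (CMX stand-in)
# -> recompute on a total miss. Names and policies are assumptions.
from collections import OrderedDict

class TieredKVStore:
    def __init__(self, hbm_slots):
        self.hbm = OrderedDict()   # small, fast tier with LRU eviction
        self.context = {}          # large-capacity context tier
        self.hbm_slots = hbm_slots
        self.recomputes = 0

    def put(self, key, kv):
        self.hbm[key] = kv
        self.hbm.move_to_end(key)
        if len(self.hbm) > self.hbm_slots:       # spill coldest entry down a tier
            old_key, old_kv = self.hbm.popitem(last=False)
            self.context[old_key] = old_kv

    def get(self, key, recompute):
        if key in self.hbm:                      # fast-path hit
            self.hbm.move_to_end(key)
            return self.hbm[key]
        if key in self.context:                  # promote from the context tier
            kv = self.context.pop(key)
            self.put(key, kv)
            return kv
        self.recomputes += 1                     # total miss: pay the full cost
        kv = recompute(key)
        self.put(key, kv)
        return kv

store = TieredKVStore(hbm_slots=2)
for k in ["a", "b", "c"]:
    store.put(k, f"kv[{k}]")                     # "a" spills to the context tier
hit = store.get("a", recompute=lambda k: f"kv[{k}]")
print(hit, store.recomputes)
```

The point of the middle tier is visible in the counter: the spilled entry comes back without a recompute, which is the cost general-purpose storage (tuned for durability, not latency) would struggle to deliver.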

Use Cases and Benefits

CMX addresses critical pain points in modern AI deployments:

  • Efficient KV cache reuse across multiple inference requests, eliminating redundant computation
  • Long-context agentic workflows that maintain state across turns, tools, and sessions
  • Scalable long-term memory for AI agents building on prior reasoning
  • Improved GPU utilization by offloading context storage without sacrificing performance
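Cross-request KV reuse, the first point in the list above, is commonly implemented by hashing the shared prompt prefix in fixed-size blocks so that requests with a common prefix skip those blocks entirely. The block size, hashing scheme, and granularity below are illustrative assumptions; the article does not specify how CMX identifies reusable context.

```python
# Minimal sketch of cross-request KV reuse via shared-prefix block
# matching. Block size and cache keying are illustrative assumptions.
BLOCK = 4           # tokens per cache block (illustrative)

cache = {}          # prefix -> precomputed KV (stubbed as a string)
computed_blocks = 0

def serve(tokens):
    """Return per-block KV, computing only blocks never seen before."""
    global computed_blocks
    kvs, prefix = [], ()
    for i in range(0, len(tokens), BLOCK):
        prefix += tuple(tokens[i:i + BLOCK])    # block identity includes its prefix
        if prefix not in cache:
            computed_blocks += 1
            cache[prefix] = f"kv{len(cache)}"   # stand-in for real K/V tensors
        kvs.append(cache[prefix])
    return kvs

serve(list("the quick brown "))     # 16 tokens -> 4 blocks computed
serve(list("the quick brown fox"))  # 4 blocks reused; only the last block is new
print(computed_blocks)
```

Because the second request shares its first four blocks with the first, only one new block is computed; at data-center scale this is the "eliminating redundant computation" benefit the list describes.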

The platform coordinates with NVIDIA orchestration tools like Dynamo and NIXL for intelligent context placement and workload scheduling across the memory hierarchy.

Developer Action Items

Organizations adopting Vera Rubin infrastructure can integrate CMX into AI factory architectures to improve inference throughput and reduce operational costs for long-context and agentic workloads. The platform is designed to integrate with NVIDIA's broader DOCA framework and Spectrum-X networking ecosystem.