NVIDIA launches CMX context memory storage platform powered by BlueField-4; claims 5x throughput and efficiency gains
· release · feature · platform · performance · developer.nvidia.com ↗

New Context Memory Tier for AI Inference

NVIDIA has announced CMX (Context Memory Storage), a purpose-built storage platform within the Vera Rubin AI factory infrastructure that addresses growing challenges in serving large-scale AI inference. As agentic AI systems evolve with context windows spanning millions of tokens and models approaching trillions of parameters, traditional memory hierarchies struggle to efficiently manage Key-Value (KV) cache—the critical data structure that preserves inference context and prevents recomputation of history.
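To make the KV-cache mechanics concrete, here is a minimal, framework-free sketch of why autoregressive decoding keeps keys and values around: with a cache, each step projects only the newly arrived tokens instead of the whole history. The shapes, weights, and function names are illustrative assumptions, not part of any NVIDIA API.

```python
# Toy illustration of KV caching in autoregressive decoding.
# All dimensions and names are illustrative, not tied to CMX.
import numpy as np

D = 8  # head dimension (illustrative)
rng = np.random.default_rng(0)
Wk, Wv = rng.standard_normal((D, D)), rng.standard_normal((D, D))

def decode(tokens, kv_cache=None):
    """Project K/V only for tokens not already covered by the cache."""
    if kv_cache is None:
        kv_cache = (np.empty((0, D)), np.empty((0, D)))
    K, V = kv_cache
    new = tokens[K.shape[0]:]          # tokens whose K/V are still missing
    K = np.vstack([K, new @ Wk])       # append projections for new tokens only
    V = np.vstack([V, new @ Wv])
    return (K, V), len(new)            # cache plus count of projections done

seq = rng.standard_normal((5, D))      # a 5-token "prompt"
cache, work = decode(seq)              # first pass projects all 5 tokens
longer = np.vstack([seq, rng.standard_normal((1, D))])
cache, work2 = decode(longer, cache)   # next step projects only the 1 new token
print(work, work2)                     # 5, then only 1
```

Dropping the cache would force `work2` back to the full sequence length every step, which is exactly the recomputation the article says CMX is built to avoid at scale.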

Architecture and Technical Capabilities

CMX is powered by NVIDIA BlueField-4 processors and organized as a new tier in the Vera Rubin pod-level architecture. Key features include:

  • 5x higher tokens-per-second throughput compared to traditional storage solutions
  • 5x greater power efficiency for serving ephemeral KV cache
  • Petabyte-scale capacity with RDMA-accelerated access via Spectrum-X Ethernet
  • Ultra-low latency connectivity ensuring consistent, predictable performance at scale
  • Seamless GPU memory extension across compute pods

The platform bridges the gap between GPU high-bandwidth memory (HBM)—which has limited capacity—and general-purpose storage, which is optimized for durability rather than the latency-sensitive, ephemeral nature of inference context.
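The hierarchy described above can be sketched as a two-tier store: a small fast tier standing in for GPU HBM, backed by a large-capacity tier standing in for CMX, with a miss in both forcing full recomputation. The class, eviction policy, and promotion logic here are illustrative assumptions, not CMX's actual protocol.

```python
# Conceptual sketch of the tiering described in the article:
# tiny fast tier (HBM stand-in) -> large context tier (CMX stand-in)
# -> recompute on a total miss. Names and policies are assumptions.
from collections import OrderedDict

class TieredKVStore:
    def __init__(self, hbm_slots):
        self.hbm = OrderedDict()   # small, fast tier with LRU eviction
        self.context = {}          # large-capacity context tier
        self.hbm_slots = hbm_slots
        self.recomputes = 0

    def put(self, key, kv):
        self.hbm[key] = kv
        self.hbm.move_to_end(key)
        if len(self.hbm) > self.hbm_slots:       # spill coldest entry down a tier
            old_key, old_kv = self.hbm.popitem(last=False)
            self.context[old_key] = old_kv

    def get(self, key, recompute):
        if key in self.hbm:                      # fast-path hit
            self.hbm.move_to_end(key)
            return self.hbm[key]
        if key in self.context:                  # promote from the context tier
            kv = self.context.pop(key)
            self.put(key, kv)
            return kv
        self.recomputes += 1                     # total miss: pay the full cost
        kv = recompute(key)
        self.put(key, kv)
        return kv

store = TieredKVStore(hbm_slots=2)
for k in ["a", "b", "c"]:
    store.put(k, f"kv[{k}]")                     # "a" spills to the context tier
hit = store.get("a", recompute=lambda k: f"kv[{k}]")
print(hit, store.recomputes)
```

The point of the middle tier is visible in the counter: the spilled entry comes back without a recompute, which is the cost general-purpose storage (tuned for durability, not latency) would struggle to deliver.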

Use Cases and Benefits

CMX addresses critical pain points in modern AI deployments:

  • Efficient KV cache reuse across multiple inference requests, eliminating redundant computation
  • Long-context agentic workflows that maintain state across turns, tools, and sessions
  • Scalable long-term memory for AI agents building on prior reasoning
  • Improved GPU utilization by offloading context storage without sacrificing performance
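Cross-request KV reuse, the first point in the list above, is commonly implemented by hashing the shared prompt prefix in fixed-size blocks so that requests with a common prefix skip those blocks entirely. The block size, hashing scheme, and granularity below are illustrative assumptions; the article does not specify how CMX identifies reusable context.

```python
# Minimal sketch of cross-request KV reuse via shared-prefix block
# matching. Block size and cache keying are illustrative assumptions.
BLOCK = 4           # tokens per cache block (illustrative)

cache = {}          # prefix -> precomputed KV (stubbed as a string)
computed_blocks = 0

def serve(tokens):
    """Return per-block KV, computing only blocks never seen before."""
    global computed_blocks
    kvs, prefix = [], ()
    for i in range(0, len(tokens), BLOCK):
        prefix += tuple(tokens[i:i + BLOCK])    # block identity includes its prefix
        if prefix not in cache:
            computed_blocks += 1
            cache[prefix] = f"kv{len(cache)}"   # stand-in for real K/V tensors
        kvs.append(cache[prefix])
    return kvs

serve(list("the quick brown "))     # 16 tokens -> 4 blocks computed
serve(list("the quick brown fox"))  # 4 blocks reused; only the last block is new
print(computed_blocks)
```

Because the second request shares its first four blocks with the first, only one new block is computed; at data-center scale this is the "eliminating redundant computation" benefit the list describes.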

The platform coordinates with NVIDIA orchestration tools like Dynamo and NIXL for intelligent context placement and workload scheduling across the memory hierarchy.

Developer Action Items

Organizations adopting Vera Rubin infrastructure can integrate CMX into AI factory architectures to improve inference throughput and reduce operational costs for long-context and agentic workloads. The platform is designed to integrate with NVIDIA's broader DOCA framework and Spectrum-X networking ecosystem.