NVIDIA
NVIDIA launches CMX context memory platform with BlueField-4, delivering 5x higher throughput for long-context AI inference
· platform · feature · performance · release · developer.nvidia.com

Context Storage Challenge

As agentic AI systems scale to handle millions of tokens and models reach trillions of parameters, organizations face a critical bottleneck: key-value (KV) cache management. Traditional memory hierarchies force a difficult choice between scarce GPU high-bandwidth memory (HBM) and general-purpose storage optimized for durability rather than latency-sensitive inference. Either compromise drives up power consumption, inflates cost per token, and leaves expensive GPUs underutilized.
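
A rough size estimate shows why the KV cache outgrows HBM at these context lengths. The model dimensions below are illustrative assumptions, not a published NVIDIA configuration:

```python
# Back-of-the-envelope KV cache footprint for one long-context request.
# Every dimension here is an assumed, illustrative value.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Keys and values are both stored, per layer, per token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical 80-layer model, 8 KV heads of dimension 128, FP16,
# serving a single 1M-token context:
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=1_000_000)
print(f"{size / 1e9:.1f} GB")  # ~327.7 GB for one request, well beyond a single GPU's HBM
```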

NVIDIA CMX Platform

NVIDIA CMX establishes an optimized intermediate storage tier purpose-built for long-context, agentic reasoning. Powered by NVIDIA BlueField-4 processors, CMX acts as a high-bandwidth, flash-based memory tier that holds latency-sensitive, reusable inference context and prestages it to GPUs ahead of use, increasing utilization. The platform integrates with NVIDIA Spectrum-X Ethernet to provide predictable, low-latency RDMA connectivity for consistent data access to shared KV cache at scale.
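
As a minimal sketch of how such an intermediate tier could look from the serving layer's point of view, the class below spills cold KV blocks to a flash-backed store and prestages them back into HBM before reuse. All names and interfaces are assumptions for illustration, not the CMX or DOCA API:

```python
from collections import OrderedDict

class FlashStore:
    """Stand-in for an RDMA-reachable flash tier (illustrative only)."""
    def __init__(self):
        self._blocks = {}
    def write(self, block_id, kv_block):
        self._blocks[block_id] = kv_block
    def read(self, block_id):
        return self._blocks[block_id]

class TieredKVCache:
    """Two-tier KV cache: hot blocks in GPU HBM, reusable context spilled
    to flash and prestaged before the GPU needs it again."""
    def __init__(self, hbm_capacity_blocks, flash):
        self.hbm = OrderedDict()              # block_id -> KV block, LRU order
        self.capacity = hbm_capacity_blocks
        self.flash = flash

    def put(self, block_id, kv_block):
        self.hbm[block_id] = kv_block
        self.hbm.move_to_end(block_id)
        while len(self.hbm) > self.capacity:
            victim_id, victim = self.hbm.popitem(last=False)
            self.flash.write(victim_id, victim)   # evict coldest block to flash

    def prestage(self, block_ids):
        """Pull blocks a scheduled request will need back into HBM ahead of
        time, so the GPU does not stall on the flash tier."""
        for bid in block_ids:
            if bid not in self.hbm:
                self.put(bid, self.flash.read(bid))

    def get(self, block_id):
        if block_id not in self.hbm:
            self.put(block_id, self.flash.read(block_id))   # cold miss
        self.hbm.move_to_end(block_id)
        return self.hbm[block_id]
```

In practice the scheduler would call prestage() for the next batch's context blocks while the current batch is still decoding, which is what keeps the GPUs busy.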

Key Capabilities and Performance

  • 5x higher tokens-per-second throughput compared to traditional storage solutions
  • 5x greater power efficiency for inference workloads
  • Petabyte-scale context storage supporting scalable KV cache reuse
  • Seamless integration with the Vera Rubin AI factory architecture
  • RDMA-accelerated access for minimal latency and jitter
  • Stateless KV cache sharing across AI nodes to maximize utilization (see the sketch after this list)
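
One way to picture stateless sharing is content-addressed keys: any node serving the same model and token prefix derives the same block key and can fetch the cached context from the shared tier instead of recomputing it. The hashing scheme below is an illustrative assumption, not NVIDIA's actual format:

```python
import hashlib

def kv_block_key(model_id: str, token_ids: list[int]) -> str:
    """Content-addressed key for a KV cache block. The same model + prefix
    always yields the same key, so no per-node session state is needed.
    (Illustrative scheme; not NVIDIA's wire format.)"""
    h = hashlib.sha256()
    h.update(model_id.encode())
    h.update(b",".join(str(t).encode() for t in token_ids))
    return h.hexdigest()

# Two nodes serving the same system prompt compute identical keys and can
# reuse each other's cached prefix from the shared context tier.
assert kv_block_key("model-v1", [101, 7592, 2088]) == kv_block_key("model-v1", [101, 7592, 2088])
```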

Integration with Vera Rubin

CMX is integrated into the NVIDIA Vera Rubin platform, which organizes AI infrastructure into compute, networking, and storage racks as configurable building blocks for AI factories. The NVIDIA DOCA framework, Spectrum-X Ethernet, and orchestration tools like NVIDIA Dynamo coordinate context placement and workload scheduling across the memory hierarchy, enabling efficient inference scaling from pretraining through real-time agentic deployment.
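
A toy placement policy illustrates the kind of decision such orchestration makes across the memory hierarchy. The tiers, thresholds, and function below are assumptions for illustration, not how NVIDIA Dynamo or DOCA actually schedule placement:

```python
from enum import Enum

class Tier(Enum):
    HBM = "gpu-hbm"             # hottest: context for actively decoding requests
    CONTEXT = "context-memory"  # warm: reusable context parked in the flash tier
    BULK = "bulk-storage"       # cold: durability-oriented general-purpose storage

def place_context(actively_decoding: bool, expected_reuse_seconds: float | None) -> Tier:
    """Toy placement decision (illustrative thresholds, not NVIDIA's policy)."""
    if actively_decoding:
        return Tier.HBM
    if expected_reuse_seconds is not None and expected_reuse_seconds < 300:
        return Tier.CONTEXT     # likely to be reused soon: keep it fast to reach
    return Tier.BULK            # unlikely to be reused: store cheaply

print(place_context(False, 45))   # Tier.CONTEXT
```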