New Memory Tier for Long-Context AI Workloads
NVIDIA's CMX context memory storage platform addresses a critical bottleneck in modern AI inference: efficiently storing and serving Key-Value (KV) cache as context windows grow to millions of tokens. The system is integrated into NVIDIA's Vera Rubin platform as a dedicated infrastructure tier, sitting between GPU high-bandwidth memory (HBM) and general-purpose storage.
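The tiering idea above can be made concrete with a small sketch: serve a KV block from GPU HBM if it is resident, otherwise pull it back from the context tier, and fall back to recomputing the prefill on a full miss. All class and method names here are illustrative assumptions, not NVIDIA APIs, and the eviction policy is deliberately the simplest possible one.

```python
# Hypothetical sketch of the three-tier hierarchy the article describes:
# GPU HBM (fast, scarce) -> CMX context tier (large, shared) -> recompute.
# Names and policies are illustrative, not NVIDIA APIs.

class TieredKVCache:
    def __init__(self, hbm_capacity: int):
        self.hbm = {}            # hot KV blocks resident in GPU HBM
        self.cmx = {}            # warm KV blocks offloaded to the context tier
        self.hbm_capacity = hbm_capacity

    def get(self, block_id: str):
        """Serve a KV block from the fastest tier that holds it."""
        if block_id in self.hbm:
            return self.hbm[block_id]
        if block_id in self.cmx:
            # Promote: fetch over the RDMA-class fabric back into HBM.
            self._evict_if_full()
            self.hbm[block_id] = self.cmx[block_id]
            return self.hbm[block_id]
        return None              # miss: caller must recompute the prefill

    def put(self, block_id: str, kv_block):
        self._evict_if_full()
        self.hbm[block_id] = kv_block

    def _evict_if_full(self):
        # Simplest possible policy: demote the oldest HBM block to CMX
        # rather than discarding it, so later turns can reuse it.
        while len(self.hbm) >= self.hbm_capacity:
            victim = next(iter(self.hbm))
            self.cmx[victim] = self.hbm.pop(victim)
```

The key design point the sketch captures is that eviction from HBM demotes to the context tier instead of deleting, so a later turn of the same conversation pays a fabric fetch rather than a full prefill recompute.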
Architecture and Key Benefits
CMX leverages several NVIDIA technologies:
- BlueField-4 processors provide intelligent data processing and acceleration
- Spectrum-X Ethernet delivers low-latency, high-bandwidth RDMA connectivity for consistent access to shared KV cache
- NVIDIA DOCA framework enables optimized context placement and KV block management
Compared with traditional storage, the platform delivers 5x higher tokens-per-second (TPS) and 5x greater power efficiency, reducing cost-per-token for long-context inference while maximizing GPU utilization.
Solving Agentic AI Scaling Challenges
As AI systems evolve from stateless chatbots to complex, multi-turn agentic workflows running trillion-parameter models, the KV cache has become a critical form of long-term memory. Growing context windows (now reaching millions of tokens) create a dilemma: GPU HBM is scarce and expensive, while general-purpose storage tiers are optimized for durability rather than for ephemeral, latency-sensitive AI workloads.
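A back-of-envelope calculation shows why million-token contexts overflow HBM. The model dimensions below are assumptions for illustration (roughly in line with a 70B-class transformer using grouped-query attention), not figures from NVIDIA:

```python
# Back-of-envelope KV-cache sizing. Model dimensions are assumed
# (70B-class, grouped-query attention), not NVIDIA figures.

def kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # Factor of 2 accounts for the separate Key and Value tensors per layer;
    # dtype_bytes=2 assumes FP16/BF16 cache entries.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

print(f"{kv_cache_bytes(1_000_000) / 2**30:.0f} GiB per 1M-token context")
# -> 305 GiB per 1M-token context
```

At roughly 305 GiB for a single million-token context under these assumptions, even one conversation exceeds the HBM of any single GPU, and a fleet serving many concurrent agent sessions quickly reaches the petabyte scale the article cites.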
CMX fills this gap by providing petabyte-scale context storage specifically optimized for latency-sensitive, reusable inference context. The system enables efficient KV cache sharing across AI nodes within a pod-level architecture, reducing inference stalls and improving responsiveness for agentic, long-context workloads.
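One way to picture cross-node KV cache sharing is content addressing: if every node derives a block's key deterministically from the token prefix it covers, any node that has already prefilled that prefix publishes blocks other nodes can find. The scheme and names below are hypothetical, sketched for illustration; they are not the CMX protocol.

```python
# Illustrative sketch of sharing KV cache across inference nodes by
# content-addressing blocks with a hash of the token prefix they cover.
# The scheme, granularity, and names are assumptions, not the CMX protocol.
import hashlib

BLOCK_TOKENS = 16  # tokens covered by one KV block (assumed granularity)

def block_key(prefix_tokens):
    """Deterministic key for the KV block ending a token prefix, so any
    node that computed the same prefix derives the same key."""
    return hashlib.sha256(str(tuple(prefix_tokens)).encode()).hexdigest()

def reusable_prefix(prompt_tokens, shared_store):
    """Count how many leading tokens are covered by blocks already in
    the shared store, i.e. how much prefill this node can skip."""
    hits = 0
    for i in range(0, len(prompt_tokens) - BLOCK_TOKENS + 1, BLOCK_TOKENS):
        if block_key(prompt_tokens[: i + BLOCK_TOKENS]) not in shared_store:
            break
        hits += 1
    return hits * BLOCK_TOKENS
```

Because keys are derived from prefix content rather than from node identity, a request landing on any node in the pod can discover and reuse context another node produced, which is the stall-reduction effect described above.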
Integration with Vera Rubin AI Factories
CMX is deployed within the NVIDIA Vera Rubin platform's pod-level architecture, which organizes AI infrastructure into compute, networking, and storage racks as configurable building blocks. The platform supports coordination across the memory hierarchy using orchestration tools like NVIDIA Dynamo and NIXL, ensuring optimal context placement and workload scheduling.