New Context Memory Storage Tier for AI Inference
NVIDIA announced the CMX Context Memory Storage platform, a purpose-built storage infrastructure for the growing demands of long-context agentic AI workloads. Integrated into the Vera Rubin AI platform, CMX serves as an optimized middle tier between GPU high-bandwidth memory (HBM) and general-purpose storage, specifically engineered to handle ephemeral, latency-sensitive KV cache at petabyte scale.
Key Performance Improvements
The platform delivers significant performance and efficiency gains:
- 5x higher tokens-per-second (TPS) compared to traditional storage
- 5x greater power efficiency than conventional storage solutions
- RDMA-accelerated context storage with predictable, low-latency connectivity via NVIDIA Spectrum-X Ethernet
- Seamless integration with the NVIDIA Vera Rubin architecture
Technical Architecture
CMX is powered by NVIDIA BlueField-4 data processing units and leverages the NVIDIA DOCA framework for coordination across the memory hierarchy. The platform enables:
- Stateless KV cache sharing across AI inference nodes
- Efficient context placement and block management through orchestration tools like NVIDIA Dynamo and NIXL
- Seamless extension of GPU memory across the POD infrastructure
- Optimized handling of ephemeral, AI-native workloads that traditional durability-focused storage is not designed for
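To make the tiering idea concrete, here is a minimal sketch of how a KV cache might be placed across a memory hierarchy like the one described above: GPU HBM as the hot tier, a CMX-like context tier in the middle, and general-purpose storage as the cold tier. All class and method names are hypothetical; NVIDIA has not published CMX APIs, and real placement is handled by orchestration layers such as Dynamo and NIXL.

```python
# Hypothetical three-tier KV-cache placement sketch. Names and policy
# (LRU demotion) are illustrative assumptions, not CMX's actual design.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_capacity, ctx_capacity):
        self.hbm = OrderedDict()   # hot tier: GPU HBM (fast, scarce)
        self.ctx = OrderedDict()   # middle tier: context storage (CMX-like)
        self.cold = {}             # cold tier: general-purpose storage
        self.hbm_capacity = hbm_capacity
        self.ctx_capacity = ctx_capacity

    def put(self, block_id, kv_block):
        """Place a KV block in HBM, demoting LRU blocks down the tiers."""
        self.hbm[block_id] = kv_block
        self.hbm.move_to_end(block_id)
        while len(self.hbm) > self.hbm_capacity:
            bid, blk = self.hbm.popitem(last=False)  # evict LRU from HBM
            self.ctx[bid] = blk                      # demote to context tier
        while len(self.ctx) > self.ctx_capacity:
            bid, blk = self.ctx.popitem(last=False)  # spill to cold storage
            self.cold[bid] = blk

    def get(self, block_id):
        """Fetch a block, promoting it back to HBM on a lower-tier hit."""
        for tier in (self.hbm, self.ctx, self.cold):
            if block_id in tier:
                blk = tier.pop(block_id)
                self.put(block_id, blk)  # promote on access
                return blk
        return None  # cache miss: the block must be recomputed
```

The design choice this illustrates is the one the announcement emphasizes: KV blocks are ephemeral, so eviction to a fast middle tier (rather than durable, general-purpose storage) keeps promotion latency low enough to avoid inference stalls.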
Use Case: Agentic AI and Long-Context Inference
As models scale toward trillions of parameters and context windows expand to millions of tokens, KV cache reuse becomes critical for both performance and cost efficiency. CMX enables agents to maintain long-term memory across multiple turns, sessions, and tools without exhausting scarce GPU HBM or falling back on inefficient general-purpose storage. This is particularly important for agentic scaling workflows where context persists and is continuously reused rather than discarded after each request.
Developers and enterprises can now implement truly scalable long-context inference infrastructure with improved GPU utilization, reduced inference stalls, and lower total cost of ownership.