NVIDIA
NVIDIA launches BlueField-4-powered CMX storage tier; claims 5x higher tokens-per-second for long-context AI inference
· release · platform · feature · performance · developer.nvidia.com ↗

Overview

NVIDIA announced the CMX (Context Memory Storage) platform, a specialized storage tier designed to address the growing demands of long-context and agentic AI inference workloads. Integrated within the Vera Rubin platform, CMX combines NVIDIA BlueField-4 data processing units (DPUs) with Spectrum-X Ethernet to provide petabyte-scale, low-latency context storage optimized for managing ephemeral key-value (KV) caches.

The Challenge

As AI models scale—with context windows stretching to millions of tokens and models reaching trillions of parameters—existing memory hierarchies face critical bottlenecks. Traditional approaches force organizations to choose between scarce GPU high-bandwidth memory (HBM) and general-purpose storage tiers optimized for durability, not for serving ephemeral, latency-sensitive inference context. This drives up power consumption, increases cost-per-token, and leaves expensive GPUs underutilized.
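A back-of-the-envelope calculation shows why GPU HBM alone cannot hold such contexts. The model parameters below (80 layers, 8 KV heads, 128-dimension heads, fp16 precision) are illustrative assumptions for this sketch, not figures from the announcement:

```python
# Illustrative KV-cache sizing for a long-context model.
# All parameter values here are assumptions for the sketch.
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_ELEM = 2          # fp16
CONTEXT_TOKENS = 1_000_000  # a million-token context window

# Each layer stores both a K and a V tensor, hence the factor of 2.
per_token_bytes = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
total_gib = per_token_bytes * CONTEXT_TOKENS / 2**30

print(f"{per_token_bytes} bytes/token, {total_gib:.0f} GiB per sequence")
```

At roughly 320 KiB per token, a single million-token sequence needs on the order of 300 GiB of KV cache, which is more HBM than any single current GPU carries and helps explain the appeal of a fast external context tier.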

Key Capabilities

CMX establishes a new memory tier that:

  • Delivers 5x higher tokens-per-second throughput and 5x better power efficiency than traditional storage solutions
  • Provides RDMA-accelerated connectivity via Spectrum-X Ethernet for predictable, low-latency, high-bandwidth access to shared KV cache
  • Enables scalable KV cache reuse across AI nodes and inference services
  • Supports long-context agentic workflows by treating KV cache as persistent long-term memory across multiple turns and sessions
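The KV-cache-reuse idea in the bullets above can be sketched as a prefix-keyed store: when a new request shares a prompt prefix with an earlier one, the GPU can skip recomputing that prefix. This is a hypothetical toy sketch; NVIDIA has not published a CMX API, and `KVCacheStore` and its methods are invented names:

```python
import hashlib

class KVCacheStore:
    """Hypothetical shared KV-cache store keyed by token-prefix hash.

    Sketches the cross-node reuse idea only; real CMX block
    management is not publicly documented.
    """

    def __init__(self):
        self._blocks = {}  # prefix hash -> opaque KV blob

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()

    def put(self, tokens, kv_blob):
        """Publish the KV cache computed for this exact token prefix."""
        self._blocks[self._key(tokens)] = kv_blob

    def longest_prefix_hit(self, tokens):
        """Walk from the full prompt down to shorter prefixes; the first
        hit is the longest prefix the GPU can avoid recomputing."""
        for cut in range(len(tokens), 0, -1):
            blob = self._blocks.get(self._key(tokens[:cut]))
            if blob is not None:
                return cut, blob
        return 0, None
```

For example, an inference service could `put` the KV blocks for a long shared system prompt once, and every later request beginning with that prompt would get a `longest_prefix_hit` covering it, paying attention compute only for the new tokens.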

Integration & Architecture

CMX is purpose-built within the NVIDIA STX reference architecture and coordinates with existing tools such as NVIDIA Dynamo and NIXL for context placement, KV block management, and workload scheduling. The system extends GPU memory across infrastructure pods, letting multiple AI services share cached context without holding that state locally and reducing inference stalls for mission-critical agentic applications.
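As a rough illustration of tiered context placement (not the actual Dynamo/NIXL scheduling logic, which the announcement does not describe), a policy might keep the hottest KV blocks in scarce HBM, keep warm blocks in host memory, and spill everything else to a CMX-like shared tier. The `place_blocks` function and its capacity parameters are hypothetical:

```python
def place_blocks(access_log, hbm_capacity, host_capacity):
    """Toy tiering policy: rank KV blocks by access frequency and
    assign the hottest to HBM, the next warmest to host memory,
    and the remainder to a CMX-like external tier."""
    # Count accesses per block.
    counts = {}
    for block in access_log:
        counts[block] = counts.get(block, 0) + 1
    # Hottest blocks first (ties broken arbitrarily).
    ranked = sorted(counts, key=counts.get, reverse=True)
    placement = {}
    for i, block in enumerate(ranked):
        if i < hbm_capacity:
            placement[block] = "hbm"
        elif i < hbm_capacity + host_capacity:
            placement[block] = "host"
        else:
            placement[block] = "cmx"
    return placement
```

A real scheduler would also weigh block size, RDMA transfer latency, and sharing across nodes, but the tier ordering (HBM, then host memory, then external storage) mirrors the memory hierarchy the article describes.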