NVIDIA unveils Groq 3 LPX inference accelerator; delivers 35x throughput-per-watt improvement for agentic AI
· release · platform · performance · api · developer.nvidia.com ↗

NVIDIA Groq 3 LPX: Low-Latency Inference for Agentic AI

NVIDIA has announced the Groq 3 LPX, a new rack-scale inference accelerator designed to power next-generation agentic AI systems that require both high throughput and ultra-low latency. The system is co-designed with the Vera Rubin NVL72 GPU to create a heterogeneous inference architecture in which each component handles the workloads it is optimized for.

Key Specifications and Performance

The LPX rack-scale system carries the following headline specifications:

  • 315 PFLOPS of FP8 inference compute
  • 128 GB total SRAM capacity
  • 40 PB/s on-chip SRAM bandwidth
  • 640 TB/s scale-up bandwidth across 256 chips
  • 35x higher inference throughput per megawatt compared to alternatives
  • 10x more revenue opportunity for trillion-parameter models
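The published figures are rack-level totals. A quick sketch of the arithmetic shows what they imply per chip; the per-chip breakdown below is our own division of the rack numbers across the 256 accelerators, not an NVIDIA-published figure.

```python
# Deriving approximate per-chip figures from the rack-level specs above.
# Rack totals come from the announcement; per-chip values are derived.

RACK_CHIPS = 256          # Groq 3 LPU accelerators per rack
FP8_PFLOPS = 315          # rack-level FP8 inference compute
SRAM_GB = 128             # total SRAM capacity across the rack
SRAM_BW_PBPS = 40         # aggregate on-chip SRAM bandwidth, PB/s
SCALEUP_TBPS = 640        # scale-up bandwidth across all chips, TB/s

per_chip = {
    "fp8_tflops": FP8_PFLOPS * 1000 / RACK_CHIPS,      # ~1230 TFLOPS
    "sram_mb": SRAM_GB * 1024 / RACK_CHIPS,            # 512 MB
    "sram_bw_tbps": SRAM_BW_PBPS * 1000 / RACK_CHIPS,  # ~156 TB/s
    "scaleup_tbps": SCALEUP_TBPS / RACK_CHIPS,         # 2.5 TB/s
}

for name, value in per_chip.items():
    print(f"{name}: {value:g}")
```

The standout ratio is bandwidth to capacity: roughly 156 TB/s of SRAM bandwidth against 512 MB of SRAM per chip, which is consistent with the system's emphasis on keeping latency-critical weights resident in on-chip memory.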

The system is built around 256 interconnected Groq 3 LPU accelerators arranged in 32 liquid-cooled 1U compute trays, emphasizing deterministic execution and high-speed communication to minimize inference jitter.

Architecture and Heterogeneous Serving

LPX operates as a complement to Vera Rubin NVL72 within the broader Vera Rubin platform. The heterogeneous architecture distributes inference workloads strategically:

  • Prefill and decode attention run on Vera Rubin NVL72 GPUs for high throughput
  • Latency-sensitive FFN and MoE expert execution run on LPX for fast token generation
  • NVIDIA Dynamo orchestrates request routing and disaggregated serving to maintain responsiveness
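The split above can be sketched as a simple routing policy. The `Stage` enum and pool names below are hypothetical stand-ins for illustration; NVIDIA Dynamo's actual routing API is not shown here.

```python
# Illustrative sketch of the heterogeneous-serving split described above.
# Stage names and pool identifiers are hypothetical, chosen to mirror
# the workload partition in the article.
from enum import Enum, auto

class Stage(Enum):
    PREFILL = auto()           # long-context prompt processing
    DECODE_ATTENTION = auto()  # attention over the KV cache
    FFN_MOE = auto()           # feed-forward / MoE expert execution

def route(stage: Stage) -> str:
    """Map an inference stage to the hardware pool that serves it."""
    # Throughput-oriented stages run on the Vera Rubin NVL72 GPUs;
    # latency-sensitive FFN and MoE execution runs on the LPX rack.
    if stage in (Stage.PREFILL, Stage.DECODE_ATTENTION):
        return "vera-rubin-nvl72"
    return "groq3-lpx"
```

In a disaggregated deployment, a policy like this runs per request stage rather than per request, so a single token's forward pass can cross both pools.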

Use Cases and Vision

The combination enables new capabilities for emerging agentic workloads:

  • Multi-agent systems that coordinate to accomplish complex tasks
  • Speed-of-thought computing approaching 1,000 tokens per second per user
  • Large-context processing with stable, predictable per-token latency even at high concurrency
  • Speculative decoding for LLMs alongside multi-agent coordination

The LPX integrates with NVIDIA's MGX ETL rack architecture, allowing data centers to deploy dedicated low-latency inference paths alongside existing Vera Rubin NVL72 infrastructure within a unified design.