NVIDIA Groq 3 LPX: Low-Latency Inference at Rack Scale
NVIDIA has unveiled the Groq 3 LPX, a new rack-scale inference accelerator designed to power next-generation agentic AI systems. Built to operate alongside the Vera Rubin NVL72 GPU, the LPX targets emerging workloads that demand both high throughput and ultra-low, predictable latency, enabling token generation speeds approaching 1,000 tokens per second per user.
The system architecture revolves around 256 interconnected Groq 3 LPU accelerators operating deterministically through compiler-orchestrated execution. Key specifications include:
- 315 PFLOPS of FP8 compute capacity
- 40 PB/s on-chip SRAM bandwidth
- 640 TB/s scale-up chip-to-chip bandwidth
- 128 GB total distributed SRAM
- Integrated into NVIDIA MGX ETL rack infrastructure
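A rough, bandwidth-bound estimate suggests why on-chip SRAM bandwidth is the lever behind the per-user token rate. Only the 40 PB/s figure comes from the spec list above; the model footprint, FP8 weight size, and user count below are illustrative assumptions:

```python
# Back-of-envelope: memory-bandwidth-bound decode rate.
# Assumption (not from the announcement): each generated token must
# stream every active parameter once from SRAM.

SRAM_BW_BYTES_PER_S = 40e15  # 40 PB/s aggregate on-chip bandwidth (from spec list)
ACTIVE_PARAMS = 40e9         # hypothetical: 40B active parameters (e.g. one MoE slice)
BYTES_PER_PARAM = 1          # FP8 weights, 1 byte each

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM
aggregate_tokens_per_s = SRAM_BW_BYTES_PER_S / bytes_per_token
print(f"aggregate ceiling: {aggregate_tokens_per_s:,.0f} tokens/s")
# -> aggregate ceiling: 1,000,000 tokens/s

# Spread across ~1,000 concurrent users, this ceiling leaves roughly
# 1,000 tokens/s per user before compute or interconnect limits bind.
print(f"per-user ceiling at 1,000 users: {aggregate_tokens_per_s / 1000:,.0f} tokens/s")
```

Under these assumed numbers the bandwidth ceiling lines up with the quoted 1,000 tokens-per-second-per-user target; a larger active footprint or lower-precision weights would shift it proportionally.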
Heterogeneous Serving Architecture
The LPX does not replace Vera Rubin NVL72 GPUs; it complements them in a heterogeneous serving strategy. Rubin handles prefill, decode attention, and throughput-intensive workloads, while the LPX specializes in latency-sensitive decode operations, specifically FFN (feed-forward network) and MoE (mixture-of-experts) execution. This division of labor lets data centers maintain high aggregate token production while delivering responsive interactive experiences.
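The split-decode idea can be sketched as follows. All class and function names here are illustrative stand-ins, not NVIDIA or Groq APIs; the point is only the routing of each layer's sub-blocks to different hardware pools:

```python
# Sketch of heterogeneous decode: attention stays on the GPU pool,
# while each layer's FFN/MoE block is dispatched to the
# latency-optimized accelerator pool. Names are hypothetical.

from dataclasses import dataclass

@dataclass
class DecodeStep:
    token_id: int
    layer: int

class GpuPool:
    """Throughput-oriented pool (stand-in for Rubin GPUs)."""
    def attention(self, step: DecodeStep) -> str:
        # Batched KV-cache attention for the current token.
        return f"attn(layer={step.layer})"

class LpuPool:
    """Latency-oriented pool (stand-in for the LPX accelerators)."""
    def ffn_moe(self, step: DecodeStep, attn_out: str) -> str:
        # Deterministic, compiler-scheduled FFN/MoE execution.
        return f"ffn({attn_out})"

def decode_layer(step: DecodeStep, gpu: GpuPool, lpu: LpuPool) -> str:
    attn_out = gpu.attention(step)      # attention on the GPU pool
    return lpu.ffn_moe(step, attn_out)  # FFN/MoE offloaded to the LPX pool

print(decode_layer(DecodeStep(token_id=7, layer=0), GpuPool(), LpuPool()))
# prints "ffn(attn(layer=0))"
```

The design choice this illustrates: attention is dominated by KV-cache reads that batch well on GPUs, while the FFN/MoE matrix multiplies are where deterministic, SRAM-resident execution pays off in tail latency.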
NVIDIA claims this pairing delivers up to 35x higher inference throughput per megawatt and 10x more revenue opportunity for trillion-parameter models compared to prior configurations.
Enabling Multi-Agent and Agentic Systems
The architecture targets use cases where speed-of-thought computing matters: multi-agent coordination, speculative decoding, and continuous reasoning loops that operate faster than traditional conversational turn-taking. With stable, predictable per-token latency even under high concurrency, the LPX is positioned for mission-critical interactive AI services where tail latency directly impacts user experience.
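Speculative decoding, one of the workloads named above, is a generic technique rather than anything specific to this hardware: a fast draft model proposes a run of tokens and a slower target model accepts the longest agreeing prefix, so one verification pass can emit several tokens. A minimal sketch with toy, hard-coded models:

```python
# Minimal speculative-decoding sketch (generic technique; the models
# below are toy stand-ins with fixed outputs, purely for illustration).

def draft_model(prefix):
    # Hypothetical cheap draft model: proposes 3 tokens per step.
    return ["fast", "tokens", "every", "step"][:3]

def target_model(prefix, proposed):
    # Hypothetical target model: accepts proposals until one disagrees,
    # then substitutes its own token and stops.
    truth = ["fast", "tokens", "beat", "latency"]
    accepted = []
    for i, tok in enumerate(proposed):
        if tok == truth[len(prefix) + i]:
            accepted.append(tok)
        else:
            accepted.append(truth[len(prefix) + i])  # correction token
            break
    return accepted

prefix = []
proposed = draft_model(prefix)              # ["fast", "tokens", "every"]
accepted = target_model(prefix, proposed)
print(accepted)
# prints ['fast', 'tokens', 'beat'] -- three tokens from one verify pass
```

Because the draft pass is cheap and the verify pass is amortized over several tokens, low and predictable per-token latency on the verifier directly multiplies end-to-end generation speed, which is the fit the article is claiming.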
The system is available as part of the broader NVIDIA Vera Rubin platform and can be deployed alongside standard data center infrastructure using the MGX ETL rack design.