NVIDIA Introduces Groq 3 LPX Inference Accelerator
NVIDIA announced the Groq 3 LPX, a new rack-scale inference accelerator purpose-built for low-latency, high-throughput inference in agentic AI systems. Co-designed with the NVIDIA Vera Rubin NVL72 platform, the LPX delivers up to 35x higher inference throughput per megawatt and up to 10x greater revenue opportunity when serving trillion-parameter models.
Architecture and Key Specifications
The LPX system is built around 256 interconnected NVIDIA Groq 3 LPU accelerators, each optimized for deterministic execution with predictable latency. Key specs are listed below; a short sketch after the list derives the per-chip figures they imply:
- 315 PFLOPS of FP8 compute at rack scale
- 40 PB/s of aggregate on-chip SRAM bandwidth across the rack
- 640 TB/s scale-up bandwidth across 256 chips
- 128 GB total SRAM capacity
- Deterministic, compiler-orchestrated execution with explicit data movement
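For intuition, dividing the rack-level totals evenly across the 256 chips gives the per-chip figures below. This is a back-of-the-envelope sketch under the assumption of a uniform split, not an NVIDIA-published per-chip breakdown:

```python
# Back-of-the-envelope per-chip figures implied by the rack-level specs above.
# The even division across 256 chips is an assumption, not a published breakdown.

CHIPS_PER_RACK = 256
RACK_FP8_PFLOPS = 315        # FP8 compute at rack scale
RACK_SRAM_BW_PBPS = 40       # aggregate on-chip SRAM bandwidth
RACK_SCALEUP_BW_TBPS = 640   # scale-up fabric bandwidth across 256 chips
RACK_SRAM_GB = 128           # total SRAM capacity

per_chip_fp8_tflops = RACK_FP8_PFLOPS * 1000 / CHIPS_PER_RACK         # ~1,230 TFLOPS
per_chip_sram_bw_tbps = RACK_SRAM_BW_PBPS * 1000 / CHIPS_PER_RACK     # ~156 TB/s
per_chip_scaleup_gbps = RACK_SCALEUP_BW_TBPS * 1000 / CHIPS_PER_RACK  # 2,500 GB/s
per_chip_sram_gb = RACK_SRAM_GB / CHIPS_PER_RACK                      # 0.5 GB

print(f"FP8 compute per chip:    {per_chip_fp8_tflops:,.0f} TFLOPS")
print(f"SRAM bandwidth per chip: {per_chip_sram_bw_tbps:.0f} TB/s")
print(f"Scale-up BW per chip:    {per_chip_scaleup_gbps:,.0f} GB/s")
print(f"SRAM per chip:           {per_chip_sram_gb:.1f} GB")
```

At roughly 0.5 GB of SRAM per chip, model weights must be sharded across the full rack, which is consistent with the compiler-orchestrated, explicit data movement noted above.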
Heterogeneous Inference Architecture
The LPX is designed to pair with Vera Rubin NVL72 GPUs in a heterogeneous serving configuration. Vera Rubin GPUs handle prefill and decode attention, while LPX accelerators handle the latency-sensitive FFN and MoE expert execution during the decode loop. This workload split maintains high AI factory throughput while delivering the responsive, predictable per-token latency that interactive agentic systems require.
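A minimal sketch of how one decode step might be split across the two pools is shown below. The rubin_gpu and lpx handles and their methods are hypothetical placeholders, not a real NVIDIA API; the sketch only illustrates the attention-on-GPU, FFN-and-MoE-on-LPX division described above:

```python
# Minimal sketch of the heterogeneous decode loop described above.
# The device handles (rubin_gpu, lpx) and their methods are hypothetical
# placeholders for whatever runtime API actually drives each pool.

from dataclasses import dataclass

@dataclass
class DecodeState:
    tokens: list      # tokens generated so far
    kv_cache: object  # attention KV cache, resident on the GPU pool

def decode_step(state: DecodeState, rubin_gpu, lpx) -> int:
    """Generate one token, splitting the layer stack across device pools."""
    # 1. Attention over the KV cache runs on Vera Rubin GPUs, which hold
    #    the large, bandwidth-hungry KV state in HBM.
    hidden = rubin_gpu.attention(state.tokens[-1], state.kv_cache)

    # 2. The latency-critical FFN / MoE expert layers run on the LPX pool,
    #    whose weights sit in SRAM for deterministic, low-latency execution.
    hidden = lpx.ffn_and_moe(hidden)

    # 3. Sampling and the KV-cache update complete back on the GPU side.
    token = rubin_gpu.sample_and_update(hidden, state.kv_cache)
    state.tokens.append(token)
    return token
```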
NVIDIA's Dynamo orchestration layer routes requests between Rubin and LPX, classifying each request to determine optimal placement and keeping tail latency stable even at high concurrency.
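The announcement does not detail Dynamo's placement policy, but a toy classifier in the spirit of that description might look like the following. The Request fields and thresholds are illustrative assumptions, not Dynamo's actual API:

```python
# Toy routing policy in the spirit of the Dynamo description above.
# The Request fields and the thresholds are illustrative assumptions,
# not Dynamo's actual API or policy.

from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int       # prefill length
    interactive: bool        # user-facing agent step vs. batch job
    target_tok_per_s: float  # per-user latency target

def place(req: Request) -> str:
    """Pick a serving path for one request."""
    if req.prompt_tokens > 8192 and not req.interactive:
        return "rubin-prefill"     # long batch prefill: maximize throughput
    if req.interactive and req.target_tok_per_s >= 1000:
        return "rubin+lpx-decode"  # latency-critical decode: use the LPX path
    return "rubin-decode"          # default GPU decode path

# Example: an interactive agent step with a tight latency target
print(place(Request(prompt_tokens=2048, interactive=True, target_tok_per_s=1200)))
```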
Why This Matters
As generative AI advances toward "speed of thought" computing (1,000+ tokens per second per user), infrastructure must support both high throughput and low latency. Multi-agent systems and complex agentic workflows demand this dual capability. The combination of Vera Rubin NVL72 and LPX enables data centers to deploy a dedicated low-latency inference path within a common MGX ETL rack architecture, supporting next-generation AI applications without sacrificing aggregate throughput.
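To make the latency side of that target concrete: 1,000 tokens per second per user leaves a budget of one millisecond per decode step, which the attention, expert, and interconnect stages must share. The stage costs below are illustrative assumptions, not measured figures:

```python
# What a 1,000 token/s per-user target means for the decode loop.
# All numbers besides the 1,000 tok/s target are illustrative assumptions.

target_tok_per_s = 1000
step_budget_ms = 1000 / target_tok_per_s  # 1.0 ms per decode step

# If attention (GPU), expert FFN (LPX), and two interconnect hops share
# that budget, each stage gets only a fraction of a millisecond:
attention_ms, transfer_ms = 0.4, 0.1      # assumed stage costs
ffn_budget_ms = step_budget_ms - attention_ms - 2 * transfer_ms

print(f"Per-step budget:                  {step_budget_ms:.2f} ms")
print(f"FFN/MoE budget after attn + hops: {ffn_budget_ms:.2f} ms")
```

At budgets this tight, even small scheduling jitter consumes a meaningful share of each step, which is the case the article's deterministic, compiler-orchestrated execution is meant to address.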