NVIDIA Introduces Groq 3 LPX Inference Accelerator
NVIDIA announced the Groq 3 LPX, a new rack-scale inference accelerator purpose-built for low-latency, high-throughput inference in agentic AI systems. Co-designed with the NVIDIA Vera Rubin NVL72 platform, the LPX delivers up to 35x higher inference throughput per megawatt and up to 10x greater revenue opportunity when serving trillion-parameter models.
Architecture and Key Specifications
The LPX system is built around 256 interconnected NVIDIA Groq 3 LPU accelerators, each optimized for deterministic execution with predictable latency. Key specs are listed below; a short sketch after the list derives the per-chip figures they imply:
- 315 PFLOPS of FP8 compute at rack scale
- 40 PB/s of aggregate on-chip SRAM bandwidth across the rack
- 640 TB/s scale-up bandwidth across 256 chips
- 128 GB total SRAM capacity
- Deterministic, compiler-orchestrated execution with explicit data movement
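For intuition, dividing the rack-level totals evenly across the 256 chips gives the per-chip figures below. This is a back-of-the-envelope sketch under the assumption of a uniform split, not an NVIDIA-published per-chip breakdown:

```python
# Back-of-the-envelope per-chip figures implied by the rack-level specs above.
# The even division across 256 chips is an assumption, not a published breakdown.

CHIPS_PER_RACK = 256
RACK_FP8_PFLOPS = 315        # FP8 compute at rack scale
RACK_SRAM_BW_PBPS = 40       # aggregate on-chip SRAM bandwidth
RACK_SCALEUP_BW_TBPS = 640   # scale-up fabric bandwidth across 256 chips
RACK_SRAM_GB = 128           # total SRAM capacity

per_chip_fp8_tflops = RACK_FP8_PFLOPS * 1000 / CHIPS_PER_RACK         # ~1,230 TFLOPS
per_chip_sram_bw_tbps = RACK_SRAM_BW_PBPS * 1000 / CHIPS_PER_RACK     # ~156 TB/s
per_chip_scaleup_gbps = RACK_SCALEUP_BW_TBPS * 1000 / CHIPS_PER_RACK  # 2,500 GB/s
per_chip_sram_gb = RACK_SRAM_GB / CHIPS_PER_RACK                      # 0.5 GB

print(f"FP8 compute per chip:    {per_chip_fp8_tflops:,.0f} TFLOPS")
print(f"SRAM bandwidth per chip: {per_chip_sram_bw_tbps:.0f} TB/s")
print(f"Scale-up BW per chip:    {per_chip_scaleup_gbps:,.0f} GB/s")
print(f"SRAM per chip:           {per_chip_sram_gb:.1f} GB")
```

At roughly 0.5 GB of SRAM per chip, model weights must be sharded across the full rack, which is consistent with the compiler-orchestrated, explicit data movement noted above.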
Heterogeneous Inference Architecture
The LPX is designed to pair with Vera Rubin NVL72 GPUs in a heterogeneous serving configuration. Vera Rubin GPUs handle prefill and decode attention, while LPX accelerators handle the latency-sensitive FFN and MoE expert execution during the decode loop. This workload split maintains high AI factory throughput while delivering the responsive, predictable per-token latency that interactive agentic systems require.
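A minimal sketch of how one decode step might be split across the two pools is shown below. The rubin_gpu and lpx handles and their methods are hypothetical placeholders, not a real NVIDIA API; the sketch only illustrates the attention-on-GPU, FFN-and-MoE-on-LPX division described above:

```python
# Minimal sketch of the heterogeneous decode loop described above.
# The device handles (rubin_gpu, lpx) and their methods are hypothetical
# placeholders for whatever runtime API actually drives each pool.

from dataclasses import dataclass

@dataclass
class DecodeState:
    tokens: list      # tokens generated so far
    kv_cache: object  # attention KV cache, resident on the GPU pool

def decode_step(state: DecodeState, rubin_gpu, lpx) -> int:
    """Generate one token, splitting the layer stack across device pools."""
    # 1. Attention over the KV cache runs on Vera Rubin GPUs, which hold
    #    the large, bandwidth-hungry KV state in HBM.
    hidden = rubin_gpu.attention(state.tokens[-1], state.kv_cache)

    # 2. The latency-critical FFN / MoE expert layers run on the LPX pool,
    #    whose weights sit in SRAM for deterministic, low-latency execution.
    hidden = lpx.ffn_and_moe(hidden)

    # 3. Sampling and the KV-cache update complete back on the GPU side.
    token = rubin_gpu.sample_and_update(hidden, state.kv_cache)
    state.tokens.append(token)
    return token
```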
NVIDIA's Dynamo orchestration layer routes requests between Rubin and LPX, classifying each request to determine optimal placement and keeping tail latency stable even at high concurrency.
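The announcement does not detail Dynamo's placement policy, but a toy classifier in the spirit of that description might look like the following. The Request fields and thresholds are illustrative assumptions, not Dynamo's actual API:

```python
# Toy routing policy in the spirit of the Dynamo description above.
# The Request fields and the thresholds are illustrative assumptions,
# not Dynamo's actual API or policy.

from dataclasses import dataclass

@dataclass
class Request:
    prompt_tokens: int       # prefill length
    interactive: bool        # user-facing agent step vs. batch job
    target_tok_per_s: float  # per-user latency target

def place(req: Request) -> str:
    """Pick a serving path for one request."""
    if req.prompt_tokens > 8192 and not req.interactive:
        return "rubin-prefill"     # long batch prefill: maximize throughput
    if req.interactive and req.target_tok_per_s >= 1000:
        return "rubin+lpx-decode"  # latency-critical decode: use the LPX path
    return "rubin-decode"          # default GPU decode path

# Example: an interactive agent step with a tight latency target
print(place(Request(prompt_tokens=2048, interactive=True, target_tok_per_s=1200)))
```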
Why This Matters
As generative AI advances toward "speed of thought" computing (1,000+ tokens per second per user), infrastructure must support both high throughput and low latency. Multi-agent systems and complex agentic workflows demand this dual capability. The combination of Vera Rubin NVL72 and LPX enables data centers to deploy a dedicated low-latency inference path within a common MGX ETL rack architecture, supporting next-generation AI applications without sacrificing aggregate throughput.
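To make the latency side of that target concrete: 1,000 tokens per second per user leaves a budget of one millisecond per decode step, which the attention, expert, and interconnect stages must share. The stage costs below are illustrative assumptions, not measured figures:

```python
# What a 1,000 token/s per-user target means for the decode loop.
# All numbers besides the 1,000 tok/s target are illustrative assumptions.

target_tok_per_s = 1000
step_budget_ms = 1000 / target_tok_per_s  # 1.0 ms per decode step

# If attention (GPU), expert FFN (LPX), and two interconnect hops share
# that budget, each stage gets only a fraction of a millisecond:
attention_ms, transfer_ms = 0.4, 0.1      # assumed stage costs
ffn_budget_ms = step_budget_ms - attention_ms - 2 * transfer_ms

print(f"Per-step budget:                  {step_budget_ms:.2f} ms")
print(f"FFN/MoE budget after attn + hops: {ffn_budget_ms:.2f} ms")
```

At budgets this tight, even small scheduling jitter consumes a meaningful share of each step, which is the case the article's deterministic, compiler-orchestrated execution is meant to address.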