NVIDIA Groq 3 LPX: Low-Latency Inference at Scale
NVIDIA has announced the Groq 3 LPX, a new rack-scale inference accelerator designed for agentic AI workloads that demand both high throughput and predictable low latency. The system is built around 256 interconnected NVIDIA Groq 3 LPU (Latency Processing Unit) accelerators, each optimized for fast, deterministic token generation in response to user interactions.
Architecture and Performance
The LPX's headline performance figures (a quick per-chip breakdown follows the list):
- 315 PFLOPS of FP8 compute across the rack
- 40 PB/s of on-chip SRAM bandwidth per chip
- 640 TB/s scale-up bandwidth across all 256 chips
- 128 GB of total SRAM capacity across the rack
- Up to 35x higher inference throughput per megawatt compared to alternatives
- Up to 10x greater revenue opportunity when serving trillion-parameter models
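For intuition, the rack-level figures above can be divided across the 256 chips. The arithmetic below is a back-of-envelope derivation from the published totals, not an NVIDIA-provided per-chip specification:

```python
# Back-of-envelope per-chip breakdown of the published rack-level figures.
# Derived values are estimates, not NVIDIA-published per-chip specs.

NUM_CHIPS = 256
RACK_FP8_PFLOPS = 315      # total FP8 compute across the rack
RACK_SRAM_GB = 128         # total SRAM capacity across the rack
RACK_SCALE_UP_TBPS = 640   # aggregate scale-up bandwidth

fp8_per_chip = RACK_FP8_PFLOPS / NUM_CHIPS            # ~1.23 PFLOPS per LPU
sram_per_chip_mb = RACK_SRAM_GB * 1024 / NUM_CHIPS    # 512 MB per LPU
scale_up_per_chip = RACK_SCALE_UP_TBPS / NUM_CHIPS    # 2.5 TB/s per LPU

print(f"FP8 compute per chip:  {fp8_per_chip:.2f} PFLOPS")
print(f"SRAM per chip:         {sram_per_chip_mb:.0f} MB")
print(f"Scale-up BW per chip:  {scale_up_per_chip:.2f} TB/s")
```

At roughly 512 MB of SRAM per chip, the weights of a trillion-parameter model clearly cannot reside on any single LPU, which is why they must be sharded across all 256 chips and streamed over the scale-up fabric.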
The system emphasizes deterministic, compiler-orchestrated execution and explicit data movement to minimize inference jitter and deliver stable, predictable per-token latency even at high concurrency levels.
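To illustrate what compiler-orchestrated, deterministic execution means in practice, the sketch below contrasts a statically scheduled instruction stream (every data movement and compute step fixed at compile time) with the jitter-prone dynamic dispatch it replaces. All names and cycle counts are hypothetical; this is a conceptual model, not the LPU's actual ISA or toolchain:

```python
# Conceptual sketch of compiler-orchestrated, deterministic execution.
# All names and cycle counts are hypothetical illustrations.

from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    op: str      # "load", "matmul", "store", ...
    cycles: int  # cost known statically at compile time

# The "compiler" emits a fixed schedule: every data movement and compute
# step, and its cycle cost, is decided before the program ever runs.
def compile_layer() -> list[Step]:
    return [
        Step("load_weights_sram", 40),
        Step("load_activations", 8),
        Step("matmul", 120),
        Step("store_activations", 8),
    ]

# Execution simply walks the schedule; with no caches, dynamic dispatch,
# or arbitration, per-token latency is identical on every run.
def total_latency_cycles(schedule: list[Step]) -> int:
    return sum(step.cycles for step in schedule)

print(total_latency_cycles(compile_layer()))  # always 176 cycles: zero jitter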
Heterogeneous Serving with Vera Rubin NVL72
LPX is co-designed to work alongside NVIDIA's Vera Rubin NVL72 rack-scale GPU platform, creating a heterogeneous inference architecture optimized for different workload phases. While Vera Rubin NVL72 handles prefill and decode attention (which benefit from general-purpose compute), the LPX accelerates the latency-sensitive portions of decode, including feedforward network (FFN) and mixture-of-experts (MoE) layers. NVIDIA Dynamo orchestrates this disaggregated serving, routing requests intelligently to maintain both high AI factory throughput and responsive interactive performance.
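NVIDIA has not published Dynamo routing code for this pairing, so the sketch below is a hypothetical illustration of the disaggregated-serving pattern described above: prefill and decode attention go to a GPU pool, while the latency-critical FFN/MoE portion of decode goes to an LPX pool. Every class and method name here is invented for illustration; none of them are real Dynamo APIs:

```python
# Hypothetical sketch of phase-based disaggregated serving.
# All names are invented for illustration, not real Dynamo APIs.

from enum import Enum, auto

class Phase(Enum):
    PREFILL = auto()           # full-prompt processing
    DECODE_ATTENTION = auto()  # attention over the growing KV cache
    DECODE_FFN_MOE = auto()    # latency-critical FFN / MoE layers

class Pool:
    """Stand-in for a pool of accelerator workers."""
    def __init__(self, name: str):
        self.name = name
    def submit(self, payload):
        return f"{self.name} <- {payload}"

class Router:
    """Routes each phase of a request to the appropriate accelerator pool."""
    def __init__(self, gpu_pool: Pool, lpx_pool: Pool):
        self.gpu_pool = gpu_pool  # e.g. Vera Rubin NVL72 workers
        self.lpx_pool = lpx_pool  # e.g. Groq 3 LPX workers

    def dispatch(self, phase: Phase, payload):
        # Prefill and decode attention favor general-purpose GPU compute;
        # the FFN/MoE portion of decode goes to the low-latency LPX pool.
        if phase is Phase.DECODE_FFN_MOE:
            return self.lpx_pool.submit(payload)
        return self.gpu_pool.submit(payload)

router = Router(Pool("NVL72"), Pool("LPX"))
print(router.dispatch(Phase.PREFILL, "prompt tokens"))
print(router.dispatch(Phase.DECODE_FFN_MOE, "hidden state"))
```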
Target Use Cases
The LPX enables next-generation AI applications including:
- Multi-agent systems requiring coordinated agent inference at scale
- Speed-of-thought computing supporting generation speeds approaching 1,000 tokens per second per user (see the latency-budget sketch after this list)
- Long-context agentic workloads maintaining performance across trillion-parameter models
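To make the 1,000 tokens-per-second figure concrete: at that rate, the end-to-end budget is about 1 ms per token, which must cover GPU-side attention, the FFN/MoE pass on the LPX, and any transfer between them. The per-stage split below is purely illustrative, not a published latency breakdown:

```python
# Per-token latency budget implied by 1,000 tokens/s per user.
# The per-stage split is purely illustrative, not NVIDIA data.

tokens_per_second = 1_000
budget_ms = 1_000 / tokens_per_second  # 1.0 ms end-to-end per token

# Hypothetical split of that budget across the heterogeneous pipeline:
stage_budget_ms = {
    "decode attention (NVL72)": 0.50,
    "FFN / MoE (LPX)":          0.35,
    "interconnect + routing":   0.15,
}

assert abs(sum(stage_budget_ms.values()) - budget_ms) < 1e-9
for stage, ms in stage_budget_ms.items():
    print(f"{stage:<28} {ms:.2f} ms")
```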
The combination of Vera Rubin NVL72 and LPX addresses a critical infrastructure gap: supporting continuous, reasoning-intensive agentic AI that demands both sustained throughput for the AI factory and ultra-responsive latency for interactive experiences.