NVIDIA Groq 3 LPX: Low-Latency Inference Accelerator
NVIDIA has introduced the Groq 3 LPX, a new rack-scale inference accelerator co-designed with the Vera Rubin NVL72 platform to meet the demanding requirements of agentic AI systems. The system is purpose-built for applications that need both high throughput and predictable, low-latency token generation—a critical capability as AI systems move toward "speed of thought computing," with generation rates approaching 1,000 tokens per second per user.
Architecture and Specifications
The LPX system is built around 256 interconnected Groq 3 LPU accelerators arranged in 32 liquid-cooled 1U compute trays. Key performance metrics include:
- 315 PFLOPS of FP8 compute capacity
- 128 GB total on-chip SRAM
- 40 PB/s on-chip SRAM bandwidth
- 640 TB/s scale-up rack-level bandwidth
- Deterministic, compiler-orchestrated execution for stable per-token latency
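The per-accelerator figures implied by these rack totals can be sanity-checked with quick arithmetic. The sketch below assumes all 256 LPUs contribute uniformly, which the specifications imply but do not state:

```python
# Derive per-accelerator figures from the rack-level totals above,
# assuming a uniform split across all 256 LPUs (an idealization).
NUM_LPUS = 256
TRAYS = 32

rack = {
    "fp8_pflops": 315,       # total FP8 compute, PFLOPS
    "sram_gb": 128,          # total on-chip SRAM, GB
    "sram_bw_tbps": 40_000,  # on-chip SRAM bandwidth, TB/s (40 PB/s)
    "scaleup_bw_tbps": 640,  # rack-level scale-up bandwidth, TB/s
}

lpus_per_tray = NUM_LPUS // TRAYS                      # 8 LPUs per 1U tray
sram_per_lpu_mb = rack["sram_gb"] * 1024 / NUM_LPUS    # 512 MB per LPU
fp8_per_lpu_tflops = rack["fp8_pflops"] * 1000 / NUM_LPUS
sram_bw_per_lpu_tbps = rack["sram_bw_tbps"] / NUM_LPUS

print(f"{lpus_per_tray} LPUs/tray, {sram_per_lpu_mb:.0f} MB SRAM/LPU")
print(f"{fp8_per_lpu_tflops:.0f} TFLOPS FP8/LPU, "
      f"{sram_bw_per_lpu_tbps:.2f} TB/s SRAM BW/LPU")
```

The resulting 512 MB of SRAM per LPU underscores why explicit, compiler-planned data movement matters: model weights must be carefully partitioned across the rack rather than cached opportunistically.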
The architecture emphasizes explicit data movement and high-radix chip-to-chip communication to minimize inference jitter and maintain predictable performance even under high concurrency.
Heterogeneous Inference with Vera Rubin
LPX is designed to work in tandem with Vera Rubin NVL72, creating a heterogeneous inference architecture. NVIDIA's Dynamo orchestration layer classifies requests and routes workloads:
- Vera Rubin GPUs handle prefill operations and decode attention (leveraging their flexible, general-purpose compute)
- Groq 3 LPX accelerates latency-sensitive FFN and MoE expert execution during decode
This division of labor enables data centers to maintain high aggregate AI factory throughput while delivering the responsive, low-latency interactive experiences required for agentic applications.
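The routing policy described above can be sketched as a simple classifier over per-layer work items. All names here (`Shard`, `Pool`, `route`) are illustrative assumptions, not NVIDIA's Dynamo API:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Pool(Enum):
    RUBIN_GPU = auto()  # prefill + decode attention
    GROQ_LPX = auto()   # latency-sensitive decode FFN / MoE experts

@dataclass
class Shard:
    layer: int
    op: str    # "attention", "ffn", or "moe_expert"
    phase: str # "prefill" or "decode"

def route(shard: Shard) -> Pool:
    """Mirror the division of labor above: prefill stays on Vera Rubin
    GPUs, as does decode attention; decode-phase FFN and MoE expert
    execution goes to the Groq 3 LPX."""
    if shard.phase == "prefill":
        return Pool.RUBIN_GPU
    if shard.op in ("ffn", "moe_expert"):
        return Pool.GROQ_LPX
    return Pool.RUBIN_GPU  # decode attention
```

For example, `route(Shard(0, "moe_expert", "decode"))` lands on the LPX pool, while the same operation during prefill stays on the GPUs.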
Performance Claims and Use Cases
NVIDIA claims LPX delivers 35x higher inference throughput per megawatt and a 10x greater revenue opportunity for trillion-parameter models than baseline configurations. The system targets emerging workloads including:
- Multi-agent systems requiring rapid agent-to-agent communication
- Long-context processing with stable tail latency
- Speculative decoding pipelines for LLMs
- Premium AI services demanding predictable response times
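Of these workloads, speculative decoding maps most directly onto a fast decode engine: a cheap draft model proposes several tokens, and the target model verifies them in one pass, so tokens are emitted faster than one target-model step per token. The sketch below is a simplified greedy-verification variant (not the full rejection-sampling scheme), with toy functions standing in for real models:

```python
def greedy_speculative_step(ctx, draft, target, k=4):
    """One step of greedy speculative decoding: the draft model proposes
    k tokens autoregressively; the target model checks each one and
    accepts the agreeing prefix, then contributes one token itself
    (a correction on mismatch, or a bonus token if all k are accepted)."""
    # Draft phase: propose k tokens cheaply.
    proposal, c = [], list(ctx)
    for _ in range(k):
        t = draft(tuple(c))
        proposal.append(t)
        c.append(t)

    # Verification phase (parallel on real hardware; sequential here).
    accepted, c = [], list(ctx)
    for t in proposal:
        expected = target(tuple(c))
        if expected != t:
            accepted.append(expected)  # target's correction ends the step
            return accepted
        accepted.append(t)
        c.append(t)
    accepted.append(target(tuple(c)))  # all accepted: free bonus token
    return accepted

# Toy stand-ins: a token is an int, a "model" maps context -> next token.
target_model = lambda ctx: len(ctx) % 7
draft_model = lambda ctx: (len(ctx) + 1) % 7 if len(ctx) % 3 == 1 else len(ctx) % 7

print(greedy_speculative_step((1, 2), draft_model, target_model))
```

Each step emits between one token (immediate mismatch) and k+1 tokens (full acceptance plus the bonus), which is why stable per-token latency on the verification side matters so much for end-to-end generation rate.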
The LPX integrates with NVIDIA's MGX ETL rack architecture, allowing seamless deployment alongside Vera Rubin NVL72 within existing data center infrastructure.