NVIDIA Groq 3 LPX: Low-Latency Inference at Rack Scale
NVIDIA has unveiled the Groq 3 LPX, a new rack-scale inference accelerator designed to power next-generation agentic AI systems. Built to operate alongside the Vera Rubin NVL72 GPU, the LPX targets emerging workloads that demand both high throughput and ultra-low, predictable latency, enabling token generation speeds approaching 1,000 tokens per second per user.
The system architecture revolves around 256 interconnected Groq 3 LPU accelerators operating deterministically through compiler-orchestrated execution. Key specifications include:
- 315 PFLOPS of FP8 compute capacity
- 40 PB/s on-chip SRAM bandwidth
- 640 TB/s scale-up chip-to-chip bandwidth
- 128 GB total distributed SRAM
- Integrated into NVIDIA MGX ETL rack infrastructure
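A rough, bandwidth-bound estimate suggests why on-chip SRAM bandwidth is the lever behind the per-user token rate. Only the 40 PB/s figure comes from the spec list above; the model footprint, FP8 weight size, and user count below are illustrative assumptions:

```python
# Back-of-envelope: memory-bandwidth-bound decode rate.
# Assumption (not from the announcement): each generated token must
# stream every active parameter once from SRAM.

SRAM_BW_BYTES_PER_S = 40e15  # 40 PB/s aggregate on-chip bandwidth (from spec list)
ACTIVE_PARAMS = 40e9         # hypothetical: 40B active parameters (e.g. one MoE slice)
BYTES_PER_PARAM = 1          # FP8 weights, 1 byte each

bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM
aggregate_tokens_per_s = SRAM_BW_BYTES_PER_S / bytes_per_token
print(f"aggregate ceiling: {aggregate_tokens_per_s:,.0f} tokens/s")
# -> aggregate ceiling: 1,000,000 tokens/s

# Spread across ~1,000 concurrent users, this ceiling leaves roughly
# 1,000 tokens/s per user before compute or interconnect limits bind.
print(f"per-user ceiling at 1,000 users: {aggregate_tokens_per_s / 1000:,.0f} tokens/s")
```

Under these assumed numbers the bandwidth ceiling lines up with the quoted 1,000 tokens-per-second-per-user target; a larger active footprint or lower-precision weights would shift it proportionally.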
Heterogeneous Serving Architecture
The LPX does not replace Vera Rubin NVL72 GPUs; it complements them in a heterogeneous serving strategy. Rubin handles prefill, decode attention, and throughput-intensive workloads, while the LPX specializes in latency-sensitive decode operations, specifically FFN (feed-forward network) and MoE (mixture-of-experts) execution. This division of labor lets data centers maintain high aggregate token production while delivering responsive interactive experiences.
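The split-decode idea can be sketched as follows. All class and function names here are illustrative stand-ins, not NVIDIA or Groq APIs; the point is only the routing of each layer's sub-blocks to different hardware pools:

```python
# Sketch of heterogeneous decode: attention stays on the GPU pool,
# while each layer's FFN/MoE block is dispatched to the
# latency-optimized accelerator pool. Names are hypothetical.

from dataclasses import dataclass

@dataclass
class DecodeStep:
    token_id: int
    layer: int

class GpuPool:
    """Throughput-oriented pool (stand-in for Rubin GPUs)."""
    def attention(self, step: DecodeStep) -> str:
        # Batched KV-cache attention for the current token.
        return f"attn(layer={step.layer})"

class LpuPool:
    """Latency-oriented pool (stand-in for the LPX accelerators)."""
    def ffn_moe(self, step: DecodeStep, attn_out: str) -> str:
        # Deterministic, compiler-scheduled FFN/MoE execution.
        return f"ffn({attn_out})"

def decode_layer(step: DecodeStep, gpu: GpuPool, lpu: LpuPool) -> str:
    attn_out = gpu.attention(step)      # attention on the GPU pool
    return lpu.ffn_moe(step, attn_out)  # FFN/MoE offloaded to the LPX pool

print(decode_layer(DecodeStep(token_id=7, layer=0), GpuPool(), LpuPool()))
# prints "ffn(attn(layer=0))"
```

The design choice this illustrates: attention is dominated by KV-cache reads that batch well on GPUs, while the FFN/MoE matrix multiplies are where deterministic, SRAM-resident execution pays off in tail latency.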
NVIDIA claims this pairing delivers up to 35x higher inference throughput per megawatt and 10x more revenue opportunity for trillion-parameter models compared to prior configurations.
Enabling Multi-Agent and Agentic Systems
The architecture targets use cases where speed-of-thought computing matters: multi-agent coordination, speculative decoding, and continuous reasoning loops that operate faster than traditional conversational turn-taking. With stable, predictable per-token latency even under high concurrency, the LPX is positioned for mission-critical interactive AI services where tail latency directly impacts user experience.
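Speculative decoding, one of the workloads named above, is a generic technique rather than anything specific to this hardware: a fast draft model proposes a run of tokens and a slower target model accepts the longest agreeing prefix, so one verification pass can emit several tokens. A minimal sketch with toy, hard-coded models:

```python
# Minimal speculative-decoding sketch (generic technique; the models
# below are toy stand-ins with fixed outputs, purely for illustration).

def draft_model(prefix):
    # Hypothetical cheap draft model: proposes 3 tokens per step.
    return ["fast", "tokens", "every", "step"][:3]

def target_model(prefix, proposed):
    # Hypothetical target model: accepts proposals until one disagrees,
    # then substitutes its own token and stops.
    truth = ["fast", "tokens", "beat", "latency"]
    accepted = []
    for i, tok in enumerate(proposed):
        if tok == truth[len(prefix) + i]:
            accepted.append(tok)
        else:
            accepted.append(truth[len(prefix) + i])  # correction token
            break
    return accepted

prefix = []
proposed = draft_model(prefix)              # ["fast", "tokens", "every"]
accepted = target_model(prefix, proposed)
print(accepted)
# prints ['fast', 'tokens', 'beat'] -- three tokens from one verify pass
```

Because the draft pass is cheap and the verify pass is amortized over several tokens, low and predictable per-token latency on the verifier directly multiplies end-to-end generation speed, which is the fit the article is claiming.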
The system is available as part of the broader NVIDIA Vera Rubin platform and can be deployed alongside standard data center infrastructure using the MGX ETL rack design.