NVIDIA Groq 3 LPX: Low-Latency Inference for Agentic AI
NVIDIA has announced the Groq 3 LPX, a new rack-scale inference accelerator designed to power next-generation agentic AI systems that require both high throughput and ultra-low latency. The system is co-designed with the Vera Rubin NVL72 GPU to create a heterogeneous inference architecture in which each component handles the workloads it is best suited for.
Key Specifications and Performance
Headline specifications for the rack-scale LPX system:
- 315 PFLOPS of FP8 inference compute
- 128 GB total SRAM capacity
- 40 PB/s on-chip SRAM bandwidth
- 640 TB/s scale-up bandwidth across 256 chips
- 35x higher inference throughput per megawatt compared to alternatives
- 10x more revenue opportunity for trillion-parameter models
The system is built around 256 interconnected Groq 3 LPU accelerators housed in 32 liquid-cooled 1U compute trays; the design emphasizes deterministic execution and high-speed communication to minimize inference jitter.
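The rack-level figures above imply some useful per-chip numbers. The sketch below simply divides the published rack totals by the 256-chip count; these are back-of-the-envelope derivations, not official per-chip specifications.

```python
# Derive rough per-chip figures from the rack-level LPX specs above.
# Assumption: resources are divided evenly across the 256 LPU chips.
RACK_CHIPS = 256
RACK_TRAYS = 32

rack = {
    "fp8_pflops": 315,   # FP8 inference compute, rack total
    "sram_gb": 128,      # total SRAM capacity
    "sram_bw_pbps": 40,  # on-chip SRAM bandwidth, PB/s
}

chips_per_tray = RACK_CHIPS // RACK_TRAYS  # 256 chips / 32 trays = 8
per_chip = {k: v / RACK_CHIPS for k, v in rack.items()}

print(f"chips per 1U tray: {chips_per_tray}")
print(f"per-chip FP8 compute: {per_chip['fp8_pflops'] * 1000:.0f} TFLOPS")
print(f"per-chip SRAM: {per_chip['sram_gb'] * 1024:.0f} MB")
print(f"per-chip SRAM bandwidth: {per_chip['sram_bw_pbps'] * 1000:.2f} TB/s")
```

Each tray works out to 8 chips, with roughly 512 MB of SRAM per chip, consistent with an SRAM-resident, latency-first design.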
Architecture and Heterogeneous Serving
LPX operates as a complement to Vera Rubin NVL72 within the broader Vera Rubin platform. The heterogeneous architecture distributes inference workloads strategically:
- Prefill and decode attention run on Vera Rubin NVL72 GPUs for high throughput
- Latency-sensitive FFN and MoE expert execution runs on LPX for fast token generation
- NVIDIA Dynamo orchestrates request routing and disaggregated serving to maintain responsiveness
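The split above can be sketched as a toy request router. This is an illustration of the disaggregated-serving idea only, assuming hypothetical pool names and stage labels; it is not the NVIDIA Dynamo API.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Stage(Enum):
    """Phases of transformer inference, per the workload split above."""
    PREFILL_ATTENTION = auto()  # compute-heavy, throughput-bound
    DECODE_ATTENTION = auto()   # memory-bandwidth-bound
    FFN_MOE = auto()            # latency-sensitive expert execution

@dataclass(frozen=True)
class Pool:
    name: str

# Hypothetical pool names for the two halves of the platform.
GPU_POOL = Pool("vera-rubin-nvl72")
LPX_POOL = Pool("groq3-lpx")

def route(stage: Stage) -> Pool:
    """Toy router: attention stages go to the GPU pool for throughput;
    FFN/MoE expert execution goes to LPX for low-latency generation."""
    if stage is Stage.FFN_MOE:
        return LPX_POOL
    return GPU_POOL
```

In a real deployment the orchestrator would also manage KV-cache transfer between the pools and batch requests per stage, which is the disaggregated-serving role the article attributes to Dynamo.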
Use Cases and Vision
The combination enables new capabilities for emerging agentic workloads:
- Multi-agent systems that coordinate to accomplish complex tasks
- Speed-of-thought computing approaching 1,000 tokens per second per user
- Large-context processing with stable, predictable per-token latency even at high concurrency
- Speculative decoding for LLMs alongside multi-agent coordination
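The 1,000 tokens/s/user target implies a 1 ms per-token budget, and speculative decoding relaxes the per-step budget by emitting several tokens per verification pass. The sketch below uses the standard expected-accepted-tokens formula from the speculative sampling literature; the draft depth and acceptance rate are assumptions, not published figures.

```python
def expected_tokens_per_step(k: int, a: float) -> float:
    """Expected tokens emitted per verification step when a draft model
    proposes k tokens, each accepted independently with probability a:
    sum of a**i for i in 0..k, i.e. (1 - a**(k+1)) / (1 - a)."""
    return (1 - a ** (k + 1)) / (1 - a)

target_tps = 1000                        # tokens/s/user goal from the article
per_token_budget_ms = 1000 / target_tps  # 1.0 ms per token, naively

k, a = 4, 0.8                            # assumed draft depth / acceptance rate
tokens_per_step = expected_tokens_per_step(k, a)
per_step_budget_ms = per_token_budget_ms * tokens_per_step

print(f"naive per-token budget: {per_token_budget_ms:.1f} ms")
print(f"expected tokens/step (k={k}, a={a}): {tokens_per_step:.2f}")
print(f"per verification-step latency budget: {per_step_budget_ms:.2f} ms")
```

Under these assumed parameters each verification step emits about 3.4 tokens, stretching the 1 ms token budget to roughly 3.4 ms per step, which is the kind of headroom a deterministic low-latency path is meant to hit consistently.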
The LPX integrates with NVIDIA's MGX ETL rack architecture, allowing data centers to deploy dedicated low-latency inference paths alongside existing Vera Rubin NVL72 infrastructure within a unified design.