Overview
NVIDIA has launched the Groq 3 LPX, a new rack-scale inference accelerator purpose-built for the demands of next-generation agentic AI systems. The LPX is co-designed with NVIDIA's Vera Rubin NVL72 rack-scale GPU system to form a heterogeneous inference architecture that balances high throughput with ultra-low latency, which is critical for applications requiring fast, predictable token generation at scale.
Key Specifications and Performance
The LPX system is built around 256 interconnected NVIDIA Groq 3 LPU accelerators and delivers the following rack-scale metrics:
- 315 PFLOPS of FP8 inference compute
- 128 GB total SRAM capacity
- 40 PB/s on-chip SRAM bandwidth
- 640 TB/s scale-up chip-to-chip bandwidth
- Up to 35x higher inference throughput per megawatt compared to prior generations
- Up to 10x more revenue opportunity for trillion-parameter model serving
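Dividing the rack-scale figures above by the 256 LPU accelerators gives a rough per-chip budget. This is a back-of-the-envelope sketch only; the actual per-chip partitioning of compute, SRAM, and bandwidth is not stated in the specifications:

```python
RACK_LPUS = 256  # accelerators per LPX rack, per the specs above

# Rack-level figures from the specification list
rack = {
    "fp8_pflops": 315,      # FP8 inference compute, PFLOPS
    "sram_gb": 128,         # total SRAM capacity, GB
    "sram_bw_pbs": 40,      # on-chip SRAM bandwidth, PB/s
    "scaleup_bw_tbs": 640,  # scale-up chip-to-chip bandwidth, TB/s
}

# Naive even split across all LPUs in the rack
per_lpu = {k: v / RACK_LPUS for k, v in rack.items()}

print(f"FP8 compute per LPU: {per_lpu['fp8_pflops']:.3f} PFLOPS")       # ~1.230
print(f"SRAM per LPU:        {per_lpu['sram_gb'] * 1024:.0f} MB")       # 512 MB
print(f"SRAM BW per LPU:     {per_lpu['sram_bw_pbs'] * 1000:.2f} TB/s") # 156.25 TB/s
```

The standout ratio is bandwidth to capacity: roughly 156 TB/s of SRAM bandwidth against 512 MB of SRAM per chip, which is consistent with the deterministic, SRAM-resident execution style described below.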
Architecture and Deployment Model
The LPX leverages a deterministic, compiler-orchestrated execution model with explicit data movement and high-bandwidth on-chip memory to minimize inference jitter and deliver stable per-token latency even under high concurrency. When deployed alongside Vera Rubin NVL72, the system uses heterogeneous decode orchestration:
- Vera Rubin NVL72 GPUs handle prefill and decode attention operations
- LPX LPU accelerators handle latency-sensitive FFN (feedforward network) and MoE (mixture of experts) decode operations
- NVIDIA Dynamo orchestrates request classification and disaggregated serving across both systems
This split keeps interactive inference responsive while maintaining high AI factory throughput for batch processing.
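The orchestration split above can be sketched as a simple phase-based dispatcher. The class, queue, and function names below are illustrative stand-ins, not NVIDIA Dynamo's actual API:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Phase(Enum):
    PREFILL = auto()           # prompt processing
    DECODE_ATTENTION = auto()  # per-token attention during decode
    DECODE_FFN_MOE = auto()    # per-token FFN / MoE expert compute

@dataclass
class Op:
    request_id: str
    phase: Phase

# Hypothetical work queues standing in for the two back ends.
gpu_queue: list[Op] = []  # Vera Rubin NVL72: prefill + decode attention
lpu_queue: list[Op] = []  # LPX LPUs: latency-sensitive FFN/MoE decode

def dispatch(op: Op) -> None:
    """Route an operation to the back end described in the text."""
    if op.phase in (Phase.PREFILL, Phase.DECODE_ATTENTION):
        gpu_queue.append(op)
    else:
        lpu_queue.append(op)

# One request passing through all three phases
for phase in Phase:
    dispatch(Op("req-1", phase))

print(len(gpu_queue), len(lpu_queue))  # 2 1
```

In the real system this classification happens per request and per layer, and the disaggregated serving layer also has to move activations between the two back ends; the sketch only captures the routing rule.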
Use Cases and Impact
The LPX is optimized for emerging AI workloads including:
- Multi-agent systems requiring coordinated execution of multiple specialized agents
- Agentic AI applications operating at "speed of thought" (approaching 1,000 tokens per second per user)
- Long-context processing across very large context windows
- Speculative decoding for LLM acceleration
- Real-time collaborative AI experiences that feel less like turn-based chat and more like continuous interaction
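Speculative decoding, one of the workloads listed above, can be illustrated with a toy draft-and-verify loop. This is purely schematic: real systems pair a small, fast draft model with the large target model, and the acceptance rule compares the two models' token probabilities rather than flipping a coin as below:

```python
import random

random.seed(0)

def draft_tokens(prefix, k):
    # Stand-in for a small draft model proposing k tokens cheaply
    return [prefix[-1] + i + 1 for i in range(k)]

def verify(prefix, proposed):
    # Stand-in for the large model: accept each proposed token with
    # some probability and stop at the first rejection
    accepted = []
    for tok in proposed:
        if random.random() < 0.7:
            accepted.append(tok)
        else:
            break
    # On rejection (or after accepting everything) the verifier emits one
    # token of its own, so every verify step makes progress
    accepted.append((prefix + accepted)[-1] + 1)
    return accepted

tokens = [0]
while len(tokens) < 16:
    tokens.extend(verify(tokens, draft_tokens(tokens, k=4)))
```

The payoff is that each verify step can commit several tokens at once instead of one, which is exactly the kind of burst-decode pattern that benefits from the LPX's low, stable per-token latency.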
The combination of Vera Rubin NVL72 and LPX addresses a fundamental shift in AI infrastructure needs: supporting both the high aggregate token production required by large-scale AI factories and the low, predictable latency essential for interactive agentic systems.