NVIDIA Introduces Groq 3 LPX Inference Accelerator
NVIDIA has unveiled the Groq 3 LPX, a new rack-scale inference accelerator co-designed to operate alongside the NVIDIA Vera Rubin NVL72 rack-scale GPU system in next-generation agentic AI deployments. The platform is optimized for low-latency, large-context inference workloads where predictable per-token generation speed is critical to interactive AI experiences.
Key Architecture and Performance
The LPX system is built around 256 interconnected Groq 3 LPU accelerators organized into 32 liquid-cooled compute trays. The architecture emphasizes deterministic, compiler-orchestrated execution to minimize inference jitter and deliver stable latency even under high concurrency:
- 315 PFLOPS of FP8 inference compute at rack scale
- 128 GB total SRAM capacity with 40 PB/s on-chip SRAM bandwidth
- 640 TB/s scale-up (chip-to-chip) bandwidth for coordinated rack-level execution
- Up to 35x higher inference throughput per megawatt compared to prior solutions
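The rack-level figures above imply straightforward per-device numbers. As a sanity check, here is the arithmetic for one LPU and one tray; the constants come from the specs listed, while the per-chip breakdown is our own derivation, not a published spec:

```python
# Back-of-envelope per-LPU figures derived from the rack-scale
# numbers above (256 LPUs across 32 trays per rack). The per-chip
# split is illustrative arithmetic, not an official specification.

LPUS_PER_RACK = 256
TRAYS_PER_RACK = 32

rack_fp8_pflops = 315   # FP8 inference compute at rack scale
rack_sram_gb = 128      # total on-chip SRAM capacity

per_lpu_pflops = rack_fp8_pflops / LPUS_PER_RACK
per_lpu_sram_mb = rack_sram_gb * 1024 / LPUS_PER_RACK
lpus_per_tray = LPUS_PER_RACK / TRAYS_PER_RACK

print(f"{per_lpu_pflops:.2f} PFLOPS FP8 per LPU")  # → 1.23
print(f"{per_lpu_sram_mb:.0f} MB SRAM per LPU")    # → 512
print(f"{lpus_per_tray:.0f} LPUs per tray")        # → 8
```

The 512 MB of SRAM per chip is what allows weights and activations for the hot decode layers to stay entirely on-die, which is the basis of the deterministic-latency claim.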
Heterogeneous Serving Strategy
LPX is designed to work in tandem with Vera Rubin NVL72 in a disaggregated, heterogeneous inference architecture:
- Vera Rubin NVL72 handles prefill and decode attention (flexible, general-purpose)
- Groq 3 LPX handles latency-sensitive FFN (feed-forward network) and MoE (mixture-of-experts) decode operations
- NVIDIA Dynamo orchestrates request routing and disaggregated serving between the two systems
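The phase split above can be sketched as a simple routing decision per inference phase. The `Phase` names and pool identifiers below are illustrative placeholders, not the actual NVIDIA Dynamo API:

```python
# Hypothetical sketch of the disaggregated split described above:
# attention-heavy phases go to the GPU pool, latency-critical
# FFN/MoE decode goes to the LPX pool. Names are illustrative only.

from enum import Enum, auto

class Phase(Enum):
    PREFILL = auto()           # full-prompt attention, compute-bound
    DECODE_ATTENTION = auto()  # per-token attention over the KV cache
    DECODE_FFN = auto()        # feed-forward / MoE expert layers

def route(phase: Phase) -> str:
    """Pick a backend pool for one phase of a request."""
    if phase in (Phase.PREFILL, Phase.DECODE_ATTENTION):
        return "vera-rubin-nvl72"  # flexible, general-purpose GPUs
    return "groq3-lpx"             # deterministic FFN/MoE decode
```

The design choice mirrored here is that attention work scales with context length and benefits from HBM-backed GPUs, while FFN/MoE layers are fixed-shape matrix work that maps well onto SRAM-resident, compiler-scheduled execution.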
This heterogeneous approach allows data centers to sustain high overall AI factory throughput while delivering the sub-100ms tail latencies required for interactive and agentic AI applications.
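Why tail latency, rather than mean latency, is the figure of merit here: an agentic request typically chains several sequential model calls, so per-call jitter compounds and the slowest call in the chain sets the user-visible delay. A toy simulation with made-up numbers (not measurements of any real system) illustrates the effect:

```python
# Illustrative only: per-call jitter compounds across a chain of
# sequential model calls, widening the gap between median (p50)
# and tail (p99) latency. All numbers here are invented.

import random

random.seed(0)

def chain_latency(per_call_ms, jitter_ms, calls=5):
    """Total latency of `calls` sequential model invocations."""
    return sum(per_call_ms + random.uniform(0, jitter_ms)
               for _ in range(calls))

samples = sorted(chain_latency(10, 8) for _ in range(10_000))
p50 = samples[len(samples) // 2]
p99 = samples[int(len(samples) * 0.99)]
print(f"p50 ≈ {p50:.0f} ms, p99 ≈ {p99:.0f} ms")
```

Deterministic, compiler-scheduled execution attacks the `jitter_ms` term directly, which is why it tightens p99 far more than it moves the median.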
Use Cases and Deployment
The system targets emerging workloads where speed of thought matters:
- Multi-agent systems requiring coordinated reasoning across multiple AI agents
- Long-context inference with stable latency across large token windows
- High-concurrency serving where responsive per-token generation is a competitive advantage
- Speculative decoding for further acceleration of token generation
NVIDIA positions LPX as a natural complement to its broader Vera Rubin platform, deployable within standard MGX rack infrastructure for seamless integration into existing data center footprints.