Overview
NVIDIA has launched the Groq 3 LPX, a new rack-scale inference accelerator purpose-built for the demands of next-generation agentic AI systems. The LPX is co-designed with NVIDIA's Vera Rubin NVL72 rack-scale GPU system to form a heterogeneous inference architecture that balances high throughput with ultra-low latency, which is critical for applications requiring fast, predictable token generation at scale.
Key Specifications and Performance
The LPX system is built around 256 interconnected NVIDIA Groq 3 LPU accelerators and delivers the following rack-scale metrics:
- 315 PFLOPS of FP8 inference compute
- 128 GB total SRAM capacity
- 40 PB/s on-chip SRAM bandwidth
- 640 TB/s scale-up chip-to-chip bandwidth
- Up to 35x higher inference throughput per megawatt compared to prior generations
- Up to 10x more revenue opportunity for trillion-parameter model serving
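Dividing the rack-scale figures above by the 256 LPU accelerators gives a rough per-chip budget. This is a back-of-the-envelope sketch only; the actual per-chip partitioning of compute, SRAM, and bandwidth is not stated in the specifications:

```python
RACK_LPUS = 256  # accelerators per LPX rack, per the specs above

# Rack-level figures from the specification list
rack = {
    "fp8_pflops": 315,      # FP8 inference compute, PFLOPS
    "sram_gb": 128,         # total SRAM capacity, GB
    "sram_bw_pbs": 40,      # on-chip SRAM bandwidth, PB/s
    "scaleup_bw_tbs": 640,  # scale-up chip-to-chip bandwidth, TB/s
}

# Naive even split across all LPUs in the rack
per_lpu = {k: v / RACK_LPUS for k, v in rack.items()}

print(f"FP8 compute per LPU: {per_lpu['fp8_pflops']:.3f} PFLOPS")       # ~1.230
print(f"SRAM per LPU:        {per_lpu['sram_gb'] * 1024:.0f} MB")       # 512 MB
print(f"SRAM BW per LPU:     {per_lpu['sram_bw_pbs'] * 1000:.2f} TB/s") # 156.25 TB/s
```

The standout ratio is bandwidth to capacity: roughly 156 TB/s of SRAM bandwidth against 512 MB of SRAM per chip, which is consistent with the deterministic, SRAM-resident execution style described below.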
Architecture and Deployment Model
The LPX leverages a deterministic, compiler-orchestrated execution model with explicit data movement and high-bandwidth on-chip memory to minimize inference jitter and deliver stable per-token latency even under high concurrency. When deployed alongside Vera Rubin NVL72, the system uses heterogeneous decode orchestration:
- Vera Rubin NVL72 GPUs handle prefill and decode attention operations
- LPX LPU accelerators handle latency-sensitive FFN (feedforward network) and MoE (mixture of experts) decode operations
- NVIDIA Dynamo orchestrates request classification and disaggregated serving across both systems
This split keeps interactive inference responsive while maintaining high AI factory throughput for batch processing.
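The orchestration split above can be sketched as a simple phase-based dispatcher. The class, queue, and function names below are illustrative stand-ins, not NVIDIA Dynamo's actual API:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Phase(Enum):
    PREFILL = auto()           # prompt processing
    DECODE_ATTENTION = auto()  # per-token attention during decode
    DECODE_FFN_MOE = auto()    # per-token FFN / MoE expert compute

@dataclass
class Op:
    request_id: str
    phase: Phase

# Hypothetical work queues standing in for the two back ends.
gpu_queue: list[Op] = []  # Vera Rubin NVL72: prefill + decode attention
lpu_queue: list[Op] = []  # LPX LPUs: latency-sensitive FFN/MoE decode

def dispatch(op: Op) -> None:
    """Route an operation to the back end described in the text."""
    if op.phase in (Phase.PREFILL, Phase.DECODE_ATTENTION):
        gpu_queue.append(op)
    else:
        lpu_queue.append(op)

# One request passing through all three phases
for phase in Phase:
    dispatch(Op("req-1", phase))

print(len(gpu_queue), len(lpu_queue))  # 2 1
```

In the real system this classification happens per request and per layer, and the disaggregated serving layer also has to move activations between the two back ends; the sketch only captures the routing rule.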
Use Cases and Impact
The LPX is optimized for emerging AI workloads including:
- Multi-agent systems requiring coordinated execution of multiple specialized agents
- Agentic AI applications operating at "speed of thought" (approaching 1,000 tokens per second per user)
- Long-context processing across very large context windows
- Speculative decoding for LLM acceleration
- Real-time collaborative AI experiences that feel less like turn-based chat and more like continuous interaction
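Speculative decoding, one of the workloads listed above, can be illustrated with a toy draft-and-verify loop. This is purely schematic: real systems pair a small, fast draft model with the large target model, and the acceptance rule compares the two models' token probabilities rather than flipping a coin as below:

```python
import random

random.seed(0)

def draft_tokens(prefix, k):
    # Stand-in for a small draft model proposing k tokens cheaply
    return [prefix[-1] + i + 1 for i in range(k)]

def verify(prefix, proposed):
    # Stand-in for the large model: accept each proposed token with
    # some probability and stop at the first rejection
    accepted = []
    for tok in proposed:
        if random.random() < 0.7:
            accepted.append(tok)
        else:
            break
    # On rejection (or after accepting everything) the verifier emits one
    # token of its own, so every verify step makes progress
    accepted.append((prefix + accepted)[-1] + 1)
    return accepted

tokens = [0]
while len(tokens) < 16:
    tokens.extend(verify(tokens, draft_tokens(tokens, k=4)))
```

The payoff is that each verify step can commit several tokens at once instead of one, which is exactly the kind of burst-decode pattern that benefits from the LPX's low, stable per-token latency.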
The combination of Vera Rubin NVL72 and LPX addresses a fundamental shift in AI infrastructure needs: supporting both the high aggregate token production required by large-scale AI factories and the low, predictable latency essential for interactive agentic systems.