NVIDIA launches Groq 3 LPX inference accelerator for low-latency agentic AI, claims 35x throughput per megawatt advantage
· release · platform · performance · integration · developer.nvidia.com ↗

NVIDIA Introduces Groq 3 LPX Inference Accelerator

NVIDIA has unveiled the Groq 3 LPX, a new rack-scale inference accelerator co-designed to operate alongside the NVIDIA Vera Rubin NVL72 rack-scale GPU system for next-generation agentic AI. The platform is optimized for low-latency, large-context inference workloads where predictable per-token generation speed is critical to interactive AI experiences.

Key Architecture and Performance

The LPX system is built around 256 interconnected Groq 3 LPU accelerators organized into 32 liquid-cooled compute trays. The architecture emphasizes deterministic, compiler-orchestrated execution to minimize inference jitter and deliver stable latency even under high concurrency:

  • 315 PFLOPS of FP8 inference compute at rack scale
  • 128 GB total SRAM capacity with 40 PB/s on-chip SRAM bandwidth
  • 640 TB/s scale-up (chip-to-chip) bandwidth for coordinated rack-level execution
  • Up to 35x higher inference throughput per megawatt compared to prior solutions
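As a back-of-envelope illustration of why the 40 PB/s SRAM bandwidth figure matters for decode latency, the sketch below computes the memory-bandwidth upper bound on token generation, under the standard assumption that each decoded token streams the full weight set once. The 70B-parameter FP8 model size is a hypothetical example, not a figure from the announcement.

```python
def decode_bound_tokens_per_s(mem_bandwidth_bps: float, weight_bytes: float) -> float:
    """Memory-bandwidth upper bound on decode throughput.

    Assumes decode is bandwidth-limited and every generated token
    requires streaming the full model weights once from SRAM.
    """
    return mem_bandwidth_bps / weight_bytes

SRAM_BW = 40e15        # 40 PB/s rack-level SRAM bandwidth, from the spec above
WEIGHTS_FP8 = 70e9     # hypothetical 70B-parameter model at 1 byte/param (FP8)

bound = decode_bound_tokens_per_s(SRAM_BW, WEIGHTS_FP8)
print(f"upper bound: {bound:,.0f} tokens/s across the rack")
```

Real sustained throughput sits well below this ceiling once attention, activations, and inter-chip coordination are accounted for, but the bound shows why keeping weights resident in SRAM rather than HBM changes the latency picture.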

Heterogeneous Serving Strategy

LPX is designed to work in tandem with Vera Rubin NVL72, creating a split-brain inference architecture:

  • Vera Rubin NVL72 handles prefill and decode attention (flexible, general-purpose)
  • Groq 3 LPX handles latency-sensitive FFN (feed-forward network) and MoE (mixture-of-experts) decode operations
  • NVIDIA Dynamo orchestrates request routing and disaggregated serving between the two systems

This heterogeneous approach allows data centers to sustain high overall AI factory throughput while delivering the sub-100ms tail latencies required for interactive and agentic AI applications.
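The split described above can be sketched as a toy phase-based router. The pool names, request shape, and `route` function here are illustrative assumptions only; they are not NVIDIA Dynamo's actual API.

```python
from dataclasses import dataclass

# Hypothetical pool labels for the two halves of the split architecture.
GPU_POOL = "vera_rubin_nvl72"   # prefill + decode attention (flexible)
LPU_POOL = "groq3_lpx"          # latency-sensitive FFN / MoE decode

@dataclass
class InferenceRequest:
    request_id: str
    phase: str   # "prefill", "decode_attention", or "decode_ffn"

def route(req: InferenceRequest) -> str:
    """Send each execution phase to the hardware pool suited for it."""
    if req.phase in ("prefill", "decode_attention"):
        return GPU_POOL
    if req.phase == "decode_ffn":
        return LPU_POOL
    raise ValueError(f"unknown phase: {req.phase!r}")
```

In a real disaggregated deployment the orchestrator would also manage KV-cache transfer and batching between pools; this sketch only captures the routing decision.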

Use Cases and Deployment

The system targets emerging workloads where speed of thought matters:

  • Multi-agent systems requiring coordinated reasoning across multiple AI agents
  • Long-context inference with stable latency across large token windows
  • High-concurrency serving where responsive per-token generation is a competitive advantage
  • Speculative decoding for further acceleration of token generation
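Speculative decoding, the last item above, can be illustrated with a minimal draft-and-verify loop. This is a generic greedy sketch of the technique, not NVIDIA-specific code; `draft_next` and `target_next` stand in for a small fast model and the large target model.

```python
def speculative_decode_step(draft_next, target_next, context, k=4):
    """One speculative step: the draft proposes k tokens, the target
    accepts the longest agreeing prefix plus its own correction."""
    # Draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # Target model verifies: keep tokens while it agrees with the draft.
    accepted, ctx = [], list(context)
    for tok in proposed:
        target_tok = target_next(ctx)
        if target_tok == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            # On the first mismatch, take the target's token and stop.
            accepted.append(target_tok)
            break
    return accepted

# Toy deterministic "models" that emit successive characters of a string.
TARGET, DRAFT = "hello world", "hello there"
target_next = lambda ctx: TARGET[len(ctx)]
draft_next = lambda ctx: DRAFT[len(ctx)]

print(speculative_decode_step(draft_next, target_next, [], k=4))
```

Because verification of k draft tokens can be batched into one target pass, each step yields between 1 and k tokens for roughly one target-model invocation, which is why the technique pairs well with a low-latency decode engine.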

NVIDIA positions LPX as a natural complement to its broader Vera Rubin platform, deployable within existing MGX ETL rack infrastructure for seamless integration into current data center footprints.