NVIDIA launches Groq 3 LPX inference accelerator with 35x higher throughput per megawatt
· release · platform · performance · api · developer.nvidia.com ↗

NVIDIA Groq 3 LPX: Low-Latency Inference at Scale

NVIDIA has announced the Groq 3 LPX, a new rack-scale inference accelerator designed for the demands of agentic AI workloads requiring both high throughput and predictable low latency. The system is built around 256 interconnected NVIDIA Groq 3 LPU (Latency Processing Unit) accelerators, each optimized for fast, deterministic token generation in response to user interactions.

Architecture and Performance

The LPX's headline rack-level specifications:

  • 315 PFLOPS of FP8 compute across the rack
  • 40 PB/s of on-chip SRAM bandwidth per chip
  • 640 TB/s scale-up bandwidth across all 256 chips
  • 128 GB total SRAM capacity
  • Up to 35x higher inference throughput per megawatt compared to alternatives
  • Up to 10x greater revenue opportunity when serving trillion-parameter models
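
The rack totals above imply straightforward per-chip figures. A back-of-envelope sketch, assuming resources divide evenly across the 256 LPUs (the announcement does not state per-chip compute or capacity directly):

```python
# Per-chip figures derived from the rack-level specs, assuming an even
# split across all 256 LPU chips (an assumption, not a stated spec).
NUM_CHIPS = 256

rack_fp8_pflops = 315    # FP8 compute, rack total (PFLOPS)
rack_sram_gb = 128       # SRAM capacity, rack total (GB)
rack_scaleup_tbps = 640  # scale-up bandwidth, rack total (TB/s)

fp8_per_chip = rack_fp8_pflops / NUM_CHIPS          # ~1.23 PFLOPS per chip
sram_per_chip_mb = rack_sram_gb * 1024 / NUM_CHIPS  # 512 MB per chip
scaleup_per_chip = rack_scaleup_tbps / NUM_CHIPS    # 2.5 TB/s per chip

print(f"FP8 per chip:      {fp8_per_chip:.2f} PFLOPS")
print(f"SRAM per chip:     {sram_per_chip_mb:.0f} MB")
print(f"Scale-up per chip: {scaleup_per_chip:.2f} TB/s")
```

Note how small the per-chip SRAM budget is (512 MB): serving large models on this architecture depends on the 640 TB/s scale-up fabric spreading weights across the full rack.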

The system emphasizes deterministic, compiler-orchestrated execution and explicit data movement to minimize inference jitter and deliver stable, predictable per-token latency even at high concurrency levels.
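
A minimal illustration of the idea behind compiler-orchestrated execution (this is a hypothetical sketch, not the LPX compiler or its toolchain): which operation runs, and when data moves, is fixed at compile time, so execution order never depends on runtime arbitration and per-token latency stays predictable.

```python
# A static schedule fixed "at compile time": each entry names an op and
# its operand, including explicit data-movement steps. Execution simply
# replays the list in order — no dynamic dispatch, no runtime jitter.
# All op and operand names here are illustrative placeholders.
STATIC_SCHEDULE = [
    ("load",   "weights_ffn_layer0"),  # explicit data movement into SRAM
    ("matmul", "ffn_layer0"),
    ("load",   "weights_ffn_layer1"),
    ("matmul", "ffn_layer1"),
    ("store",  "activations_out"),     # explicit movement back out
]

def execute(schedule):
    """Replay ops in the exact order fixed by the schedule."""
    return [op for op, _operand in schedule]

print(execute(STATIC_SCHEDULE))
```

Because the trace is identical on every run, latency is a property of the schedule itself rather than of runtime contention, which is what makes per-token latency predictable at high concurrency.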

Heterogeneous Serving with Vera Rubin NVL72

LPX is co-designed to work alongside NVIDIA's Vera Rubin NVL72 GPU racks, creating a heterogeneous inference architecture optimized for different workload phases. While Vera Rubin NVL72 handles prefill and decode attention (which benefit from general-purpose compute), the LPX accelerates the latency-sensitive portions of decode, including feedforward network (FFN) and mixture-of-experts (MoE) layers. NVIDIA Dynamo orchestrates this disaggregated serving, routing requests intelligently to maintain both high AI-factory throughput and responsive interactive performance.
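
The phase-based split described above can be sketched as a simple routing rule. This is an illustrative sketch of the disaggregation concept only; it does not use the Dynamo API, and the pool names and phase labels are assumptions:

```python
# Illustrative phase-based routing for disaggregated serving: prefill
# and decode attention go to the GPU pool, while latency-sensitive
# FFN/MoE decode work is offloaded to the LPX pool. Names are made up
# for the sketch — this is not NVIDIA Dynamo's actual interface.
from dataclasses import dataclass

GPU_POOL = "vera_rubin_nvl72"  # general-purpose: prefill + decode attention
LPX_POOL = "groq3_lpx"         # latency-optimized: decode FFN / MoE

@dataclass
class Request:
    request_id: str
    phase: str  # "prefill" | "decode_attention" | "decode_ffn_moe"

def route(req: Request) -> str:
    """Pick a backend pool based on the workload phase."""
    if req.phase in ("prefill", "decode_attention"):
        return GPU_POOL
    return LPX_POOL

print(route(Request("r1", "prefill")))         # → GPU pool
print(route(Request("r2", "decode_ffn_moe")))  # → LPX pool
```

A real orchestrator would also weigh load, KV-cache placement, and batch composition, but the core idea is the same: different phases of one request can land on different hardware.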

Target Use Cases

The LPX enables next-generation AI applications including:

  • Multi-agent systems requiring coordinated agent inference at scale
  • Speed-of-thought computing supporting generation speeds approaching 1,000 tokens per second per user
  • Long-context agentic workloads maintaining performance across trillion-parameter models
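
The speed-of-thought target translates into a concrete per-token latency budget. A quick check of the arithmetic:

```python
# At 1,000 tokens/second per user, the entire decode step — attention,
# FFN/MoE, and any cross-rack data movement — must fit in 1 ms per token.
tokens_per_second = 1_000
latency_budget_ms = 1_000 / tokens_per_second  # ms available per token

print(f"Per-token budget: {latency_budget_ms:.1f} ms")
```

A 1 ms budget leaves little room for scheduling jitter, which is why the announcement pairs this target with deterministic, compiler-orchestrated execution.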

The combination of Vera Rubin NVL72 and LPX addresses a critical infrastructure gap: supporting continuous, reasoning-intensive agentic AI that demands both sustained throughput for the AI factory and ultra-responsive latency for interactive experiences.