NVIDIA
NVIDIA Groq 3 LPX inference accelerator delivers 35x throughput-per-watt improvement for low-latency AI workloads
· release · platform · performance · model · developer.nvidia.com ↗

NVIDIA Groq 3 LPX: A New Inference Accelerator for the AI Factory

NVIDIA has announced Groq 3 LPX, a rack-scale inference accelerator designed to complement the Vera Rubin NVL72 GPU platform. Built around 256 interconnected Groq 3 LPU accelerators, the system is optimized for low-latency inference workloads, particularly for agentic AI systems that demand both high throughput and predictable per-token latency.

Key Technical Specifications

NVIDIA quotes the following headline specifications for the LPX system:

  • 315 PFLOPS of FP8 compute performance at rack scale
  • 40 PB/s on-chip SRAM bandwidth per accelerator
  • 640 TB/s rack-scale interconnect bandwidth
  • 128 GB total SRAM capacity
  • Heterogeneous architecture pairing LPUs with Vera Rubin NVL72 GPUs
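These rack-level figures can be cross-checked against each other. Dividing the totals by the 256 LPUs gives per-accelerator shares; this is an illustrative back-of-envelope calculation, not an official per-chip breakdown from NVIDIA:

```python
# Illustrative arithmetic from the published rack-scale figures.
# Per-accelerator shares are derived here, not officially specified.

NUM_LPUS = 256                 # accelerators per rack
RACK_FP8_PFLOPS = 315          # rack-scale FP8 compute
RACK_SRAM_GB = 128             # total SRAM capacity
RACK_INTERCONNECT_TBPS = 640   # rack-scale interconnect bandwidth

fp8_tflops_per_lpu = RACK_FP8_PFLOPS * 1000 / NUM_LPUS         # ~1230 TFLOPS
sram_mb_per_lpu = RACK_SRAM_GB * 1024 / NUM_LPUS               # 512 MB
interconnect_tbps_per_lpu = RACK_INTERCONNECT_TBPS / NUM_LPUS  # 2.5 TB/s

print(f"{fp8_tflops_per_lpu:.0f} TFLOPS FP8 per LPU")
print(f"{sram_mb_per_lpu:.0f} MB SRAM per LPU")
print(f"{interconnect_tbps_per_lpu:.1f} TB/s interconnect per LPU")
```

The 512 MB of SRAM per LPU is notable: it implies model weights must be sharded across many accelerators, which is consistent with the emphasis on rack-scale interconnect bandwidth.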

Architecture & Heterogeneous Serving

Rather than replacing existing GPU infrastructure, LPX creates a heterogeneous inference path in which each workload component is routed to the hardware best suited to it:

  • Vera Rubin NVL72 handles prefill and decode attention, maintaining high throughput for long-context processing
  • Groq 3 LPX accelerates latency-sensitive FFN and MoE expert execution during token generation
  • NVIDIA Dynamo orchestrates request routing to optimize responsiveness without sacrificing overall AI factory throughput
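The routing split described above can be sketched as a simple dispatcher. Every name in this snippet (`Phase`, `Stage`, `route`, the backend labels) is a hypothetical illustration of the policy the article describes, not the actual NVIDIA Dynamo API:

```python
# Hypothetical sketch of the heterogeneous routing policy described above.
# None of these identifiers correspond to real NVIDIA Dynamo interfaces.
from dataclasses import dataclass
from enum import Enum, auto

class Phase(Enum):
    PREFILL = auto()           # long-context prompt ingest
    DECODE_ATTENTION = auto()  # attention during token generation
    DECODE_FFN = auto()        # FFN / MoE expert execution

@dataclass
class Stage:
    phase: Phase

def route(stage: Stage) -> str:
    """Send each model stage to the backend named in the article."""
    if stage.phase in (Phase.PREFILL, Phase.DECODE_ATTENTION):
        return "vera-rubin-nvl72"  # throughput-oriented GPU pool
    return "groq3-lpx"             # latency-sensitive FFN/MoE path

print(route(Stage(Phase.PREFILL)))     # -> vera-rubin-nvl72
print(route(Stage(Phase.DECODE_FFN)))  # -> groq3-lpx
```

The interesting design choice is that the split happens per model stage within a single request, not per request: attention and expert layers of the same token step can land on different hardware, with the orchestration layer hiding the handoff.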

Performance Claims & Use Cases

NVIDIA claims the combined Vera Rubin + LPX architecture delivers:

  • 35x higher inference throughput per megawatt compared to alternatives
  • 10x more revenue opportunity for trillion-parameter model serving
  • Deterministic, low-jitter execution for stable tail latencies even at high concurrency

The system targets emerging agentic AI workloads where generation speeds approach 1,000+ tokens per second per user, enabling "speed of thought" computing for multi-agent systems and real-time AI collaboration experiences.
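A rate of 1,000 tokens per second per user implies a per-token budget of just one millisecond, which is why the deterministic, low-jitter execution claim matters; a quick illustrative calculation (the serial multi-agent scenario is an assumption for illustration, not a figure from the announcement):

```python
# Per-token latency budget implied by the quoted generation speed.
tokens_per_second = 1000
budget_ms = 1000 / tokens_per_second  # 1.0 ms per token, end to end

# Illustrative assumption: a pipeline of agents that consume each other's
# output strictly serially divides the effective end-to-end rate.
agents = 10
effective_tps = tokens_per_second / agents  # 100 tok/s seen by the user

print(f"Per-token budget: {budget_ms:.1f} ms")
print(f"Effective rate through {agents} serial agents: {effective_tps:.0f} tok/s")
```

At a 1 ms budget, even small tail-latency spikes dominate perceived responsiveness, so jitter control is arguably as important as raw throughput for this workload class.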

Availability & Deployment

LPX is integrated with NVIDIA's MGX ETL rack architecture and can be deployed alongside Vera Rubin NVL72 within existing data center infrastructure. The system is designed for production inference serving in large-scale AI factories.