NVIDIA unveils Groq 3 LPX inference accelerator; claims 35x throughput-per-watt improvement for low-latency AI workloads
· release · platform · api · performance · developer.nvidia.com

NVIDIA Groq 3 LPX: Low-Latency Inference Accelerator

NVIDIA has introduced the Groq 3 LPX, a new rack-scale inference accelerator co-designed with the Vera Rubin NVL72 GPU system to meet the demanding requirements of agentic AI systems. The system is purpose-built for applications that require both high throughput and predictable, low-latency token generation, a capability that becomes critical as AI systems move toward "speed of thought" computing with generation rates approaching 1,000 tokens per second per user.

Architecture and Specifications

The LPX system is built around 256 interconnected Groq 3 LPU accelerators arranged in 32 liquid-cooled 1U compute trays. Key performance metrics include:

  • 315 PFLOPS of FP8 compute capacity
  • 128 GB total on-chip SRAM
  • 40 PB/s on-chip SRAM bandwidth
  • 640 TB/s scale-up rack-level bandwidth
  • Deterministic, compiler-orchestrated execution for stable per-token latency
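
As a rough sanity check, the sketch below divides these rack-level totals evenly across the 256 LPUs. The even split is our assumption; NVIDIA has published only rack-level figures, so the per-chip numbers are estimates.

```python
# Back-of-the-envelope per-chip breakdown of the published rack-level specs.
# Assumes resources divide evenly across all 256 LPUs (our assumption;
# NVIDIA has not released per-chip figures).

NUM_LPUS = 256
NUM_TRAYS = 32

rack_fp8_pflops = 315.0      # total FP8 compute, PFLOPS
rack_sram_gb = 128.0         # total on-chip SRAM, GB
rack_sram_bw_tbs = 40_000.0  # 40 PB/s of SRAM bandwidth, in TB/s

print(f"LPUs per tray:   {NUM_LPUS // NUM_TRAYS}")                  # 8
print(f"FP8 per LPU:     {rack_fp8_pflops / NUM_LPUS:.2f} PFLOPS")  # ~1.23
print(f"SRAM per LPU:    {rack_sram_gb / NUM_LPUS * 1024:.0f} MB")  # 512
print(f"SRAM BW per LPU: {rack_sram_bw_tbs / NUM_LPUS:.1f} TB/s")   # ~156.2
```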

The architecture emphasizes explicit data movement and high-radix chip-to-chip communication to minimize inference jitter and maintain predictable performance even under high concurrency.
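
To make the compiler-orchestrated model concrete, the toy sketch below shows the general idea of a fully static schedule: every operation is pinned to a cycle and a functional unit at compile time, so execution is a replay with no runtime arbitration. All names here are invented for illustration and do not reflect Groq's or NVIDIA's actual toolchain.

```python
# Toy illustration of compiler-orchestrated (static) scheduling.
# Everything below is hypothetical; it is not a real LPU programming model.

from dataclasses import dataclass

@dataclass(frozen=True)
class Op:
    cycle: int   # issue cycle, fixed at compile time
    unit: str    # functional unit that executes the op
    action: str  # e.g. load, matmul, send

# The "compiler" emits a complete schedule ahead of time: every data
# movement and compute op has a fixed cycle and unit, so there is no
# runtime arbitration or queueing to introduce jitter.
schedule = [
    Op(cycle=0, unit="sram0", action="load weights tile"),
    Op(cycle=1, unit="mxm0",  action="matmul activations x weights"),
    Op(cycle=2, unit="c2c0",  action="send partials to neighbor chip"),
    Op(cycle=3, unit="mxm0",  action="accumulate returned partials"),
]

def run(schedule):
    # Execution just replays the schedule in cycle order, so per-token
    # latency is identical on every invocation (deterministic by design).
    for op in sorted(schedule, key=lambda o: o.cycle):
        print(f"cycle {op.cycle}: [{op.unit}] {op.action}")

run(schedule)
```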

Heterogeneous Inference with Vera Rubin

LPX is designed to work in tandem with Vera Rubin NVL72, creating a heterogeneous inference architecture. NVIDIA's Dynamo orchestration layer classifies requests and routes workloads:

  • Vera Rubin GPUs handle prefill operations and decode attention (leveraging their flexible, general-purpose compute)
  • Groq 3 LPX accelerates latency-sensitive FFN and MoE expert execution during decode

This division of labor enables data centers to maintain high aggregate AI factory throughput while delivering the responsive, low-latency interactive experiences required for agentic applications.
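
A minimal sketch of this routing policy follows, assuming a simple phase-based classifier. The names below are hypothetical stand-ins; Dynamo exposes its own APIs, which this toy does not reproduce.

```python
# Hypothetical sketch of phase-based request routing in a heterogeneous
# inference deployment; not Dynamo's actual API.

from enum import Enum, auto

class Phase(Enum):
    PREFILL = auto()           # process the full prompt once
    DECODE_ATTENTION = auto()  # attention over the KV cache, per token
    DECODE_FFN_MOE = auto()    # FFN / MoE expert execution, per token

def route(phase: Phase) -> str:
    # Prefill and decode-time attention stay on the general-purpose GPUs;
    # the latency-critical FFN/MoE decode work goes to the LPX rack.
    if phase in (Phase.PREFILL, Phase.DECODE_ATTENTION):
        return "vera-rubin-nvl72"
    return "groq3-lpx"

for phase in Phase:
    print(f"{phase.name:>16} -> {route(phase)}")
```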

Performance Claims and Use Cases

NVIDIA claims LPX delivers 35x higher inference throughput per megawatt and 10x more revenue opportunity for trillion-parameter models compared to baseline configurations. The system targets emerging workloads including:

  • Multi-agent systems requiring rapid agent-to-agent communication
  • Long-context processing with stable tail latency
  • Speculative decoding pipelines for LLMs (a toy sketch follows this list)
  • Premium AI services demanding predictable response times
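
For the speculative-decoding item above, the generic draft-and-verify loop below shows why the workload benefits from stable per-token decode latency: each step issues a burst of k proposals that the target model must verify quickly. The stub "models" are trivial placeholders, not any real NVIDIA or Groq API.

```python
# Generic speculative-decoding loop with trivial stub models.
# All functions here are toy placeholders for illustration only.

def draft_propose(context, k):
    # Stub draft model: cheaply propose the next k tokens.
    return [f"tok{len(context) + i}" for i in range(k)]

def target_accepts(context, token):
    # Stub target model: accept proposals by a fixed toy rule
    # (reject every 4th position to mimic occasional disagreement).
    return len(context) % 4 != 3

def speculative_step(context, k=4):
    accepted = []
    for token in draft_propose(context, k):
        if target_accepts(context + accepted, token):
            accepted.append(token)  # proposal verified, keep it
        else:
            break                   # first rejection ends the run
    # Fall back to one ordinary decode step if nothing was accepted.
    return accepted or [f"tok{len(context)}"]

context = []
for _ in range(3):
    context += speculative_step(context)
print(context)
```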

The LPX integrates with NVIDIA's MGX ETL rack architecture, allowing seamless deployment alongside Vera Rubin NVL72 within existing data center infrastructure.