NVIDIA unveils Groq 3 LPX inference accelerator; delivers 35x throughput-per-watt improvement for agentic AI
· release · platform · performance · api · developer.nvidia.com ↗

NVIDIA Groq 3 LPX: Low-Latency Inference for Agentic AI

NVIDIA has announced the Groq 3 LPX, a new rack-scale inference accelerator designed to power next-generation agentic AI systems that require both high throughput and ultra-low latency. The system is co-designed with the Vera Rubin NVL72 GPU to create a heterogeneous inference architecture in which each component handles the workloads it is optimized for.

Key Specifications and Performance

The LPX rack-scale system carries the following headline specifications:

  • 315 PFLOPS of FP8 inference compute
  • 128 GB total SRAM capacity
  • 40 PB/s on-chip SRAM bandwidth
  • 640 TB/s scale-up bandwidth across 256 chips
  • 35x higher inference throughput per megawatt compared to alternatives
  • 10x more revenue opportunity for trillion-parameter models
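The published figures are rack-level totals. A quick sketch of the arithmetic shows what they imply per chip; the per-chip breakdown below is our own division of the rack numbers across the 256 accelerators, not an NVIDIA-published figure.

```python
# Deriving approximate per-chip figures from the rack-level specs above.
# Rack totals come from the announcement; per-chip values are derived.

RACK_CHIPS = 256          # Groq 3 LPU accelerators per rack
FP8_PFLOPS = 315          # rack-level FP8 inference compute
SRAM_GB = 128             # total SRAM capacity across the rack
SRAM_BW_PBPS = 40         # aggregate on-chip SRAM bandwidth, PB/s
SCALEUP_TBPS = 640        # scale-up bandwidth across all chips, TB/s

per_chip = {
    "fp8_tflops": FP8_PFLOPS * 1000 / RACK_CHIPS,      # ~1230 TFLOPS
    "sram_mb": SRAM_GB * 1024 / RACK_CHIPS,            # 512 MB
    "sram_bw_tbps": SRAM_BW_PBPS * 1000 / RACK_CHIPS,  # ~156 TB/s
    "scaleup_tbps": SCALEUP_TBPS / RACK_CHIPS,         # 2.5 TB/s
}

for name, value in per_chip.items():
    print(f"{name}: {value:g}")
```

The standout ratio is bandwidth to capacity: roughly 156 TB/s of SRAM bandwidth against 512 MB of SRAM per chip, which is consistent with the system's emphasis on keeping latency-critical weights resident in on-chip memory.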

The system is built around 256 interconnected Groq 3 LPU accelerators arranged in 32 liquid-cooled 1U compute trays, emphasizing deterministic execution and high-speed communication to minimize inference jitter.

Architecture and Heterogeneous Serving

LPX operates as a complement to Vera Rubin NVL72 within the broader Vera Rubin platform. The heterogeneous architecture distributes inference workloads strategically:

  • Prefill and decode attention run on Vera Rubin NVL72 GPUs for high throughput
  • Latency-sensitive FFN and MoE expert execution run on LPX for fast token generation
  • NVIDIA Dynamo orchestrates request routing and disaggregated serving to maintain responsiveness
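The split above can be sketched as a simple routing policy. The `Stage` enum and pool names below are hypothetical stand-ins for illustration; NVIDIA Dynamo's actual routing API is not shown here.

```python
# Illustrative sketch of the heterogeneous-serving split described above.
# Stage names and pool identifiers are hypothetical, chosen to mirror
# the workload partition in the article.
from enum import Enum, auto

class Stage(Enum):
    PREFILL = auto()           # long-context prompt processing
    DECODE_ATTENTION = auto()  # attention over the KV cache
    FFN_MOE = auto()           # feed-forward / MoE expert execution

def route(stage: Stage) -> str:
    """Map an inference stage to the hardware pool that serves it."""
    # Throughput-oriented stages run on the Vera Rubin NVL72 GPUs;
    # latency-sensitive FFN and MoE execution runs on the LPX rack.
    if stage in (Stage.PREFILL, Stage.DECODE_ATTENTION):
        return "vera-rubin-nvl72"
    return "groq3-lpx"
```

In a disaggregated deployment, a policy like this runs per request stage rather than per request, so a single token's forward pass can cross both pools.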

Use Cases and Vision

The combination enables new capabilities for emerging agentic workloads:

  • Multi-agent systems that coordinate to accomplish complex tasks
  • Speed-of-thought computing approaching 1,000 tokens per second per user
  • Large-context processing with stable, predictable per-token latency even at high concurrency
  • Speculative decoding for LLMs alongside multi-agent coordination

The LPX integrates with NVIDIA's MGX ETL rack architecture, allowing data centers to deploy dedicated low-latency inference paths alongside existing Vera Rubin NVL72 infrastructure within a unified design.