NVIDIA Unveils Groq 3 LPX Inference Accelerator, Delivering 315 PFLOPS for Low-Latency AI Serving
Tags: release, platform, performance, API · Source: developer.nvidia.com

NVIDIA Groq 3 LPX: New Low-Latency Inference Architecture

NVIDIA has unveiled the NVIDIA Groq 3 LPX, a rack-scale inference accelerator designed to power the next generation of interactive AI applications and multi-agent systems. The LPX integrates 256 liquid-cooled NVIDIA Groq 3 LPU (Language Processing Unit) chips and is co-designed to operate alongside the Vera Rubin NVL72 GPU within the broader NVIDIA Vera Rubin platform.

Key Specifications and Performance

At rack scale, the LPX delivers the following computational capability (a per-chip breakdown is sketched after the list):

  • 315 PFLOPS of FP8 inference compute
  • 128 GB of total on-chip SRAM capacity
  • 40 PB/s on-chip SRAM bandwidth
  • 640 TB/s scale-up bandwidth (across 256 chips)
  • Deterministic execution optimized for low-latency serving and reduced jitter
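
For a sense of per-chip scale, these rack totals divide cleanly across the 256 chips. A minimal back-of-the-envelope sketch in Python (rack figures from the list above; the assumption of an even split across chips is ours, not a published per-chip specification):

```python
# Per-chip figures implied by the published rack totals. The rack-level
# numbers come from the spec list above; the even 256-way split is an
# assumption for illustration, not a published per-chip specification.

CHIPS_PER_RACK = 256

rack_fp8_pflops = 315      # FP8 inference compute, PFLOPS
rack_sram_gb = 128         # total on-chip SRAM, GB
rack_sram_bw_pbs = 40      # on-chip SRAM bandwidth, PB/s
rack_scaleup_tbs = 640     # scale-up bandwidth, TB/s

print(f"FP8 compute per chip : {rack_fp8_pflops / CHIPS_PER_RACK * 1000:.0f} TFLOPS")
print(f"SRAM per chip        : {rack_sram_gb / CHIPS_PER_RACK * 1024:.0f} MB")
print(f"SRAM BW per chip     : {rack_sram_bw_pbs / CHIPS_PER_RACK * 1000:.1f} TB/s")
print(f"Scale-up BW per chip : {rack_scaleup_tbs / CHIPS_PER_RACK:.1f} TB/s")
```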

Heterogeneous Inference Architecture

Rather than replacing Vera Rubin NVL72, the LPX complements it through a specialized division of labor: the LPX accelerates the latency-sensitive portions of token generation (specifically FFN and MoE expert execution), while Vera Rubin NVL72 handles prefill and decode attention operations. This heterogeneous approach (sketched in code after the list) enables:

  • Up to 35x higher inference throughput per megawatt compared to traditional architectures
  • Up to 10x more revenue opportunity for trillion-parameter models
  • Responsive interactive inference as concurrency rises and request shapes vary
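
To make the division of labor concrete, here is a rough sketch of one disaggregated decode step. The `gpu` and `lpx` handles and their `run` method are hypothetical placeholders rather than a real NVIDIA API; only the split itself (prefill and decode attention on Vera Rubin NVL72, FFN and MoE expert execution on the LPX) comes from the announcement:

```python
# Hypothetical sketch of a disaggregated decode step. Device handles and
# the `run` interface are illustrative stand-ins, not a real NVIDIA API.

def decode_step(layers, hidden, kv_cache, gpu, lpx):
    """Generate one token, routing each sublayer to its accelerator."""
    for i, layer in enumerate(layers):
        # Attention (and its KV cache) stays on the Vera Rubin GPU,
        # plausibly because long-context KV caches need HBM capacity.
        hidden = gpu.run(layer.attention, hidden, kv_cache[i])
        # Latency-sensitive FFN / MoE expert execution is offloaded to
        # the LPX, whose SRAM-resident execution is deterministic and
        # low-jitter per the announcement.
        hidden = lpx.run(layer.ffn_or_moe, hidden)
    return hidden
```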

Architecture Highlights

The LPX comprises 32 liquid-cooled 1U compute trays, each containing the following (a consistency check against the rack totals follows the list):

  • 8 Groq 3 LP30 chips
  • 4 GB of on-chip SRAM (aggregate across the tray's eight chips)
  • 1.2 PB/s SRAM bandwidth per tray
  • Direct chip-to-chip (C2C) links for tightly coupled, low-jitter communication
  • Support for up to 256 GB DRAM via fabric expansion and host CPU
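
These per-tray figures can be checked against the rack totals quoted earlier. A small Python consistency check (numbers from the article; the quoted 1.2 PB/s per tray is presumably a rounded 1.25 PB/s, since 32 trays must sum to the 40 PB/s rack figure):

```python
# Consistency check: per-tray figures from the list above versus the
# rack-level totals quoted in the spec section.

from dataclasses import dataclass

@dataclass
class ComputeTray:
    chips: int = 8             # Groq 3 chips per 1U tray
    sram_gb: float = 4.0       # aggregate on-chip SRAM per tray, GB
    sram_bw_pbs: float = 1.25  # SRAM bandwidth per tray (quoted as ~1.2 PB/s)

TRAYS_PER_RACK = 32
tray = ComputeTray()

assert TRAYS_PER_RACK * tray.chips == 256         # 256 chips per rack
assert TRAYS_PER_RACK * tray.sram_gb == 128.0     # 128 GB total SRAM
assert TRAYS_PER_RACK * tray.sram_bw_pbs == 40.0  # 40 PB/s total SRAM bandwidth
```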

The cableless, liquid-cooled design simplifies deployment and ensures deterministic latency characteristics critical for real-time AI applications.

Use Case Focus: Agentic AI at Speed of Thought

The LPX targets a new era of AI inference where generation speeds approach 1,000 tokens per second per user. At these speeds, AI systems can support continuous reasoning, simulation, and response loops—enabling "speed of thought" computing that powers coordinated multi-agent systems. This represents a shift from turn-based conversational AI toward real-time collaborative AI experiences that feel immediately responsive.
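
The arithmetic behind that target is simple: 1,000 tokens per second leaves roughly one millisecond of budget per token. A short illustrative calculation (the agent-loop shape is a hypothetical example, not a published benchmark):

```python
# Latency budget implied by the 1,000 tokens/s/user target. The agent-loop
# parameters below are illustrative, not measurements.

TOKENS_PER_SEC_PER_USER = 1_000
per_token_budget_ms = 1_000 / TOKENS_PER_SEC_PER_USER  # 1.0 ms per token

# Example: a 5-step reasoning loop emitting 200 tokens per step completes
# in one second, fast enough to feel like a single interactive response.
steps, tokens_per_step = 5, 200
loop_seconds = steps * tokens_per_step / TOKENS_PER_SEC_PER_USER

print(f"{per_token_budget_ms:.1f} ms per token; {loop_seconds:.1f} s agent loop")
```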

Deployment Path

LPX integrates with the NVIDIA MGX ETL rack architecture and aligns with the broader Vera Rubin platform. This lets data centers deploy a dedicated low-latency inference path alongside general-purpose Vera Rubin NVL72 GPUs within a single infrastructure framework.