NVIDIA launches Groq 3 LPX inference accelerator with 315 PFLOPS compute for agentic AI workloads
· release · hardware · platform · performance · model · developer.nvidia.com ↗

NVIDIA Groq 3 LPX: Low-Latency Inference for Agentic Systems

NVIDIA has introduced the Groq 3 LPX, a purpose-built rack-scale inference accelerator designed to address the emerging demands of agentic AI systems that require fast, predictable token generation at scale. The system integrates 256 interconnected NVIDIA Groq 3 LPU accelerators and represents a specialized counterpart to the Vera Rubin NVL72 GPU platform.

Key Performance Specifications

The LPX's headline rack-level specifications:

  • 315 PFLOPS of FP8 compute at rack scale
  • 40 PB/s on-chip SRAM bandwidth
  • 640 TB/s scale-up bandwidth
  • 128 GB total SRAM capacity across the system
  • 256 Groq 3 LP30 accelerator chips per rack

Each individual compute tray (containing 8 LP30 chips) provides 9.6 PFLOPS of AI inference compute, 1.2 PB/s SRAM bandwidth, and 20 TB/s scale-up bandwidth, supporting up to 384 GB of accessible DRAM through fabric expansion logic and host CPU connections.
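The per-tray and per-rack figures above can be cross-checked with simple arithmetic. A short sketch, using only numbers stated in the article (32 trays of 8 LP30 chips each); the derived per-chip SRAM figure is an inference from the quoted totals, not a published spec:

```python
# Sanity-check: relating the quoted rack-level figures to per-tray specs.
# Tray count and chips-per-tray are stated in the article:
# 32 trays x 8 LP30 chips = 256 chips per rack.

TRAYS_PER_RACK = 32
CHIPS_PER_TRAY = 8

# Per-tray figures from the article
tray_pflops = 9.6          # FP8 inference compute per tray
tray_sram_bw_pbs = 1.2     # SRAM bandwidth per tray, PB/s
tray_scaleup_tbs = 20.0    # scale-up bandwidth per tray, TB/s

chips_per_rack = TRAYS_PER_RACK * CHIPS_PER_TRAY        # 256, matches the quoted figure
rack_scaleup_tbs = TRAYS_PER_RACK * tray_scaleup_tbs    # 640 TB/s, matches the quoted figure
rack_pflops = TRAYS_PER_RACK * tray_pflops              # 307.2, close to the quoted 315 PFLOPS
rack_sram_bw_pbs = TRAYS_PER_RACK * tray_sram_bw_pbs    # 38.4, close to the quoted 40 PB/s
sram_per_chip_mb = 128 * 1024 / chips_per_rack          # 512 MB of SRAM per chip (derived)

print(chips_per_rack, rack_scaleup_tbs, rack_pflops, sram_per_chip_mb)
```

The straight tray-multiples land slightly under the quoted rack totals for compute and SRAM bandwidth, which may reflect rounding or rack-level resources beyond the trays themselves.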

Architecture and Design Philosophy

The LPX uses a cableless, liquid-cooled design across 32 1U compute trays per rack. Each tray features direct chip-to-chip (C2C) links for low-latency communication within and across trays, designed to minimize coordination overhead and jitter—critical for interactive inference workloads.

The system is optimized for heterogeneous serving: LPX accelerates latency-sensitive portions of the decode loop (FFN and MoE expert execution), while Vera Rubin NVL72 GPUs handle prefill and decode attention. This division of labor enables simultaneous optimization for both interactive responsiveness and high aggregate token throughput.
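The division of labor described above can be sketched as a toy serving loop. This is an illustrative sketch only, not NVIDIA's actual software stack; the `Pool`, `run`, and `decode_step` names are invented for the example:

```python
# Hypothetical sketch of disaggregated serving: a GPU pool handles prefill
# and decode attention (it holds the KV cache), while an LPX pool runs the
# latency-sensitive FFN / MoE expert pass of each decode step.
from dataclasses import dataclass, field

@dataclass
class Pool:
    name: str
    handled: list = field(default_factory=list)

    def run(self, stage: str, token_id: int) -> str:
        # Record which stage of which token this pool executed.
        self.handled.append((stage, token_id))
        return f"{self.name}:{stage}:{token_id}"

def decode_step(token_id, gpu_pool, lpx_pool):
    # Attention over the KV cache stays on the GPU pool.
    attn = gpu_pool.run("decode_attention", token_id)
    # The FFN / MoE expert execution is offloaded to the LPX pool.
    ffn = lpx_pool.run("moe_ffn", token_id)
    return attn, ffn

gpu, lpx = Pool("rubin"), Pool("lpx")
gpu.run("prefill", 0)            # prompt processing runs once, on the GPU pool
for t in range(1, 4):            # three decode steps, split across both pools
    decode_step(t, gpu, lpx)
```

The point of the split is that each decode step's critical path touches the hardware best suited to it, rather than serializing all stages on one device type.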

Why This Matters

As AI moves toward agentic architectures that demand generation speeds approaching 1,000 tokens per second per user, the infrastructure requirements shift: interactive AI systems need deterministic, low-latency execution rather than raw throughput alone. The combination of LPX and Vera Rubin NVL72 enables data centers to deploy dual-purpose AI factories that support both high-concurrency batch processing and real-time multi-agent coordination.
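A back-of-envelope calculation shows why determinism matters at that speed. The stall numbers below are illustrative assumptions, not measurements from the article:

```python
# At ~1,000 tokens/s per user, the per-token budget is only 1 ms end to end,
# so even rare stalls visibly reduce effective generation speed.
target_tok_per_s = 1_000
budget_ms = 1_000 / target_tok_per_s               # 1.0 ms per decode step

# Hypothetical jitter profile: 99% of steps hit the budget, 1% stall for 5 ms.
step_times_ms = [1.0] * 99 + [5.0]
avg_ms = sum(step_times_ms) / len(step_times_ms)   # 1.04 ms average step time
effective_tok_per_s = 1_000 / avg_ms               # ~962 tokens/s delivered

print(budget_ms, round(effective_tok_per_s))
```

Even a 1% tail of modest stalls costs roughly 4% of delivered speed, which is why the article emphasizes minimizing coordination overhead and jitter over raw peak throughput.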