
NVIDIA Groq 3 LPX: A New Category of Inference Accelerator

NVIDIA has introduced the Groq 3 LPX, a rack-scale inference accelerator purpose-built for the emerging demands of agentic AI systems. Co-designed alongside the Vera Rubin NVL72 rack-scale GPU system, LPX targets the latency-sensitive portions of LLM inference, specifically FFN and MoE expert execution during token generation, while Rubin GPUs handle prefill and decode attention. This heterogeneous architecture lets data centers deliver responsive interactive AI without sacrificing aggregate AI factory throughput.

Performance and Architecture

The LPX system is built around 256 interconnected Groq 3 LPU accelerators delivering the following specifications (a rough per-chip breakdown follows the list):

  • 315 PFLOPS of FP8 compute at rack scale
  • 40 PB/s on-chip SRAM bandwidth per chip
  • 640 TB/s chip-to-chip bandwidth at rack scale
  • 128 GB total SRAM capacity across the system
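
Dividing the rack-scale figures evenly across the 256 LPUs gives a rough per-chip picture. This is a back-of-the-envelope sketch: the even split is an assumption, not a published per-chip specification, and the 40 PB/s SRAM bandwidth figure is already quoted per chip.

    # Back-of-the-envelope per-chip figures for the 256-LPU rack, assuming
    # the rack-scale numbers divide evenly across chips (an assumption, not
    # a published spec). The 40 PB/s SRAM bandwidth is already per chip.
    NUM_LPUS = 256

    rack_fp8_pflops = 315   # FP8 compute at rack scale (PFLOPS)
    rack_c2c_tbps = 640     # chip-to-chip bandwidth at rack scale (TB/s)
    rack_sram_gb = 128      # total SRAM capacity across the system (GB)

    print(f"FP8 per chip:  {rack_fp8_pflops / NUM_LPUS * 1000:.0f} TFLOPS")  # ~1230
    print(f"C2C per chip:  {rack_c2c_tbps / NUM_LPUS:.1f} TB/s")             # 2.5
    print(f"SRAM per chip: {rack_sram_gb / NUM_LPUS * 1024:.0f} MB")         # 512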

The architecture emphasizes deterministic, compiler-orchestrated execution with explicit data movement and high on-chip bandwidth to minimize inference jitter and deliver stable, predictable per-token latency even under high concurrency.
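
To make "deterministic, compiler-orchestrated" concrete, the toy model below fixes every compute step and data movement at compile time; because nothing is arbitrated at run time, latency is identical on every run. The op names and cycle counts are invented for illustration, not real LPU instructions.

    # Toy model of static, compiler-orchestrated scheduling. Op names and
    # cycle counts are invented for illustration, not real LPU instructions.
    # Because the schedule is fixed at compile time, there are no caches,
    # arbiters, or replays to add jitter: latency is identical on every run.
    SCHEDULE = [
        ("load_weights_to_sram", 120),   # explicit data movement, fixed cost
        ("matmul_ffn_up",        300),
        ("activation",            40),
        ("matmul_ffn_down",      280),
        ("stream_to_neighbor",    60),   # deterministic chip-to-chip hop
    ]

    def run(schedule):
        cycle = 0
        for _op, cost in schedule:
            cycle += cost                # no dynamic stalls: cost is known statically
        return cycle

    assert run(SCHEDULE) == run(SCHEDULE) == 800  # same latency every run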

Why This Matters for Agentic AI

As language models approach 1,000 tokens per second per user (a budget of roughly one millisecond per token), AI systems are shifting from turn-based conversation toward "speed of thought" computing, where agents reason, simulate, and respond continuously. Multi-agent coordination amplifies this demand further: groups of coordinated agents accomplish far more than individual systems, much as human collective intelligence outperforms any single contributor.

The combination of Vera Rubin NVL72 and LPX enables this workload through heterogeneous serving: NVIDIA's Dynamo orchestration layer classifies requests and routes prefill/attention work to Rubin GPUs while directing latency-sensitive decode operations to LPUs. This approach maintains high AI factory throughput while delivering the sub-millisecond tail latency essential for interactive, agentic experiences.
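
A minimal sketch of that routing decision is shown below. The pool names, request fields, and function are hypothetical; this illustrates the shape of the policy the article describes, not the actual Dynamo API.

    # Hypothetical phase-based router in the spirit of the policy described
    # above: prefill and attention to Rubin GPUs, latency-sensitive decode
    # (FFN / MoE expert execution) to LPX. All names are illustrative.
    from dataclasses import dataclass

    @dataclass
    class Request:
        phase: str   # "prefill" or "decode"
        op: str      # e.g. "attention", "ffn", "moe_experts"

    RUBIN_POOL = "rubin-nvl72"
    LPX_POOL = "groq3-lpx"

    def route(req: Request) -> str:
        # Prefill and all attention work stays on the GPU pool.
        if req.phase == "prefill" or req.op == "attention":
            return RUBIN_POOL
        # Token-generation FFN / MoE expert work goes to the LPUs.
        return LPX_POOL

    assert route(Request("prefill", "attention")) == RUBIN_POOL
    assert route(Request("decode", "attention")) == RUBIN_POOL
    assert route(Request("decode", "moe_experts")) == LPX_POOL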

Key Advantages

  • 35x higher inference throughput per megawatt compared to prior architectures
  • 10x more revenue opportunity for trillion-parameter model inference
  • Integrated with NVIDIA's MGX ETL rack architecture for common infrastructure deployment
  • Designed for long-context, multi-agent, and speculative decoding workloads (see the speculative decoding sketch after this list)
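
Speculative decoding, mentioned in the last item, is the draft-and-verify pattern in which a small draft model proposes several tokens and the large target model checks them, emitting the longest agreeing prefix. The sketch below is a minimal greedy-acceptance version with toy stand-in models, not the production algorithm.

    # Minimal greedy speculative decoding: a cheap draft model proposes k
    # tokens, the target model checks the same positions (one batched pass
    # in a real system), and the longest agreeing prefix is accepted.
    def speculative_step(target_next, draft_next, prefix, k=4):
        draft = list(prefix)
        proposed = []
        for _ in range(k):                 # draft k tokens autoregressively
            tok = draft_next(draft)
            proposed.append(tok)
            draft.append(tok)
        accepted = []
        ctx = list(prefix)
        for tok in proposed:               # accept until first disagreement
            if target_next(ctx) != tok:
                break
            accepted.append(tok)
            ctx.append(tok)
        if len(accepted) < k:              # always emit one target token
            accepted.append(target_next(ctx))
        return prefix + accepted

    # Toy stand-ins: the drafter goes wrong after two tokens.
    target = lambda ctx: ctx[-1] + 1
    drafter = lambda ctx: ctx[-1] + 1 if len(ctx) < 3 else 99
    print(speculative_step(target, drafter, [0]))  # [0, 1, 2, 3]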

Next Steps for Deployers

Organizations deploying agentic AI systems should evaluate LPX as a dedicated inference path within their Vera Rubin clusters. The heterogeneous architecture allows existing Rubin deployments to integrate LPX incrementally, reserving it for latency-critical decode operations while continuing to use Rubin for general-purpose training and throughput-focused inference.
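
One way to stage that incremental integration is a placement policy that prefers LPX for decode when a rack is present and falls back to the existing Rubin pool otherwise. This is a hypothetical sketch; the pool names and the has_lpx flag are assumptions, not a real deployment API.

    # Hypothetical incremental placement: prefer LPX for latency-critical
    # decode when a rack is present, otherwise fall back to the existing
    # Rubin pool. Pool names and the has_lpx flag are assumptions.
    def place(phase: str, has_lpx: bool) -> str:
        if phase == "decode" and has_lpx:
            return "groq3-lpx"       # latency-critical token generation
        return "rubin-nvl72"         # prefill, training, batch inference

    print(place("decode", has_lpx=False))   # rubin-nvl72 (before LPX lands)
    print(place("decode", has_lpx=True))    # groq3-lpx   (after integration)
    print(place("prefill", has_lpx=True))   # rubin-nvl72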