NVIDIA launches Groq 3 LPX inference accelerator for low-latency agentic AI, claims 35x throughput per megawatt advantage
· release · platform · performance · integration · developer.nvidia.com ↗

NVIDIA Introduces Groq 3 LPX Inference Accelerator

NVIDIA has unveiled the Groq 3 LPX, a new rack-scale inference accelerator co-designed to operate alongside the NVIDIA Vera Rubin NVL72 rack-scale GPU system for next-generation agentic AI. The platform is optimized for low-latency, large-context inference workloads where predictable per-token generation speed is critical to interactive AI experiences.

Key Architecture and Performance

The LPX system is built around 256 interconnected Groq 3 LPU accelerators organized into 32 liquid-cooled compute trays. The architecture emphasizes deterministic, compiler-orchestrated execution to minimize inference jitter and deliver stable latency even under high concurrency:

  • 315 PFLOPS of FP8 inference compute at rack scale
  • 128 GB total SRAM capacity with 40 PB/s on-chip SRAM bandwidth
  • 640 TB/s scale-up (chip-to-chip) bandwidth for coordinated rack-level execution
  • Up to 35x higher inference throughput per megawatt compared to prior solutions
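As a back-of-envelope illustration of why the 40 PB/s SRAM bandwidth figure matters for decode latency, the sketch below computes the memory-bandwidth upper bound on token generation, under the standard assumption that each decoded token streams the full weight set once. The 70B-parameter FP8 model size is a hypothetical example, not a figure from the announcement.

```python
def decode_bound_tokens_per_s(mem_bandwidth_bps: float, weight_bytes: float) -> float:
    """Memory-bandwidth upper bound on decode throughput.

    Assumes decode is bandwidth-limited and every generated token
    requires streaming the full model weights once from SRAM.
    """
    return mem_bandwidth_bps / weight_bytes

SRAM_BW = 40e15        # 40 PB/s rack-level SRAM bandwidth, from the spec above
WEIGHTS_FP8 = 70e9     # hypothetical 70B-parameter model at 1 byte/param (FP8)

bound = decode_bound_tokens_per_s(SRAM_BW, WEIGHTS_FP8)
print(f"upper bound: {bound:,.0f} tokens/s across the rack")
```

Real sustained throughput sits well below this ceiling once attention, activations, and inter-chip coordination are accounted for, but the bound shows why keeping weights resident in SRAM rather than HBM changes the latency picture.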

Heterogeneous Serving Strategy

LPX is designed to work in tandem with Vera Rubin NVL72, creating a split-brain inference architecture:

  • Vera Rubin NVL72 handles prefill and decode attention (flexible, general-purpose)
  • Groq 3 LPX handles latency-sensitive FFN (feed-forward network) and MoE (mixture-of-experts) decode operations
  • NVIDIA Dynamo orchestrates request routing and disaggregated serving between the two systems

This heterogeneous approach allows data centers to sustain high overall AI factory throughput while delivering the sub-100ms tail latencies required for interactive and agentic AI applications.
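The split described above can be sketched as a toy phase-based router. The pool names, request shape, and `route` function here are illustrative assumptions only; they are not NVIDIA Dynamo's actual API.

```python
from dataclasses import dataclass

# Hypothetical pool labels for the two halves of the split architecture.
GPU_POOL = "vera_rubin_nvl72"   # prefill + decode attention (flexible)
LPU_POOL = "groq3_lpx"          # latency-sensitive FFN / MoE decode

@dataclass
class InferenceRequest:
    request_id: str
    phase: str   # "prefill", "decode_attention", or "decode_ffn"

def route(req: InferenceRequest) -> str:
    """Send each execution phase to the hardware pool suited for it."""
    if req.phase in ("prefill", "decode_attention"):
        return GPU_POOL
    if req.phase == "decode_ffn":
        return LPU_POOL
    raise ValueError(f"unknown phase: {req.phase!r}")
```

In a real disaggregated deployment the orchestrator would also manage KV-cache transfer and batching between pools; this sketch only captures the routing decision.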

Use Cases and Deployment

The system targets emerging workloads where speed of thought matters:

  • Multi-agent systems requiring coordinated reasoning across multiple AI agents
  • Long-context inference with stable latency across large token windows
  • High-concurrency serving where responsive per-token generation is a competitive advantage
  • Speculative decoding for further acceleration of token generation
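Speculative decoding, the last item above, can be illustrated with a minimal draft-and-verify loop. This is a generic greedy sketch of the technique, not NVIDIA-specific code; `draft_next` and `target_next` stand in for a small fast model and the large target model.

```python
def speculative_decode_step(draft_next, target_next, context, k=4):
    """One speculative step: the draft proposes k tokens, the target
    accepts the longest agreeing prefix plus its own correction."""
    # Draft model proposes k tokens autoregressively.
    proposed, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)

    # Target model verifies: keep tokens while it agrees with the draft.
    accepted, ctx = [], list(context)
    for tok in proposed:
        target_tok = target_next(ctx)
        if target_tok == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            # On the first mismatch, take the target's token and stop.
            accepted.append(target_tok)
            break
    return accepted

# Toy deterministic "models" that emit successive characters of a string.
TARGET, DRAFT = "hello world", "hello there"
target_next = lambda ctx: TARGET[len(ctx)]
draft_next = lambda ctx: DRAFT[len(ctx)]

print(speculative_decode_step(draft_next, target_next, [], k=4))
```

Because verification of k draft tokens can be batched into one target pass, each step yields between 1 and k tokens for roughly one target-model invocation, which is why the technique pairs well with a low-latency decode engine.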

NVIDIA positions LPX as a natural complement to its broader Vera Rubin platform, deployable within existing MGX ETL rack infrastructure for seamless integration into current data center footprints.