NVIDIA
NVIDIA Dynamo 1.0 achieves 7x throughput boost for multi-node AI inference at scale
· release · feature · platform · performance · integration · developer.nvidia.com

Production-Grade Distributed Inference at Scale

NVIDIA Dynamo 1.0 is now available as a mature, production-grade distributed inference framework designed for deploying reasoning and generative AI models across multiple GPU nodes. The framework addresses the complexity of orchestrating large-scale, multi-node AI deployments by providing low-latency, high-throughput inference capabilities with proven results in trusted benchmarks like MLPerf and SemiAnalysis InferenceX.

Performance and Real-World Adoption

Dynamo demonstrates 7x throughput improvements on NVIDIA Blackwell hardware when combined with disaggregated serving and wide expert parallel strategies. The framework has achieved significant real-world adoption, with early deployments at AstraZeneca, Baseten, ByteDance, CoreWeave, Crusoe, DigitalOcean, Gcore, Meituan, Pinterest, Tencent Cloud, Together AI, and Vultr. Major cloud providers including Alibaba Cloud, AWS, Google Cloud, Microsoft Azure, and Oracle Cloud Infrastructure have built native integrations into their managed Kubernetes environments.

Key Capabilities and Enhancements

Dynamo 1.0 supports leading open-source inference engines, including SGLang, NVIDIA TensorRT-LLM, and vLLM. Recent enhancements include:

  • Agentic inference optimizations: Priority-based routing and cache pinning for improved efficiency in agentic AI workflows
  • Multimodal acceleration: Disaggregated encode/prefill/decode, embedding cache, and multimodal KV routing for faster multimodal model inference
  • Video generation support: Native integration for video-generation models
  • ModelExpress: Delivers 7x faster startup through checkpoint restore and weight streaming with NVIDIA NVLink and NIXL
  • Kubernetes orchestration: Grove API for topology-aware scheduling on NVIDIA GB300 NVL72
  • Resilient inference: Layered fault detection, request cancellation, and migration capabilities
  • KV Block Manager: Pip-installable module with object storage integration for flexible cache management
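To make the Kubernetes orchestration item above more concrete, the sketch below outlines what a disaggregated prefill/decode deployment can look like. The resource kind, `apiVersion`, and field names are illustrative assumptions in the style of Dynamo's operator resources, not a verbatim schema; the exact CRD shape should be taken from the official Dynamo Kubernetes documentation.

```yaml
# Hypothetical sketch of a disaggregated Dynamo deployment on Kubernetes.
# Field names and apiVersion are assumptions for illustration only.
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: llm-disagg
spec:
  services:
    Frontend:            # OpenAI-compatible entry point
      replicas: 1
    PrefillWorker:       # handles the prompt-processing (prefill) phase
      replicas: 2
      resources:
        limits:
          gpu: "1"
    DecodeWorker:        # handles the token-generation (decode) phase
      replicas: 4
      resources:
        limits:
          gpu: "1"
```

Splitting prefill and decode into separately scaled worker pools is the core idea behind disaggregated serving: the two phases have different compute and memory profiles, so scheduling them independently improves GPU utilization.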

Developer Access and Integration

The framework is available for pip installation and is designed for seamless integration into existing inference workflows. Modular components like NIXL have been widely adopted across the inference ecosystem, including llm-d, TensorRT-LLM, SGLang, and vLLM, for accelerating KV cache transfers between GPUs.
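As a minimal getting-started sketch: the PyPI package name (`ai-dynamo`), the backend extra, and the module entry points shown here are assumptions to verify against the current Dynamo documentation, and running the workers requires a GPU host.

```shell
# Hedged quickstart sketch; package name and entry points are assumptions
# to check against the official Dynamo docs.

# Install Dynamo with a backend extra (vLLM shown here).
pip install "ai-dynamo[vllm]"

# Start the OpenAI-compatible frontend.
python -m dynamo.frontend --http-port 8000 &

# Start a vLLM-backed worker serving a model.
python -m dynamo.vllm --model Qwen/Qwen2.5-0.5B-Instruct &

# Send a test request to the OpenAI-compatible endpoint.
curl localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-0.5B-Instruct",
       "messages": [{"role": "user", "content": "Hello"}]}'
```

Because the frontend speaks the OpenAI API, existing clients and SDKs can point at the Dynamo endpoint without code changes.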