Production-Grade Distributed Inference at Scale
NVIDIA Dynamo 1.0 is now available: a mature, production-grade distributed inference framework for deploying reasoning and generative AI models across multiple GPU nodes. The framework tames the complexity of orchestrating large-scale, multi-node AI deployments, delivering low-latency, high-throughput inference with results demonstrated in industry benchmarks such as MLPerf and SemiAnalysis InferenceMAX.
Performance and Real-World Adoption
Dynamo demonstrates 7x throughput improvements on NVIDIA Blackwell hardware when combined with disaggregated serving and wide expert-parallel strategies. The framework has seen significant real-world adoption, with early deployments at AstraZeneca, Baseten, ByteDance, CoreWeave, Crusoe, DigitalOcean, Gcore, Meituan, Pinterest, Tencent Cloud, Together AI, and Vultr. Major cloud providers including Alibaba Cloud, AWS, Google Cloud, Microsoft Azure, and Oracle Cloud Infrastructure have built native integrations into their managed Kubernetes environments.
Key Capabilities and Enhancements
Dynamo 1.0 supports leading open-source inference engines including SGLang, NVIDIA TensorRT LLM, and vLLM. Recent enhancements include:
- Agentic inference optimizations: Priority-based routing and cache pinning for improved efficiency in agentic AI workflows
- Multimodal acceleration: Disaggregated encode/prefill/decode, embedding cache, and multimodal KV routing for faster multimodal model inference
- Video generation support: Native integration for video-generation models
- ModelExpress: Delivers 7x faster startup through checkpoint restore and weight streaming with NVIDIA NVLink and NIXL
- Kubernetes orchestration: Grove API for topology-aware scheduling on NVIDIA GB300 NVL72
- Resilient inference: Layered fault detection, request cancellation, and migration capabilities
- KV Block Manager: Pip-installable module with object storage integration for flexible cache management
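Two of the agentic-inference mechanisms above can be sketched in miniature. The toy classes below (hypothetical names, not Dynamo's API) illustrate the ideas only: a priority queue that serves latency-sensitive requests first, and an LRU cache whose pinned entries, such as a shared system prompt, survive eviction.

```python
import heapq
from collections import OrderedDict

class PinnableKVCache:
    """Toy LRU cache where entries can be pinned to survive eviction.

    Conceptual sketch only: Dynamo's KV Block Manager is block-granular
    and tiers cache across GPU, host memory, and object storage.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # key -> value, in LRU order
        self.pinned = set()

    def put(self, key, value, pin=False):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if pin:
            self.pinned.add(key)
        # Evict the least-recently-used *unpinned* entry when over capacity.
        while len(self.entries) > self.capacity:
            victim = next((k for k in self.entries if k not in self.pinned), None)
            if victim is None:   # everything is pinned; allow temporary overflow
                break
            del self.entries[victim]

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)
            return self.entries[key]
        return None

class PriorityRouter:
    """Serve the highest-priority request first (lower number = higher priority)."""
    def __init__(self):
        self._heap = []
        self._seq = 0   # tie-breaker keeps FIFO order within one priority level

    def submit(self, priority, request):
        heapq.heappush(self._heap, (priority, self._seq, request))
        self._seq += 1

    def next_request(self):
        return heapq.heappop(self._heap)[2]
```

In an agentic workflow, pinning keeps the prefix blocks of a long-running agent's context resident while short-lived tool-call requests come and go, and priority routing lets interactive turns overtake background tasks.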
Developer Access and Integration
The framework is pip-installable and integrates into existing inference workflows. Modular components such as NIXL have been adopted across the inference ecosystem: llm-d, TensorRT LLM, SGLang, and vLLM all use NIXL to accelerate KV cache transfers between GPUs.
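A minimal install sketch. The package names and extras below are assumptions for illustration; consult the official Dynamo install guide for the exact distribution and the extra matching your engine and platform.

```shell
# Hypothetical package names -- verify against the Dynamo documentation.
pip install "ai-dynamo[vllm]"   # core framework plus a vLLM backend
pip install nixl                # standalone library for KV cache transfers
```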