Production-Grade Distributed Inference at Scale
NVIDIA Dynamo 1.0 is now available: a mature, production-grade distributed inference framework for deploying reasoning and generative AI models across multiple GPU nodes. The framework tames the complexity of orchestrating large-scale, multi-node AI deployments, delivering low-latency, high-throughput inference with results demonstrated in industry benchmarks such as MLPerf and SemiAnalysis InferenceMAX.
Performance and Real-World Adoption
Dynamo demonstrates 7x throughput improvements on NVIDIA Blackwell hardware when combined with disaggregated serving and wide expert-parallel strategies. The framework has seen significant real-world adoption, with early deployments at AstraZeneca, Baseten, ByteDance, CoreWeave, Crusoe, DigitalOcean, Gcore, Meituan, Pinterest, Tencent Cloud, Together AI, and Vultr. Major cloud providers including Alibaba Cloud, AWS, Google Cloud, Microsoft Azure, and Oracle Cloud Infrastructure have built native integrations into their managed Kubernetes environments.
Key Capabilities and Enhancements
Dynamo 1.0 supports leading open-source inference engines including SGLang, NVIDIA TensorRT LLM, and vLLM. Recent enhancements include:
- Agentic inference optimizations: Priority-based routing and cache pinning for improved efficiency in agentic AI workflows
- Multimodal acceleration: Disaggregated encode/prefill/decode, embedding cache, and multimodal KV routing for faster multimodal model inference
- Video generation support: Native integration for video-generation models
- ModelExpress: Delivers 7x faster startup through checkpoint restore and weight streaming with NVIDIA NVLink and NIXL
- Kubernetes orchestration: Grove API for topology-aware scheduling on NVIDIA GB300 NVL72
- Resilient inference: Layered fault detection, request cancellation, and migration capabilities
- KV Block Manager: Pip-installable module with object storage integration for flexible cache management
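Two of the agentic-inference mechanisms above can be sketched in miniature. The toy classes below (hypothetical names, not Dynamo's API) illustrate the ideas only: a priority queue that serves latency-sensitive requests first, and an LRU cache whose pinned entries, such as a shared system prompt, survive eviction.

```python
import heapq
from collections import OrderedDict

class PinnableKVCache:
    """Toy LRU cache where entries can be pinned to survive eviction.

    Conceptual sketch only: Dynamo's KV Block Manager is block-granular
    and tiers cache across GPU, host memory, and object storage.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # key -> value, in LRU order
        self.pinned = set()

    def put(self, key, value, pin=False):
        self.entries[key] = value
        self.entries.move_to_end(key)
        if pin:
            self.pinned.add(key)
        # Evict the least-recently-used *unpinned* entry when over capacity.
        while len(self.entries) > self.capacity:
            victim = next((k for k in self.entries if k not in self.pinned), None)
            if victim is None:   # everything is pinned; allow temporary overflow
                break
            del self.entries[victim]

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)
            return self.entries[key]
        return None

class PriorityRouter:
    """Serve the highest-priority request first (lower number = higher priority)."""
    def __init__(self):
        self._heap = []
        self._seq = 0   # tie-breaker keeps FIFO order within one priority level

    def submit(self, priority, request):
        heapq.heappush(self._heap, (priority, self._seq, request))
        self._seq += 1

    def next_request(self):
        return heapq.heappop(self._heap)[2]
```

In an agentic workflow, pinning keeps the prefix blocks of a long-running agent's context resident while short-lived tool-call requests come and go, and priority routing lets interactive turns overtake background tasks.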
Developer Access and Integration
The framework is pip-installable and integrates into existing inference workflows. Modular components such as NIXL have been adopted across the inference ecosystem: llm-d, TensorRT LLM, SGLang, and vLLM all use NIXL to accelerate KV cache transfers between GPUs.
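A minimal install sketch. The package names and extras below are assumptions for illustration; consult the official Dynamo install guide for the exact distribution and the extra matching your engine and platform.

```shell
# Hypothetical package names -- verify against the Dynamo documentation.
pip install "ai-dynamo[vllm]"   # core framework plus a vLLM backend
pip install nixl                # standalone library for KV cache transfers
```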