NVIDIA TensorRT LLM adds AutoDeploy beta, automating model-to-inference optimization pipeline
· feature · sdk · api · performance · developer.nvidia.com

AutoDeploy Shifts LLM Deployment from Manual to Compiler-Driven

NVIDIA has released AutoDeploy as a beta feature in TensorRT LLM, addressing a critical bottleneck in LLM deployment. Traditionally, deploying a new model architecture has required significant manual engineering: adding KV cache management, weight sharding, operation fusion, and hardware-specific optimizations by hand. AutoDeploy eliminates this manual work by automatically extracting computation graphs from off-the-shelf PyTorch models and applying optimizations through a compiler pipeline.

Key Capabilities

AutoDeploy provides several core capabilities that accelerate deployment workflows:

  • Seamless model translation: Automatically converts Hugging Face models into TensorRT LLM graphs without requiring engineers to rewrite model code
  • Separation of concerns: Keeps PyTorch as the canonical model definition while delegating inference-specific optimizations to the compiler
  • Automated optimizations: Applies sharding, quantization, KV cache insertion, attention fusion, CUDA graph optimization, and related transformations
  • Broad model support: Currently supports over 100 text-to-text LLMs with early support for vision language models (VLMs) and state space models (SSMs)
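In practice, the beta is exposed through a Python `LLM` entry point in TensorRT LLM. The sketch below is illustrative only: the import path, constructor parameters, and model ID are assumptions based on the beta documentation and may change between releases, and running it requires an NVIDIA GPU plus the model weights.

```python
# Hedged usage sketch of the AutoDeploy beta. The import path and
# parameters below are assumptions and may differ across releases.
from tensorrt_llm._torch.auto_deploy import LLM

# Point AutoDeploy at an off-the-shelf Hugging Face checkpoint;
# graph capture, sharding, and KV cache insertion are applied
# automatically -- no model-code rewrite is needed.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any supported HF model card
    world_size=2,                              # GPUs to shard across
)

# Inference then goes through the ordinary LLM API.
output = llm.generate("Explain KV caching in one sentence.")
print(output)
```

Because the model definition stays in plain PyTorch, the same checkpoint remains usable for training or eager-mode debugging; only the inference path goes through the compiler.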

Technical Architecture

AutoDeploy uses PyTorch's torch.export API to capture models as standardized Torch graphs, then runs automated transformation passes that pattern-match and canonicalize common building blocks such as mixture of experts (MoE), attention layers, RoPE, and state-space components. The compiler then emits an inference-optimized graph, delegating runtime concerns to the TensorRT LLM runtime.

Impact and Use Cases

This approach is particularly valuable for the "long tail" of models—new research architectures, internal variants, and fast-moving open source models where manual optimization is impractical. AutoDeploy enables deployment at launch with competitive baseline performance, while preserving a path for incremental optimization as models mature. NVIDIA demonstrated this capability by using AutoDeploy to support NVIDIA Nemotron models at their release.