New Capabilities for Edge AI
NVIDIA TensorRT Edge-LLM, the company's high-performance C++ inference runtime for LLMs and vision-language models on embedded platforms, has been significantly expanded to support the advanced architectures needed for autonomous vehicles and robotics. The latest release introduces:
- Mixture of Experts (MoE) support: Enables efficient deployment of massive models like Qwen3 MoE by activating only a subset of expert parameters per token, giving edge devices access to large-model reasoning capabilities while staying within strict power and latency budgets.
- Cosmos Reason 2 integration: NVIDIA's open planning model for physical AI enables advanced spatio-temporal reasoning, 3D localization, and long-context processing for humanoid robotics and embodied agents.
- Native multimodal interaction: Optimized Qwen3-TTS and Qwen3-ASR models enable low-latency voice dialogue and end-to-end speech processing directly on edge hardware.
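The efficiency argument behind MoE is that only k of n expert blocks run for each token, so per-token compute and weight traffic scale with k rather than with total parameter count. A minimal sketch of top-k routing (illustrative only; this is the generic technique, not the TensorRT Edge-LLM API, and the toy scalar experts stand in for full feed-forward blocks):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, router_logits, k=2):
    """Route one token to its top-k experts and mix their outputs."""
    probs = softmax(router_logits)
    # Select the k highest-probability experts; the rest stay idle.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    return sum(probs[i] / norm * experts[i](token) for i in top)

# Toy experts: scalar functions standing in for per-expert FFN blocks.
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x, lambda x: x * x]
out = moe_forward(3.0, experts, router_logits=[0.1, 2.0, -1.0, 1.5], k=2)
```

With k=2 of 4 experts, only half the expert weights are touched per token; the output is a convex combination of the two selected experts' results.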
System 2 Reasoning on Embedded Hardware
The runtime now fully supports NVIDIA Nemotron 2 Nano, introducing hybrid reasoning capabilities on embedded chipsets including NVIDIA DRIVE Thor and Jetson Thor. This model uses a novel Hybrid Mamba-2-Transformer architecture that reduces the memory footprint of KV cache storage while preserving accuracy through its remaining attention layers.
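The memory benefit follows from simple arithmetic: only attention layers hold a KV cache that grows linearly with context length, while Mamba-2 layers keep a fixed-size recurrent state. A back-of-envelope sizing sketch (the layer counts and dimensions below are illustrative assumptions, not Nemotron 2 Nano's actual configuration):

```python
def kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # 2x for the K and V tensors, per attention layer, in fp16/bf16.
    return 2 * attn_layers * kv_heads * head_dim * seq_len * dtype_bytes

seq_len = 128 * 1024  # a long-context window
full = kv_cache_bytes(attn_layers=32, kv_heads=8, head_dim=128, seq_len=seq_len)
hybrid = kv_cache_bytes(attn_layers=6, kv_heads=8, head_dim=128, seq_len=seq_len)
print(f"all-attention: {full / 2**30:.1f} GiB, hybrid: {hybrid / 2**30:.1f} GiB")
# prints "all-attention: 16.0 GiB, hybrid: 3.0 GiB"
```

If a hybrid stack keeps 6 of 32 layers as attention, the context-dependent cache shrinks by roughly that same 6/32 ratio, which is what makes massive context windows feasible within embedded memory budgets.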
TensorRT Edge-LLM provides optimized kernels for these hybrid layers, enabling developers to:
- Deploy complex retrieval-augmented generation (RAG) pipelines with massive context windows
- Build agentic workflows that leverage reasoning capabilities
- Support dynamic "thinking" modes that shift seamlessly between deep reasoning and immediate conversational responses
This is particularly valuable for in-cabin AI assistants and robotic dialogue agents that need both sophisticated reasoning and real-time responsiveness.
Developer Impact
The expanded TensorRT Edge-LLM runtime enables developers to build next-generation autonomous machines with advanced reasoning, perception, and interaction capabilities while respecting the strict power, latency, and memory constraints of edge platforms. Support for open model families like Nemotron provides production-ready options for enterprise deployments.