New Runtime Capabilities for Edge AI
NVIDIA TensorRT Edge-LLM, a high-performance C++ inference runtime for large language models (LLMs) and vision language models (VLMs) on embedded platforms, has been expanded to support advanced architectures critical for autonomous systems. The latest release adds mixture of experts (MoE) support: models such as Qwen3 MoE activate only a subset of expert parameters per token, so they run efficiently on edge devices while retaining the reasoning capabilities of much larger models.
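The per-token routing idea can be illustrated with a minimal sketch. This is a conceptual toy, not the TensorRT Edge-LLM API: the expert count, top-k value, and tensor shapes below are made-up illustration values.

```python
import numpy as np

# Conceptual top-k MoE routing sketch (illustrative only, not the
# TensorRT Edge-LLM API): a router scores every expert for each token,
# but only the top-k experts actually execute, so per-token compute
# scales with k rather than with the total number of experts.

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # total experts in the layer (toy value)
TOP_K = 2         # experts activated per token (toy value)
HIDDEN = 16       # hidden dimension (toy value)

# Toy expert weight matrices and a toy router.
experts = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.1 for _ in range(NUM_EXPERTS)]
router_w = rng.standard_normal((HIDDEN, NUM_EXPERTS)) * 0.1

def moe_forward(x):
    """Route one token through its top-k experts only."""
    logits = x @ router_w
    top = np.argsort(logits)[-TOP_K:]        # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected experts only
    # Only TOP_K of the NUM_EXPERTS expert matmuls run for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top)), top

token = rng.standard_normal(HIDDEN)
out, active = moe_forward(token)
print(f"activated experts {sorted(active.tolist())} of {NUM_EXPERTS}")
```

Total parameter count grows with `NUM_EXPERTS`, but per-token latency and compute track `TOP_K`, which is why MoE models fit the edge-compute budget the paragraph describes.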
Hybrid Reasoning and Context Windows
The update brings specialized support for NVIDIA Nemotron Nano 2, which features a hybrid Mamba-2-Transformer architecture that reduces the memory overhead of KV cache storage while preserving the accuracy of its attention layers. This lets developers deploy complex retrieval-augmented generation (RAG) pipelines and agentic workflows with massive context windows on edge hardware, a capability critical for in-cabin AI assistants and robotic dialogue agents.
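A back-of-envelope calculation shows why the hybrid design matters at long context. All numbers below (layer counts, head shapes, sequence length) are hypothetical illustration values, not Nemotron Nano 2's actual configuration: Mamba-2 layers carry a fixed-size recurrent state, so only the attention layers that remain pay a per-token KV cache cost.

```python
# Illustrative KV cache arithmetic (hypothetical shapes, not any real
# model's configuration). For each attention layer, the cache stores
# K and V tensors of shape [kv_heads, seq_len, head_dim].

def kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Factor of 2 covers both K and V; dtype_bytes=2 assumes FP16/BF16.
    return 2 * attn_layers * kv_heads * head_dim * seq_len * dtype_bytes

SEQ_LEN = 128_000            # a long context window
LAYERS = 56                  # hypothetical total layer count
KV_HEADS, HEAD_DIM = 8, 128  # hypothetical grouped-query attention shape

# Pure transformer: every layer contributes to the KV cache.
full = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)
# Hybrid: suppose only 6 attention layers are kept; Mamba-2 layers
# replace the rest and their state does not grow with seq_len.
hybrid = kv_cache_bytes(6, KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"all-attention KV cache: {full / 2**30:.1f} GiB")
print(f"hybrid KV cache:        {hybrid / 2**30:.1f} GiB")
```

Because cache size grows linearly with sequence length, trimming the number of attention layers is what makes massive context windows tractable within an edge device's memory budget.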
Multimodal and Planning Capabilities
New integrations include Qwen3-TTS and Qwen3-ASR models for native multimodal interaction, enabling end-to-end, low-latency voice dialogue on embedded platforms. Additionally, NVIDIA Cosmos Reason 2 provides advanced spatio-temporal reasoning, 3D localization, and long-context processing tailored for humanoid robotics and embodied agents. For autonomous vehicles, NVIDIA Alpamayo integration supports end-to-end trajectory planning with flow matching decoding and explainable decision-making across multi-camera inputs.
Target Platforms and Use Cases
These capabilities are optimized for NVIDIA DRIVE AGX Thor (in-vehicle computing) and NVIDIA Jetson Thor (embedded systems), enabling developers to build next-generation autonomous machines that operate within strict power and latency constraints while delivering high-fidelity reasoning and real-time interaction.