Advanced Edge Reasoning Architecture
NVIDIA's latest TensorRT Edge-LLM release introduces critical infrastructure for deploying sophisticated AI on embedded platforms. The update addresses a core challenge: delivering high-fidelity reasoning, real-time multimodal interaction, and trajectory planning within the strict power and latency budgets of autonomous vehicles and robots.
Mixture-of-Experts Support at Scale
The runtime now fully supports mixture-of-experts (MoE) architectures, specifically optimized for models like Qwen3 MoE. MoE models activate only a subset of expert parameters per token, allowing edge devices to access the reasoning capabilities of massive models while maintaining inference latency and compute footprint comparable to much smaller systems. This architectural shift is critical for deploying high-fidelity reasoning on NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor without exceeding power and latency budgets.
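The per-token expert routing described above can be sketched in a few lines. This is a toy illustration only: the layer sizes, expert count, and top-k value below are hypothetical and do not reflect Qwen3 MoE's actual configuration.

```python
# Minimal top-k mixture-of-experts routing sketch (illustrative sizes only).
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 64, 256        # hypothetical hidden and expert FFN widths
n_experts, top_k = 8, 2        # 8 experts, but only 2 are active per token

# Each expert is a small two-layer FFN: W_in (d_model x d_ff), W_out (d_ff x d_model).
experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,
            rng.standard_normal((d_ff, d_model)) * 0.02) for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_layer(x):
    """Route each token to its top-k experts; unselected experts do no work."""
    logits = x @ router                            # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # indices of selected experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        weights = np.exp(sel - sel.max())
        weights /= weights.sum()                   # softmax over selected experts
        for w, e in zip(weights, top[t]):
            w_in, w_out = experts[e]
            out[t] += w * (np.maximum(x[t] @ w_in, 0.0) @ w_out)  # ReLU FFN
    return out

tokens = rng.standard_normal((4, d_model))
y = moe_layer(tokens)

total_params = n_experts * 2 * d_model * d_ff
active_params = top_k * 2 * d_model * d_ff
print(y.shape, f"active fraction = {active_params / total_params:.2f}")
```

The key property is visible in the final lines: only `top_k / n_experts` of the expert parameters participate in any one token's forward pass, which is why compute per token stays close to that of a much smaller dense model.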
Hybrid Reasoning with Nemotron 2 Nano
The release introduces optimized support for NVIDIA Nemotron 2 Nano, enabling System 2 reasoning directly on embedded chipsets. This model uses a novel Hybrid Mamba-2-Transformer architecture that reduces memory overhead from KV cache storage while maintaining precision from attention layers. TensorRT Edge-LLM provides specialized kernels that accelerate these hybrid layers, enabling complex retrieval-augmented generation (RAG) pipelines and agentic workflows on edge devices.
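A back-of-the-envelope calculation shows why replacing most attention layers with Mamba-2 layers cuts KV-cache memory. The layer counts and head sizes below are illustrative assumptions, not Nemotron 2 Nano's published configuration.

```python
# KV-cache memory: pure transformer vs. hybrid Mamba-2/Transformer stack.
# All sizes here are hypothetical, chosen only to make the scaling visible.
def kv_cache_bytes(n_attn_layers, seq_len, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):
    # Each attention layer caches K and V: seq_len x n_kv_heads x head_dim each.
    return n_attn_layers * 2 * seq_len * n_kv_heads * head_dim * bytes_per_elem

seq_len = 32_768
pure_transformer = kv_cache_bytes(n_attn_layers=32, seq_len=seq_len)
hybrid = kv_cache_bytes(n_attn_layers=4, seq_len=seq_len)  # rest are Mamba-2

# Mamba-2 layers keep a fixed-size recurrent state that does not grow with
# sequence length, so only the remaining attention layers contribute here.
print(f"pure: {pure_transformer / 2**30:.1f} GiB, hybrid: {hybrid / 2**30:.2f} GiB")
```

Because the recurrent-state memory of the Mamba-2 layers is constant in context length, the hybrid stack's cache grows at a small fraction of the pure-transformer rate, which is what makes long-context RAG pipelines feasible on embedded memory budgets.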
Multimodal Interaction & Embodied AI
Native support for Qwen3-TTS and Qwen3-ASR models enables low-latency end-to-end voice dialogue with a Thinker-Talker framework. Additionally, Cosmos Reason 2 integration provides advanced spatio-temporal reasoning, 3D localization, and long-context processing—critical capabilities for humanoid robotics and embodied agents operating at the edge.
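The latency benefit of a Thinker-Talker arrangement comes from streaming: speech synthesis begins on the first text chunk rather than after the full reply. The sketch below uses stand-in stubs, not the actual Qwen3-TTS or Qwen3-ASR APIs, purely to show the overlapped pipeline shape.

```python
# Toy "Thinker-Talker" streaming loop. Both components are stand-in stubs.
def thinker(prompt):
    """Stand-in LLM: yields the reply one text chunk at a time."""
    for chunk in ["Sure, ", "turning ", "left ", "ahead."]:
        yield chunk

def talker(text_chunks):
    """Stand-in TTS: converts each text chunk to an 'audio' segment immediately."""
    for chunk in text_chunks:
        yield f"<audio:{chunk.strip()}>"

transcript = "navigate to the next exit"   # pretend ASR output
audio = list(talker(thinker(transcript)))
print(audio)
```

Because `talker` consumes the `thinker` generator lazily, the first audio segment is available after one chunk of text, which is what keeps perceived dialogue latency low.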
Production-Ready VLA Models
The update includes NVIDIA Alpamayo integration for end-to-end trajectory planning in autonomous vehicles, combining flow-matching trajectory decoding, explainable decision-making with multicamera context, and FP8-accelerated Vision Transformers. This marks a shift from modular software stacks to reasoning-based vision-language-action (VLA) models designed for production autonomous systems.
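Flow-matching decoding generates a trajectory by integrating a learned velocity field from a noise sample toward the data distribution. The sketch below is a minimal stand-in: the "network" simply points at a fixed target path, whereas a real planner would condition the field on multicamera features; all shapes and step counts are illustrative.

```python
# Minimal flow-matching decoding sketch: Euler-integrate a velocity field
# from Gaussian noise toward a trajectory. velocity_field is a stand-in for
# a learned model v_theta(x, t); the straight-line target is illustrative.
import numpy as np

horizon = 8                      # number of (x, y) waypoints to decode
target = np.stack([np.linspace(0, 7, horizon), np.zeros(horizon)], axis=-1)

def velocity_field(x, t):
    """Stand-in for the learned model: flows x straight toward the target."""
    return (target - x) / (1.0 - t + 1e-6)   # optimal-transport-style path

rng = np.random.default_rng(0)
x = rng.standard_normal((horizon, 2))        # start from Gaussian noise
n_steps = 10
for i in range(n_steps):
    t = i / n_steps
    x = x + (1.0 / n_steps) * velocity_field(x, t)   # Euler step

print(np.abs(x - target).max())              # decoded path sits on the target
```

Straight-line (optimal-transport-style) velocity fields are popular in flow matching precisely because they can be decoded accurately in very few integration steps, which matters on a latency budget.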
Developer Tools & Deployment
All capabilities are available through NVIDIA's TensorRT Edge-LLM open-source runtime on GitHub, providing developers with C++ inference optimizations for LLMs and vision language models. The release targets developers building next-generation autonomous vehicles, in-cabin AI assistants, and robotic dialogue systems.