NVIDIA
NVIDIA TensorRT Edge-LLM adds MoE support, Cosmos Reason 2, and speech models for edge AI
· release · feature · model · platform · performance · sdk · developer.nvidia.com ↗

Efficient MoE Inference at the Edge

NVIDIA TensorRT Edge-LLM now fully supports mixture-of-experts (MoE) architectures, enabling models like Qwen3 MoE to run efficiently on embedded hardware. MoE models activate only a subset of expert parameters per token, allowing edge devices to access the reasoning capabilities of large models while maintaining the inference latency and power footprint of smaller ones. This is critical for autonomous vehicles and robotics operating under strict real-time and power constraints.
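The per-token expert selection that makes this latency/power trade-off possible can be sketched with generic top-k MoE routing. This is an illustrative sketch of the technique, not the TensorRT Edge-LLM API; all sizes and weights are hypothetical:

```python
import numpy as np

# Illustrative top-k MoE routing: a router scores each token against all
# experts, but only the top-k experts actually execute for that token.
rng = np.random.default_rng(0)

num_experts, top_k, d_model = 8, 2, 16
token = rng.standard_normal(d_model)

# Router: a linear layer producing one logit per expert (toy weights).
router_w = rng.standard_normal((num_experts, d_model))
logits = router_w @ token

# Keep the top-k experts and softmax-normalize only their logits.
top = np.argsort(logits)[-top_k:]
weights = np.exp(logits[top] - logits[top].max())
weights /= weights.sum()

# Only top_k of num_experts expert FFNs run for this token, so active
# parameters (and hence latency and power) scale with top_k, while the
# full model capacity remains num_experts FFNs.
experts = rng.standard_normal((num_experts, d_model, d_model))
output = sum(w * (experts[e] @ token) for w, e in zip(weights, top))

active_fraction = top_k / num_experts  # 0.25: a quarter of expert params active
```

With `top_k=2` of 8 experts, each token touches only a quarter of the expert parameters, which is why an MoE model can carry large-model capacity at near small-model inference cost.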

Hybrid Reasoning with Nemotron 2 Nano

The release introduces specialized support for NVIDIA Nemotron 2 Nano, which uses a novel Hybrid Mamba-2-Transformer architecture. This hybrid approach reduces KV cache memory overhead while maintaining high-fidelity attention-based precision, enabling complex retrieval-augmented generation (RAG) and agentic workflows on embedded platforms like NVIDIA DRIVE AGX Thor and Jetson Thor. TensorRT Edge-LLM provides optimized kernels that accelerate these hybrid layers, making System 2 reasoning directly available at the edge.
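The KV-cache saving from a hybrid stack comes from simple arithmetic: only attention layers cache keys and values per token, while Mamba-2 blocks keep a fixed-size recurrent state independent of sequence length. A back-of-the-envelope sketch, using hypothetical layer counts and head sizes (not Nemotron 2 Nano's actual configuration):

```python
# Illustrative KV-cache arithmetic for a hybrid Mamba-Transformer.
# All sizes below are hypothetical, chosen only to show the scaling.

def kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len, dtype_bytes=2):
    # Each attention layer caches one key and one value vector per token
    # per KV head, in a dtype_bytes-wide format (2 bytes for FP16).
    return 2 * attn_layers * kv_heads * head_dim * seq_len * dtype_bytes

seq_len = 32_768  # long-context RAG / agentic workload

# Pure-Transformer baseline: every layer pays the per-token KV cost.
full = kv_cache_bytes(attn_layers=32, kv_heads=8, head_dim=128, seq_len=seq_len)

# Hybrid: most layers are Mamba-2 blocks with constant-size state, so
# only the few remaining attention layers contribute to the KV cache.
hybrid = kv_cache_bytes(attn_layers=4, kv_heads=8, head_dim=128, seq_len=seq_len)

print(full // 2**20, "MiB vs", hybrid // 2**20, "MiB")  # cache shrinks 8x
```

Under these assumed sizes, keeping 4 of 32 layers as attention cuts the sequence-length-proportional cache from 4 GiB to 512 MiB, which is the kind of headroom that makes long-context workloads viable on memory-constrained embedded platforms.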

Multimodal and Perception Capabilities

The runtime adds native support for Qwen3-TTS and Qwen3-ASR models, enabling low-latency voice interaction on embedded platforms through a Thinker-Talker framework. Additionally, Cosmos Reason 2 brings advanced spatio-temporal reasoning, 3D localization, and long-context processing for humanoid robotics and embodied agents. For autonomous vehicles, NVIDIA Alpamayo integration provides end-to-end trajectory planning with flow matching decoding, explainable decision-making, and FP8-accelerated Vision Transformers.
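Flow matching decoding, mentioned above for trajectory planning, can be sketched in its generic form: a learned velocity field transports noise samples to output waypoints by integrating an ODE. This is a minimal illustration of the general technique, not the Alpamayo planner; the straight-line field and the target waypoint are toy stand-ins for a trained network:

```python
import numpy as np

# Generic flow-matching decoding sketch: integrate dx/dt = v(x, t) from
# t=0 (noise) to t=1 (sample) with explicit Euler steps.
rng = np.random.default_rng(0)

def velocity(x, t, target):
    # Toy stand-in for a learned velocity network: for a point target,
    # the optimal-transport field points straight at it.
    return (target - x) / max(1.0 - t, 1e-3)

target = np.array([1.0, 2.0])   # hypothetical trajectory waypoint
x = rng.standard_normal(2)      # start from Gaussian noise
steps = 100
for i in range(steps):
    t = i / steps
    x = x + velocity(x, t, target) / steps  # Euler step, dt = 1/steps

# x has now been transported from noise onto the target waypoint.
```

The appeal for planning is that decoding is a fixed, small number of deterministic ODE steps, which fits the bounded-latency budgets of an autonomous-driving stack better than open-ended iterative sampling.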

Production-Ready Physical AI

These capabilities represent a shift from modular AI stacks to production-ready, reasoning-based vision-language-action (VLA) models optimized for embedded hardware. Developers can now build next-generation autonomous systems with high-fidelity perception, planning, and dialogue capabilities while staying within power and latency budgets required for mission-critical operations.