Key Additions to TensorRT Edge-LLM
NVIDIA's latest TensorRT Edge-LLM release brings significant enhancements for deploying advanced AI models on resource-constrained edge devices. The update introduces three major capability additions:
Mixture of Experts (MoE) Support: Full optimization for MoE architectures such as Qwen3 MoE, which activate only a subset of expert parameters per token. Developers can tap the reasoning quality of much larger models while meeting the latency and power budgets of real-time autonomous operation.
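The core MoE idea above — a gating network picks a few experts per token, so only a fraction of the parameters run — can be sketched in a few lines. This is a minimal NumPy illustration of top-k softmax routing, not TensorRT Edge-LLM's actual kernels; all dimensions and weights are made up.

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Route one token through the top-k of n experts (softmax gating)."""
    logits = x @ gate_w                         # gate scores, shape (n_experts,)
    top = np.argsort(logits)[-top_k:]           # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                    # renormalized softmax over the top-k
    # Only the selected experts execute; the rest stay idle, saving compute.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
x = rng.standard_normal(d)                      # one token's hidden state
gate_w = rng.standard_normal((d, n_experts))    # router weights
experts = rng.standard_normal((n_experts, d, d))  # one weight matrix per expert
y = moe_forward(x, gate_w, experts)
print(y.shape)  # → (8,)
```

With `top_k=2` of 4 experts, each token touches only half the expert weights, which is why MoE models can carry large total parameter counts at modest per-token cost.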
Cosmos Reason 2 Integration: Native support for NVIDIA's open planning model enables advanced spatio-temporal reasoning, 3D localization, and long-context processing directly on edge hardware for autonomous vehicles and humanoid robotics.
Speech Processing: Optimized Qwen3-TTS and Qwen3-ASR models, built on a Thinker-Talker framework, enable low-latency, end-to-end voice interaction on edge devices.
Hybrid Reasoning and Memory Efficiency
The runtime now fully supports NVIDIA Nemotron Nano 2, featuring a hybrid Mamba-2-Transformer architecture designed for edge deployment. This architecture reduces KV-cache memory overhead while preserving the accuracy contributed by its attention layers—critical for in-cabin AI assistants and robotic dialogue agents that must support complex retrieval-augmented generation (RAG) pipelines and agentic workflows.
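The KV-cache saving is easy to see with back-of-envelope arithmetic: attention layers store keys and values that grow linearly with context length, while Mamba-2 layers keep a constant-size state. The sketch below compares a pure-attention stack against a hybrid that keeps only a few attention layers; all dimensions are illustrative, not Nemotron's actual configuration.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of KV cache: K and V per attention layer, FP16 by default."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Pure-attention model: every one of 32 layers caches K/V at 32k context.
full_attn = kv_cache_bytes(n_attn_layers=32, n_kv_heads=8, head_dim=128, seq_len=32_768)

# Hybrid model: only 4 attention layers cache K/V; the Mamba-2 layers'
# state is constant-size and does not grow with context length.
hybrid = kv_cache_bytes(n_attn_layers=4, n_kv_heads=8, head_dim=128, seq_len=32_768)

print(f"full attention: {full_attn / 2**30:.2f} GiB")   # → 4.00 GiB
print(f"hybrid:         {hybrid / 2**30:.2f} GiB")      # → 0.50 GiB
```

At long contexts this 8x cache reduction is often the difference between fitting and not fitting a RAG workload in an edge device's memory budget.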
Developers can now leverage dynamic "thinking" capabilities at the edge, allowing models to shift seamlessly between deep reasoning and immediate conversational responses, while staying within the strict memory budgets of platforms like DRIVE AGX Thor and Jetson Thor.
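One common pattern for such per-request mode switching is a control directive in the system prompt. The sketch below is purely hypothetical — the `/think` and `/no_think` tokens and prompt layout are illustrative assumptions, not the documented TensorRT Edge-LLM or Nemotron interface.

```python
def build_prompt(user_msg: str, deep_reasoning: bool) -> str:
    """Hypothetical: select a hybrid model's reasoning mode per request
    via an assumed control directive in the system turn."""
    mode = "/think" if deep_reasoning else "/no_think"
    return f"<system>{mode}</system>\n<user>{user_msg}</user>"

# A safety-critical planning query gets the full reasoning budget...
print(build_prompt("Plan a lane change around the stalled truck.", deep_reasoning=True))
# ...while small talk skips it to keep first-token latency low.
print(build_prompt("What's the cabin temperature?", deep_reasoning=False))
```

The point of the pattern is that one deployed model serves both traffic classes, so the device never pays the memory cost of loading a second model just to get fast conversational replies.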
Deployment Impact
The update addresses the core challenge of physical AI: running high-fidelity reasoning, multimodal interaction, and trajectory planning within strict power and latency envelopes. Support for optimized NVIDIA Nemotron models, combined with TensorRT-specific acceleration kernels, makes previously impractical edge deployments viable for autonomous vehicles and embodied agents.