NVIDIA
NVIDIA TensorRT Edge-LLM adds MoE, Cosmos Reason 2, and Qwen3 speech models for embedded AI systems
· release · feature · sdk · platform · performance · model · developer.nvidia.com ↗

Major Capabilities Expanded

NVIDIA TensorRT Edge-LLM, the C++ inference runtime for language and vision-language models on embedded platforms, now supports advanced architectures previously difficult to deploy on edge hardware. Key additions include:

  • Mixture of Experts (MoE) support: Enables efficient inference of large models like Qwen3 MoE by activating only a subset of expert parameters per token, maintaining low latency while accessing reasoning capabilities of much larger models
  • Hybrid Mamba-2-Transformer architecture: Native support for NVIDIA Nemotron 2 Nano, providing System 2 reasoning capabilities while reducing memory footprint compared to traditional attention-only models
  • Speech processing: Integrated Qwen3-TTS and Qwen3-ASR models for end-to-end voice dialogue with low end-to-end latency
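The MoE bullet above hinges on per-token expert routing: a small router scores every expert, but only the top-k experts actually run. A minimal sketch of that routing step is below; it is generic illustration, not the TensorRT Edge-LLM API, and the function name and shapes are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Generic top-k MoE routing sketch (hypothetical helper, not a library call).
// Given per-expert router logits for one token, select the top_k experts and
// return (expert_index, softmax-renormalized weight) pairs. Only these
// experts' parameters are touched per token, which is how MoE models keep
// latency low while drawing on a much larger total parameter count.
std::vector<std::pair<int, float>> route_token(const std::vector<float>& logits,
                                               int top_k) {
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    // Order the first top_k indices by descending logit.
    std::partial_sort(idx.begin(), idx.begin() + top_k, idx.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });

    // Softmax over only the selected logits, so active weights sum to 1.
    float denom = 0.f;
    for (int i = 0; i < top_k; ++i) denom += std::exp(logits[idx[i]]);

    std::vector<std::pair<int, float>> routed;
    for (int i = 0; i < top_k; ++i)
        routed.push_back({idx[i], std::exp(logits[idx[i]]) / denom});
    return routed;
}
```

In a real MoE layer this selection feeds a gathered expert MLP computation; the sketch only shows why compute scales with k active experts rather than with the full expert count.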

Use Cases and Developer Impact

The release targets autonomous vehicles and robotics applications requiring real-time, low-latency responses. Cosmos Reason 2 integration enables spatio-temporal reasoning, 3D localization, and long-context processing for humanoid robots and embodied agents. NVIDIA Alpamayo support adds end-to-end trajectory planning for AVs with explainable decision-making using multicamera input.

Developers can now deploy sophisticated reasoning models within strict power envelopes on NVIDIA DRIVE AGX Thor, NVIDIA Jetson Thor, and similar embedded platforms. Specialized kernels optimize hybrid layer execution, enabling complex retrieval-augmented generation (RAG) pipelines and agentic workflows while staying within device memory constraints.
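Staying within a device memory budget largely comes down to sizing the KV cache against context length. The back-of-envelope arithmetic below is a generic sketch under assumed, hypothetical model dimensions (layer count, KV heads, head size); it is not tied to any specific model or to TensorRT Edge-LLM internals.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative KV-cache sizing (hypothetical helper, hypothetical dims).
// Bytes needed to hold the KV cache for one sequence:
//   2 (K and V) * layers * kv_heads * head_dim * max_tokens * bytes_per_elem
std::uint64_t kv_cache_bytes(std::uint64_t layers, std::uint64_t kv_heads,
                             std::uint64_t head_dim, std::uint64_t max_tokens,
                             std::uint64_t bytes_per_elem) {
    return 2 * layers * kv_heads * head_dim * max_tokens * bytes_per_elem;
}
```

For example, an assumed 32-layer model with 8 KV heads of dimension 128, a 4096-token context, and FP16 (2-byte) cache entries needs 512 MiB per sequence; halving precision or context halves the footprint, which is the kind of trade-off that matters inside a fixed edge memory envelope. Hybrid Mamba-style layers reduce this further by replacing per-token KV state with fixed-size recurrent state in some layers.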

Production-Ready Physical AI

This update represents a shift from modular AI stacks to reasoning-based approaches for autonomous systems. Developers can leverage dynamic "thinking" capabilities, where models shift between deep reasoning and immediate action, which is critical for in-cabin AI assistants and robotic agents handling complex queries in real time.