Kimi K2.5 multimodal VLM now available on NVIDIA GPU-accelerated endpoints with free prototyping access
feature · model · api · integration · platform · developer.nvidia.com

Kimi K2.5 Model Architecture

Kimi K2.5 is a general-purpose multimodal vision language model trained using the open-source Megatron-LM framework. The model contains 1 trillion total parameters with a 3.2% activation rate per token, achieved through a mixture-of-experts (MoE) architecture with 384 experts and 1 shared dense layer. This design enables efficient processing across multiple modalities with only 32.86B active parameters per token.

The model supports text, image, and video inputs across a 262K token context window, making it well-suited for agentic AI workflows, reasoning, coding, mathematics, and chat applications. Kimi developed the MoonViT3d Vision Tower to handle visual processing, converting images and video frames into embeddings that integrate seamlessly with the language model component.
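Since the endpoint is OpenAI-compatible (see the API section below), image inputs can be packaged using the OpenAI multimodal message schema. The sketch below builds such a message with an inline base64 image; whether video frames use the same `image_url` convention is an assumption, so check the model page on build.nvidia.com for the exact input format.

```python
import base64


def image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Package a text prompt plus one inline image as a single user message
    in the OpenAI-style multimodal chat schema."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            # Inline image as a data URL; remote URLs are typically accepted too.
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }


# Placeholder bytes stand in for a real PNG file read from disk.
msg = image_message("Describe this chart.", b"\x89PNG...")
print(msg["content"][0]["type"])  # -> text
```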

Free Prototyping and Deployment Options

Developers can access Kimi K2.5 for free prototyping through NVIDIA GPU-accelerated endpoints on build.nvidia.com as part of the NVIDIA Developer Program. The browser-based experience allows testing with custom data without requiring API setup.

API Integration: The model is accessible through an OpenAI-compatible API endpoint at https://integrate.api.nvidia.com/v1/chat/completions with support for tool calling via the tools parameter. NVIDIA NIM microservices containers for production inference are coming soon.
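A minimal sketch of a tool-calling request against the endpoint above, using only the Python standard library. The model identifier and the `get_weather` tool are illustrative assumptions, not names confirmed by the article; the request is only sent when an `NVIDIA_API_KEY` environment variable is set.

```python
import json
import os
import urllib.request

ENDPOINT = "https://integrate.api.nvidia.com/v1/chat/completions"
MODEL = "moonshotai/kimi-k2.5"  # assumed id -- check build.nvidia.com for the exact name

# A tool definition in the OpenAI function-calling schema, passed via the
# `tools` parameter; the model can then request a call to it in its reply.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool, not provided by the API
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": tools,
    "tool_choice": "auto",
}


def call_endpoint(payload: dict, api_key: str) -> dict:
    """POST the JSON payload to the endpoint and return the parsed response."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Only send the request when a key is configured.
if (key := os.environ.get("NVIDIA_API_KEY")):
    response = call_endpoint(payload, key)
    print(response["choices"][0]["message"])
```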

Deployment with vLLM: For teams deploying with the vLLM serving framework, detailed instructions and recipes are available in the official vLLM documentation. Additional deployment and fine-tuning options are available through the NVIDIA NeMo Framework for domain-specific customization.
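A self-hosted deployment might look like the launch fragment below. The model identifier and flags are assumptions for illustration; consult the vLLM documentation and the model card for the exact recipe and hardware requirements.

```shell
# Sketch only: model id and flags are assumptions, not the official recipe.
pip install vllm

# Serve the model behind an OpenAI-compatible HTTP API on port 8000,
# sharding it across 8 GPUs with tensor parallelism.
vllm serve moonshotai/kimi-k2.5 \
    --tensor-parallel-size 8 \
    --trust-remote-code
```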

Getting Started

Developers can start building immediately by visiting the Kimi K2.5 page on build.nvidia.com; registration in the NVIDIA Developer Program is free. The model supports standard chat completion parameters, including temperature, top_p, frequency penalty, and streaming responses.
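The sampling parameters named above map directly onto an OpenAI-compatible request. The sketch below assembles them and, when an `NVIDIA_API_KEY` environment variable is set, streams tokens back through the `openai` client package; the model identifier is an assumption, so check build.nvidia.com for the exact string.

```python
import os

# Standard chat-completion sampling parameters, as listed in the article.
request_kwargs = {
    "model": "moonshotai/kimi-k2.5",  # assumed id -- verify on build.nvidia.com
    "messages": [
        {"role": "user", "content": "Summarize MoE routing in two sentences."}
    ],
    "temperature": 0.6,        # sampling randomness
    "top_p": 0.9,              # nucleus sampling cutoff
    "frequency_penalty": 0.0,  # values > 0 discourage token repetition
    "stream": True,            # receive tokens incrementally
}

# Requires `pip install openai`; the endpoint speaks the OpenAI protocol.
if (key := os.environ.get("NVIDIA_API_KEY")):
    from openai import OpenAI

    client = OpenAI(base_url="https://integrate.api.nvidia.com/v1", api_key=key)
    stream = client.chat.completions.create(**request_kwargs)
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
```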