Frontier Model Now Available on Workers AI
Cloudflare has made Kimi K2.5, an open-source frontier-scale model from Moonshot AI, available on its Workers AI inference platform. This marks the first major frontier model natively integrated into Cloudflare's AI stack, enabling developers to build sophisticated AI agents without leaving the platform.
Key Capabilities
Kimi K2.5 brings several advanced features to Workers AI:
- 256,000-token context window — Maintains full conversation history, tool definitions, and entire codebases across long-running agent sessions
- Multi-turn tool calling — Agents can invoke external tools and APIs across multiple conversation turns
- Vision inputs — Process images alongside text in the same request
- Structured outputs — JSON mode and JSON Schema support for reliable downstream parsing
- Function calling — Seamless integration of external APIs and tools into agent workflows
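Structured outputs are the capability most directly useful for downstream automation. The sketch below shows what a JSON Schema request body might look like; the model ID and exact field names are assumptions based on the OpenAI-style chat/completions shape, so check the official model page for the canonical format.

```javascript
// Sketch: requesting a structured (JSON Schema) response from Kimi K2.5.
// The model ID and request fields below are illustrative assumptions.
const request = {
  model: "@cf/moonshotai/kimi-k2.5", // hypothetical model ID
  messages: [
    {
      role: "user",
      content: "Extract the city and temperature from: 'It is 18C in Lisbon.'",
    },
  ],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "weather_reading",
      schema: {
        type: "object",
        properties: {
          city: { type: "string" },
          temperature_c: { type: "number" },
        },
        required: ["city", "temperature_c"],
      },
    },
  },
};

// With structured outputs, the reply content should parse as JSON that
// matches the schema, so downstream code can rely on the fields existing.
const mockReplyContent = '{"city": "Lisbon", "temperature_c": 18}';
const parsed = JSON.parse(mockReplyContent);
console.log(parsed.city, parsed.temperature_c); // Lisbon 18
```

The mocked reply stands in for `choices[0].message.content` in a real response; the point is that a schema-constrained reply can be parsed and consumed without defensive validation code.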
Performance Optimizations
Cloudflare has enhanced its inference infrastructure with two major improvements:
Prefix Caching: The platform now surfaces cached tokens as a usage metric and applies discounted pricing to cached tokens compared to fresh input tokens. This optimization significantly reduces Time to First Token (TTFT) and increases Tokens Per Second (TPS) throughput, which is particularly beneficial for agents that resend context across multiple turns.
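To see why the cached-token metric matters for cost, here is a back-of-the-envelope sketch. The prices are placeholders, not Cloudflare's actual rates; the `usage` object mirrors the cached-token metric the platform surfaces.

```javascript
// Sketch: estimating the effect of prefix caching on per-request input cost.
// Prices are hypothetical placeholders, not Cloudflare's published rates.
const PRICE_PER_M_INPUT = 0.6; // $ per 1M fresh input tokens (assumed)
const PRICE_PER_M_CACHED = 0.1; // $ per 1M cached tokens (assumed)

function inputCost(usage) {
  const fresh = usage.prompt_tokens - usage.cached_tokens;
  return (
    (fresh * PRICE_PER_M_INPUT + usage.cached_tokens * PRICE_PER_M_CACHED) / 1e6
  );
}

// A multi-turn agent resends ~200k tokens of context; most hits the cache.
const usage = { prompt_tokens: 210_000, cached_tokens: 200_000 };
console.log(inputCost(usage).toFixed(4)); // 0.0260 (with caching)
console.log(((usage.prompt_tokens * PRICE_PER_M_INPUT) / 1e6).toFixed(4)); // 0.1260 (without)
```

Under these assumed prices, a turn that is mostly cached context costs roughly a fifth of the uncached price, which is why long-running agents benefit most.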
Asynchronous Batch API: A redesigned pull-based system allows developers to submit batches of inference requests that process as capacity becomes available, typically completing within 5 minutes. This eliminates capacity errors and is ideal for non-real-time workflows like code scanning or research agents.
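The pull-based flow amounts to "submit, then poll until complete." The sketch below mocks that loop; `submitBatch` and `pollBatch` are illustrative stand-ins, not the real Workers AI API surface.

```javascript
// Sketch: a pull-based batch workflow. submitBatch/pollBatch are mocked
// stand-ins for the async batch endpoints; names are illustrative.
async function submitBatch(requests) {
  // The mock batch reports "queued" for two polls before completing.
  return { id: "batch-123", remainingPolls: 2, requests };
}

async function pollBatch(batch) {
  if (batch.remainingPolls > 0) {
    batch.remainingPolls -= 1;
    return { status: "queued" };
  }
  // Results arrive once capacity has been available to run the batch.
  return {
    status: "complete",
    results: batch.requests.map((r) => ({ input: r, output: `echo: ${r}` })),
  };
}

async function runBatch(requests, intervalMs = 10) {
  const batch = await submitBatch(requests);
  for (;;) {
    const res = await pollBatch(batch);
    if (res.status === "complete") return res.results;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

runBatch(["scan file A", "scan file B"]).then((results) => {
  console.log(results.length); // 2
});
```

Because the caller pulls results rather than holding a connection open, there is no capacity error to handle; the batch simply completes when capacity is available.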
Multiple Access Methods
Developers can access Kimi K2.5 through several interfaces:
- Workers AI binding (`env.AI.run()`)
- REST API endpoints (`/run` or `/v1/chat/completions`)
- AI Gateway
- OpenAI-compatible endpoint for drop-in compatibility
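For the OpenAI-compatible path, an existing client typically only needs its base URL pointed at the Workers AI endpoint. The sketch below builds a standard chat/completions request; the base URL shape, placeholders, and model ID are illustrative assumptions.

```javascript
// Sketch: drop-in use of the OpenAI-compatible endpoint. The base URL
// shape, placeholders, and model ID are assumptions for illustration.
const BASE_URL =
  "https://api.cloudflare.com/client/v4/accounts/<account_id>/ai/v1"; // placeholder

function buildChatRequest(model, userMessage) {
  return {
    url: `${BASE_URL}/chat/completions`,
    method: "POST",
    headers: {
      Authorization: "Bearer <api_token>", // placeholder
      "Content-Type": "application/json",
    },
    // Standard OpenAI chat/completions payload, unchanged.
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: userMessage }],
    }),
  };
}

const req = buildChatRequest("@cf/moonshotai/kimi-k2.5", "Hello");
console.log(req.url.endsWith("/chat/completions")); // true
```

Because the payload is the standard chat/completions shape, existing SDKs and tooling built against the OpenAI API should work without code changes beyond configuration.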
Complete documentation, pricing details, and prompt caching guidelines are available on the official model page.