Cloudflare Workers AI launches Kimi K2.5, cutting LLM costs by 77% for agentic workloads
Cloudflare Workers · release, feature, model, api, platform, performance · blog.cloudflare.com ↗

Frontier Models Now Available on Workers AI

Cloudflare is expanding Workers AI beyond smaller models to include frontier-class open-source LLMs. Starting today, developers can access Moonshot AI's Kimi K2.5 model directly through the Workers AI platform. This marks a significant shift in Cloudflare's AI inference strategy, bringing large-scale model inference to their Developer Platform alongside existing agent infrastructure primitives like Durable Objects, Workflows, and the Agents SDK.

Kimi K2.5 offers the core capabilities required for agentic workloads:

  • 256k token context window for handling complex, long-running tasks
  • Multi-turn tool calling for agent-driven workflows
  • Vision input support for multimodal reasoning
  • Structured output capabilities for reliable programmatic use
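As a concrete illustration of multi-turn tool calling, the sketch below builds an OpenAI-style chat payload with one tool definition. The model slug (`@cf/moonshotai/kimi-k2.5`) and the `read_file` tool are hypothetical placeholders, and the exact request schema should be verified against the Workers AI documentation:

```python
import json

# Hypothetical model slug; check the Workers AI model catalog for the real one.
MODEL = "@cf/moonshotai/kimi-k2.5"

def build_agent_request(user_prompt: str) -> dict:
    """Build an OpenAI-style chat payload exercising tool calling.

    The tool definition is illustrative only; real agents would expose
    their own functions here.
    """
    return {
        "messages": [
            {"role": "system", "content": "You are a code-review agent."},
            {"role": "user", "content": user_prompt},
        ],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "read_file",  # hypothetical tool for illustration
                    "description": "Read a file from the repository",
                    "parameters": {
                        "type": "object",
                        "properties": {"path": {"type": "string"}},
                        "required": ["path"],
                    },
                },
            }
        ],
        "max_tokens": 1024,
    }

payload = build_agent_request("Audit src/auth.ts for injection bugs.")
print(json.dumps(payload)[:60])
```

On a multi-turn loop, the agent would append the model's tool-call message and a `tool` role result to `messages` and call the model again until it produces a final answer.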

The Cost-Performance Case

In production testing across Cloudflare's internal development tools, Kimi K2.5 has proven to be both fast and remarkably cost-efficient. The company's security code review agent processes over 7 billion tokens daily across its codebases. Running this workload on Kimi K2.5 cost a fraction of proprietary alternatives: a 77% saving compared to mid-tier proprietary models, with quality maintained. The same agent has identified more than 15 confirmed security issues in a single codebase, demonstrating that the cost reduction did not compromise capability.
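To put the 77% figure in perspective, here is a back-of-the-envelope calculation at the stated token volume. The per-million-token price is an assumed illustrative rate, not one given in the post; only the savings percentage and daily token count come from the source:

```python
# Hypothetical pricing for illustration only; the post states a 77%
# savings figure and the daily volume, but not the underlying rates.
TOKENS_PER_DAY = 7_000_000_000       # "over 7 billion tokens daily"
PROPRIETARY_USD_PER_M = 3.00         # assumed mid-tier proprietary rate
SAVINGS_RATE = 0.77                  # stated in the post

proprietary_daily = TOKENS_PER_DAY / 1_000_000 * PROPRIETARY_USD_PER_M
open_daily = proprietary_daily * (1 - SAVINGS_RATE)

print(f"Proprietary: ${proprietary_daily:,.0f}/day")   # → $21,000/day
print(f"Kimi K2.5:   ${open_daily:,.0f}/day")          # → $4,830/day
```

Even under conservative assumed rates, the absolute difference compounds quickly at agentic volumes.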

As personal and coding agents proliferate, cost is becoming the primary blocker to scaling AI adoption. When organizations deploy multiple agents processing hundreds of thousands of tokens per hour, proprietary model pricing becomes unsustainable.

Optimized Inference Infrastructure

To serve Kimi K2.5 effectively, Cloudflare engineered several platform improvements atop their proprietary Infire inference engine:

  • Custom kernels optimized specifically for Kimi K2.5, improving GPU utilization beyond default configurations
  • Advanced parallelism techniques (data, tensor, and expert parallelism) for handling large models
  • Disaggregated prefill strategies that separate prefill and generation stages across different hardware for better throughput
  • Prefix caching to reduce redundant computation for agentic workflows
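Prefix caching pays off for agents because every turn in a loop repeats the same long prefix (system prompt plus tool schemas). The toy sketch below illustrates the idea only; it stands in for the real KV-cache reuse inside an inference engine and is not how Infire is implemented:

```python
import hashlib

class PrefixCache:
    """Toy model of prefix caching: the expensive prefill pass over a
    shared prompt prefix is computed once, then reused on later turns."""

    def __init__(self):
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def _key(self, prefix: str) -> str:
        return hashlib.sha256(prefix.encode()).hexdigest()

    def prefill(self, prefix: str):
        key = self._key(prefix)
        if key in self._cache:
            self.hits += 1          # prefix seen before: skip recomputation
            return self._cache[key]
        self.misses += 1
        # Stand-in for the KV state a real prefill pass would produce.
        kv_state = f"kv({len(prefix)} chars)"
        self._cache[key] = kv_state
        return kv_state

cache = PrefixCache()
SYSTEM = "You are a security review agent. " * 50  # long shared prefix
for task in ["check auth.ts", "check db.ts", "check api.ts"]:
    cache.prefill(SYSTEM)  # same prefix every turn

print(cache.hits, cache.misses)  # → 2 1
```

In a real engine the cached object is the attention KV cache for the prefix tokens, so subsequent turns only prefill the new suffix.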

Developers benefit from these optimizations immediately via the API; no ML engineering or DevOps expertise is needed to achieve production-grade model serving.
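Access over the Workers AI REST API follows the documented `/ai/run/{model}` route. The snippet below only constructs the request (nothing is sent); the account ID and token are placeholders, and the Kimi K2.5 model slug is an assumption to verify against the model catalog:

```python
# Sketch of a Workers AI REST call; no network request is made here.
ACCOUNT_ID = "YOUR_ACCOUNT_ID"     # placeholder
API_TOKEN = "YOUR_API_TOKEN"       # placeholder
MODEL = "@cf/moonshotai/kimi-k2.5" # hypothetical slug; check the catalog

url = (
    "https://api.cloudflare.com/client/v4/accounts/"
    f"{ACCOUNT_ID}/ai/run/{MODEL}"
)
headers = {"Authorization": f"Bearer {API_TOKEN}"}
body = {"messages": [{"role": "user", "content": "Summarize this diff."}]}

print(url)
```

From inside a Worker, the same call would normally go through the `env.AI` binding instead of the raw REST endpoint.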

What Developers Should Do

To get started, visit the Workers AI documentation for Kimi K2.5. The model integrates seamlessly with existing Cloudflare agent infrastructure, enabling end-to-end agentic workflows on a single unified platform. For cost-sensitive workloads currently using proprietary models, evaluating Kimi K2.5 could yield substantial savings without compromising reasoning quality.