New Agent Skill for CUDA Kernel Development
Hugging Face has released a new agent skill that enables coding agents to automatically write optimized CUDA kernels for machine learning workloads. The skill teaches agents how to generate production-grade kernels complete with PyTorch bindings, GPU-specific optimizations, and comprehensive benchmarking—addressing a long-standing problem: the domain expertise required was previously scattered across documentation and Stack Overflow.
What's Included
The skill ships with the `kernels` library and can be installed with a single command: `kernels skills add cuda-kernels --claude`. It consists of approximately 550 tokens of structured guidance plus reference scripts, GPU optimization guides, and working examples covering:
- Hardware optimization for NVIDIA GPUs (H100, A100, T4) including compute capabilities, memory bandwidth profiles, and block sizing strategies
- Integration patterns for both diffusers and transformers libraries, with library-specific pitfalls and solutions
- Kernel templates with vectorized memory access patterns for BF16, FP16, and FP32 data types
- Benchmarking workflows for isolated micro-benchmarks and end-to-end pipeline performance validation
- Kernel Hub integration for loading pre-compiled community kernels
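To make the "block sizing strategies" point above concrete, here is a small sketch of the kind of launch-configuration arithmetic such guides describe. The helper name, default block size, and vectorization factor are illustrative assumptions, not part of the skill's actual API:

```python
def launch_config(num_elements: int, vec_width: int = 4, block_size: int = 256):
    """Return (grid_size, block_size) so every element is covered.

    vec_width models vectorized memory access (e.g. four BF16 values per
    8-byte load), so each thread processes vec_width elements.
    """
    elements_per_block = block_size * vec_width
    # Ceiling division: enough blocks to cover any tail elements.
    grid_size = (num_elements + elements_per_block - 1) // elements_per_block
    return grid_size, block_size

# Example: a 4096-dim hidden state across 8 tokens = 32768 elements.
print(launch_config(32768))  # (32, 256)
print(launch_config(32769))  # tail covered by one extra block: (33, 256)
```

A real kernel would additionally guard the tail with a bounds check inside the kernel body, since the last block may run past `num_elements`.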
How Developers Use It
Once installed, developers can prompt their coding agent with specific requests:
Build a vectorized RMSNorm kernel for H100 targeting the Qwen3-8B model in transformers.
The agent reads the skill, selects appropriate architecture parameters, generates CUDA source code, writes PyTorch bindings, configures `build.toml`, and creates benchmark scripts—all without manual intervention.
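The benchmark scripts mentioned above follow a standard warmup-then-measure structure. The pure-Python stand-in below illustrates only that structure; a real GPU benchmark would synchronize the device (or use CUDA events) around the timed region, and the function name and defaults here are assumptions for illustration:

```python
import time

def bench(fn, *args, warmup: int = 5, iters: int = 20) -> float:
    """Micro-benchmark skeleton: warm up, then average timed iterations."""
    for _ in range(warmup):       # warmup amortizes one-time costs (JIT, caches)
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    end = time.perf_counter()
    return (end - start) / iters  # mean seconds per call

# Usage with a trivial stand-in for a kernel invocation:
per_call = bench(lambda xs: [x * 2 for x in xs], list(range(1024)))
print(f"{per_call * 1e6:.1f} µs/call")
```

End-to-end pipeline validation, by contrast, times the whole model forward pass rather than one kernel in isolation, which is why the skill covers both workflows.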
Supported Agents
The skill supports Claude Code, Cursor, Codex, and OpenCode, with options to install globally or to custom destinations. Hugging Face is also open to contributions to the skill itself via GitHub.
Context and Impact
This release addresses a critical pain point in CUDA kernel development: the vast surface area of domain knowledge required, from hardware-specific optimization guides to library integration patterns. By packaging this expertise into an agent skill, Hugging Face makes it possible for agents to produce working, optimized kernels end-to-end. The authors demonstrated success with real production targets including both diffusers pipelines and transformers models.