New Agent Skill for CUDA Kernel Development
Hugging Face has released a new agent skill that enables coding agents to automatically write optimized CUDA kernels for machine learning workloads. The skill teaches agents how to generate production-grade kernels complete with PyTorch bindings, GPU-specific optimizations, and comprehensive benchmarking—addressing a long-standing problem: the domain expertise required was previously scattered across documentation and Stack Overflow.
What's Included
The skill ships with the `kernels` library and can be installed with a single command: `kernels skills add cuda-kernels --claude`. It consists of approximately 550 tokens of structured guidance plus reference scripts, GPU optimization guides, and working examples covering:
- Hardware optimization for NVIDIA GPUs (H100, A100, T4) including compute capabilities, memory bandwidth profiles, and block sizing strategies
- Integration patterns for both diffusers and transformers libraries, with library-specific pitfalls and solutions
- Kernel templates with vectorized memory access patterns for BF16, FP16, and FP32 data types
- Benchmarking workflows for isolated micro-benchmarks and end-to-end pipeline performance validation
- Kernel Hub integration for loading pre-compiled community kernels
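To make the "block sizing strategies" point above concrete, here is a small sketch of the kind of launch-configuration arithmetic such guides describe. The helper name, default block size, and vectorization factor are illustrative assumptions, not part of the skill's actual API:

```python
def launch_config(num_elements: int, vec_width: int = 4, block_size: int = 256):
    """Return (grid_size, block_size) so every element is covered.

    vec_width models vectorized memory access (e.g. four BF16 values per
    8-byte load), so each thread processes vec_width elements.
    """
    elements_per_block = block_size * vec_width
    # Ceiling division: enough blocks to cover any tail elements.
    grid_size = (num_elements + elements_per_block - 1) // elements_per_block
    return grid_size, block_size

# Example: a 4096-dim hidden state across 8 tokens = 32768 elements.
print(launch_config(32768))  # (32, 256)
print(launch_config(32769))  # tail covered by one extra block: (33, 256)
```

A real kernel would additionally guard the tail with a bounds check inside the kernel body, since the last block may run past `num_elements`.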
How Developers Use It
Once installed, developers can prompt their coding agent with specific requests:
Build a vectorized RMSNorm kernel for H100 targeting the Qwen3-8B model in transformers.
The agent reads the skill, selects appropriate architecture parameters, generates CUDA source code, writes PyTorch bindings, configures `build.toml`, and creates benchmark scripts—all without manual intervention.
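The benchmark scripts mentioned above follow a standard warmup-then-measure structure. The pure-Python stand-in below illustrates only that structure; a real GPU benchmark would synchronize the device (or use CUDA events) around the timed region, and the function name and defaults here are assumptions for illustration:

```python
import time

def bench(fn, *args, warmup: int = 5, iters: int = 20) -> float:
    """Micro-benchmark skeleton: warm up, then average timed iterations."""
    for _ in range(warmup):       # warmup amortizes one-time costs (JIT, caches)
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    end = time.perf_counter()
    return (end - start) / iters  # mean seconds per call

# Usage with a trivial stand-in for a kernel invocation:
per_call = bench(lambda xs: [x * 2 for x in xs], list(range(1024)))
print(f"{per_call * 1e6:.1f} µs/call")
```

End-to-end pipeline validation, by contrast, times the whole model forward pass rather than one kernel in isolation, which is why the skill covers both workflows.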
Supported Agents
The skill supports Claude Code, Cursor, Codex, and OpenCode, with options to install globally or to custom destinations. Hugging Face is also open to contributions to the skill itself via GitHub.
Context and Impact
This release addresses a critical pain point in CUDA kernel development: the vast surface area of domain knowledge required, from hardware-specific optimization guides to library integration patterns. By packaging this expertise into an agent skill, Hugging Face makes it possible for agents to produce working, optimized kernels end-to-end. The authors demonstrated success with real production targets including both diffusers pipelines and transformers models.