cuTile.jl Brings CUDA Tile-Based Programming to Julia, Achieving 91-100% Performance Parity with Python
release · feature · sdk · performance · integration · developer.nvidia.com

Overview

NVIDIA has released cuTile.jl, a Julia package that brings the CUDA Tile programming model to the Julia language. The release extends NVIDIA's tile-based approach to GPU programming, which abstracts away low-level thread and memory management, from Python to Julia, so developers in both ecosystems can write high-performance GPU kernels in terms of tiles of data rather than individual threads.

What Tile-Based Programming Offers

Traditional CUDA programming requires developers to manually manage threads, warps, and memory hierarchies. cuTile.jl simplifies this by allowing developers to describe operations on tiles of data, letting the compiler handle hardware mapping automatically. For example, a simple vector addition kernel in traditional CUDA requires explicit thread indexing and bounds checking, while the cuTile approach lets developers focus on tile-level operations like load, compute, and store.
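
The contrast between the two styles can be sketched on the CPU. The following is a plain-Python illustration only, not cuTile or CUDA code: the helper names (add_per_thread, load_tile, store_tile, add_tiled) are hypothetical and exist purely to show where the indexing and bounds handling live in each style.

```python
# CPU-only illustration of the two programming styles.
# None of these helpers are part of cuTile, cuTile.jl, or CUDA; all names are hypothetical.

def add_per_thread(a, b, num_threads):
    """Thread style: one logical thread per element, explicit index and bounds check."""
    out = [0.0] * len(a)
    for tid in range(num_threads):       # stands in for the GPU thread grid
        if tid < len(a):                 # bounds check for surplus threads
            out[tid] = a[tid] + b[tid]
    return out


def load_tile(src, tile_id, tile_size):
    """Return the tile_id-th tile of src; slicing clips the final partial tile."""
    start = tile_id * tile_size
    return src[start:start + tile_size]


def store_tile(dst, tile_id, tile_size, tile):
    """Write a tile back to its position in dst."""
    start = tile_id * tile_size
    dst[start:start + len(tile)] = tile


def add_tiled(a, b, tile_size):
    """Tile style: the kernel body is just load, compute, store."""
    out = [0.0] * len(a)
    num_tiles = (len(a) + tile_size - 1) // tile_size   # ceiling division
    for tile_id in range(num_tiles):
        tile_a = load_tile(a, tile_id, tile_size)
        tile_b = load_tile(b, tile_id, tile_size)
        tile_c = [x + y for x, y in zip(tile_a, tile_b)]
        store_tile(out, tile_id, tile_size, tile_c)
    return out
```

In the tiled version the per-element index arithmetic and the edge handling sit inside the load and store helpers, which is roughly the work the tile compiler is described as taking over.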

Julia Idioms and API Compatibility

cuTile.jl maintains close syntax and abstraction parity with cuTile Python, making it easy to port code between languages and leverage existing documentation. At the same time, it embraces Julia conventions:

  • 1-based indexing (vs. Python's 0-based)
  • Broadcasting syntax (.^, .-, ./) for element-wise operations
  • Standard Julia functions such as sum, size, and sqrt, extended to work on tiles
  • Kernels read like ordinary Julia array code, enabling code sharing between CPU and GPU
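
When porting between the two languages, the indexing difference is the easiest place to slip. The sketch below assumes that tile indices follow each language's element-indexing convention (0-based in Python, 1-based in Julia); the function names are hypothetical and the code only works through the arithmetic.

```python
# Worked index arithmetic for porting tile indices between conventions.
# Hypothetical helpers; assumes tile ids follow the host language's base.

def tile_range_zero_based(tile_id, tile_size):
    """Inclusive element range of a 0-based tile id, in 0-based element indices."""
    first = tile_id * tile_size
    return first, first + tile_size - 1


def tile_range_one_based(tile_id, tile_size):
    """Inclusive element range of a 1-based tile id, in 1-based element indices."""
    first = (tile_id - 1) * tile_size + 1
    return first, first + tile_size - 1


# The second tile of size 4 covers the same four elements in both conventions:
# 0-based tile 1 spans elements 4..7, and 1-based tile 2 spans elements 5..8.
```

The two ranges name the same elements, since each 1-based index is its 0-based counterpart plus one.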

Performance

cuTile.jl targets the same NVIDIA Tile IR backend as the Python version, so both front ends share the same compilation backend. Benchmarks on an NVIDIA GeForce RTX 5080 (Blackwell architecture) show the following performance relative to cuTile Python:

  • Vector addition: 99% performance parity
  • Matrix transpose: 98% performance parity
  • Matrix multiplication: 100% performance parity
  • Batch matrix multiply: 91% performance parity
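
The source does not say how these parity figures were computed. One plausible reading, taken here as an assumption, is the ratio of Julia throughput to Python throughput for the same kernel. The sketch below uses that definition and then recovers the 91-100% range in the headline from the four figures listed above; parity_percent is a hypothetical helper.

```python
# Assumed definition of "performance parity": Julia throughput as a
# percentage of Python throughput on the same kernel. This definition is
# an assumption; the source only reports the resulting percentages.

def parity_percent(julia_throughput, python_throughput):
    """Return Julia throughput as a percentage of Python throughput."""
    if python_throughput <= 0:
        raise ValueError("python_throughput must be positive")
    return 100.0 * julia_throughput / python_throughput


# Parity figures as reported in the list above (percent).
reported_parity = {
    "Vector addition": 99,
    "Matrix transpose": 98,
    "Matrix multiplication": 100,
    "Batch matrix multiply": 91,
}

lowest = min(reported_parity.values())
highest = max(reported_parity.values())
slowest_kernel = min(reported_parity, key=reported_parity.get)
```

Here lowest and highest bound the range quoted in the title, and slowest_kernel identifies batch matrix multiply as the benchmark with the largest gap.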

Kernels with more intricate control flow, such as layer normalization and FFT, currently lag slightly behind their cuTile Python counterparts while the Julia compiler matures; these gaps are being actively addressed.

Getting Started

cuTile.jl is available from the JuliaGPU organization on GitHub. The package exposes tensor cores and other specialized NVIDIA hardware while keeping a high-level abstraction that reduces development complexity.