Overview
NVIDIA has released cuTile.jl, extending its CUDA Tile programming model to the Julia language. The library simplifies GPU kernel development by letting programmers work with tiles of data rather than individual threads, with the compiler handling the mapping to hardware. The release brings to Julia developers the same programming paradigm already available in cuTile for Python.
Key Features
Abstraction Layer: cuTile.jl hides low-level details like thread indexing, warp management, and explicit out-of-bounds checks. Developers describe operations on data tiles, and the compiler automatically optimizes for tensor cores and other specialized hardware.
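To make the tile abstraction concrete, here is a minimal, hypothetical sketch of a tile-level vector-add kernel. The names used (`tile_load`, `tile_store`, `bid`) are illustrative assumptions, not confirmed cuTile.jl API; consult the repository for the actual syntax.

```julia
# Hypothetical sketch of a tile-level vector-add kernel.
# tile_load/tile_store/bid are illustrative names, not confirmed API.
using cuTile

function vadd_kernel(a, b, c, tile_size)
    # Each block processes one tile; no per-thread indexing or
    # explicit bounds checks are written by hand.
    i  = bid(1)                       # 1-based block index (Julia convention)
    ta = tile_load(a, i, tile_size)   # load a whole tile from global memory
    tb = tile_load(b, i, tile_size)
    tile_store(c, i, ta .+ tb)        # element-wise add on tiles, then store
end
```

The point is structural rather than literal: the kernel describes what happens to a tile, and the compiler decides how threads and warps realize it on the hardware.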
Julia Idioms: The library maintains close syntax parity with cuTile Python but leverages Julia-specific conventions:
- 1-based indexing (instead of Python's 0-based)
- Broadcasting syntax (`.^`, `.-`, `./`) for element-wise operations
- Standard Julia functions like `sum`, `sqrt`, and `size` extended to work on tiles
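A short fragment illustrating these idioms, assuming a one-dimensional tile `t` already loaded inside a kernel body (the tile-valued semantics of these calls are an assumption based on the description above, not confirmed API):

```julia
# Inside a hypothetical kernel body, where `t` is a 1-D tile of floats.
# Broadcasting and reductions use ordinary Julia syntax:
m = sum(t) / size(t, 1)   # mean of the tile's elements
d = t .- m                # element-wise subtraction via broadcasting
v = sum(d .^ 2)           # sum of squared deviations
s = sqrt(v / size(t, 1))  # population standard deviation
```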
Code Portability: Kernels written in cuTile.jl read nearly identically to their Python counterparts, making it easy to port code between languages and leverage existing documentation.
Performance Characteristics
On an NVIDIA GeForce RTX 5080 (Blackwell architecture), cuTile.jl achieves near performance parity with cuTile Python:
- Vector addition: 838 GB/s (99% of Python)
- Matrix transpose: 797 GB/s (98% of Python)
- Matrix multiplication: 50.9 TFLOPS (100% of Python)
- Batch matrix multiply: 43.0 TFLOPS (91% of Python)
More complex kernels with intricate control flow (layer normalization, FFT) currently lag slightly as the Julia compiler matures, though developers note these issues are being actively addressed.
Getting Started
The package is available at github.com/JuliaGPU/cuTile.jl. It gives Julia developers access to the same high-level programming model that made CUDA programming more accessible in Python, tailored to Julia's numerical computing ecosystem.
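Installation follows the standard Julia package workflow. A sketch, assuming the package is installed directly from the GitHub URL (it may also be available through a registry; check the repository README):

```julia
using Pkg
# Install directly from the repository (assumes it is not yet in the
# General registry; adjust if a registered release exists).
Pkg.add(url="https://github.com/JuliaGPU/cuTile.jl")
using cuTile
```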