Overview
NVIDIA has released cuTile.jl, extending its CUDA Tile programming model to the Julia language. The library simplifies GPU kernel development by letting programmers work with tiles of data rather than individual threads, with the compiler handling the mapping to hardware. The release brings to Julia developers the same programming paradigm already available in cuTile for Python.
Key Features
Abstraction Layer: cuTile.jl hides low-level details like thread indexing, warp management, and explicit out-of-bounds checks. Developers describe operations on data tiles, and the compiler automatically optimizes for tensor cores and other specialized hardware.
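To make the tile abstraction concrete, here is a minimal, hypothetical sketch of a tile-level vector-add kernel. The names used (`tile_load`, `tile_store`, `bid`) are illustrative assumptions, not confirmed cuTile.jl API; consult the repository for the actual syntax.

```julia
# Hypothetical sketch of a tile-level vector-add kernel.
# tile_load/tile_store/bid are illustrative names, not confirmed API.
using cuTile

function vadd_kernel(a, b, c, tile_size)
    # Each block processes one tile; no per-thread indexing or
    # explicit bounds checks are written by hand.
    i  = bid(1)                       # 1-based block index (Julia convention)
    ta = tile_load(a, i, tile_size)   # load a whole tile from global memory
    tb = tile_load(b, i, tile_size)
    tile_store(c, i, ta .+ tb)        # element-wise add on tiles, then store
end
```

The point is structural rather than literal: the kernel describes what happens to a tile, and the compiler decides how threads and warps realize it on the hardware.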
Julia Idioms: The library maintains close syntax parity with cuTile Python but leverages Julia-specific conventions:
- 1-based indexing (instead of Python's 0-based)
- Broadcasting syntax (`.^`, `.-`, `./`) for element-wise operations
- Standard Julia functions like `sum`, `sqrt`, and `size` extended to work on tiles
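A short fragment illustrating these idioms, assuming a one-dimensional tile `t` already loaded inside a kernel body (the tile-valued semantics of these calls are an assumption based on the description above, not confirmed API):

```julia
# Inside a hypothetical kernel body, where `t` is a 1-D tile of floats.
# Broadcasting and reductions use ordinary Julia syntax:
m = sum(t) / size(t, 1)   # mean of the tile's elements
d = t .- m                # element-wise subtraction via broadcasting
v = sum(d .^ 2)           # sum of squared deviations
s = sqrt(v / size(t, 1))  # population standard deviation
```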
Code Portability: Kernels written in cuTile.jl read nearly identically to their Python counterparts, making it easy to port code between languages and leverage existing documentation.
Performance Characteristics
On an NVIDIA GeForce RTX 5080 (Blackwell architecture), cuTile.jl achieves near performance parity with cuTile Python:
- Vector addition: 838 GB/s (99% of Python)
- Matrix transpose: 797 GB/s (98% of Python)
- Matrix multiplication: 50.9 TFLOPS (100% of Python)
- Batch matrix multiply: 43.0 TFLOPS (91% of Python)
More complex kernels with intricate control flow (layer normalization, FFT) currently lag slightly as the Julia compiler matures, though developers note these issues are being actively addressed.
Getting Started
The package is available at github.com/JuliaGPU/cuTile.jl. It gives Julia developers access to the same high-level programming model that made CUDA programming more accessible in Python, tailored to Julia's numerical computing ecosystem.
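Installation follows the standard Julia package workflow. A sketch, assuming the package is installed directly from the GitHub URL (it may also be available through a registry; check the repository README):

```julia
using Pkg
# Install directly from the repository (assumes it is not yet in the
# General registry; adjust if a registered release exists).
Pkg.add(url="https://github.com/JuliaGPU/cuTile.jl")
using cuTile
```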