What is cuTile.jl?
cuTile.jl is a Julia package that brings NVIDIA's CUDA Tile programming model to Julia developers. CUDA Tile abstracts away low-level thread and memory management, allowing developers to write GPU kernels by describing operations on tiles of data rather than individual threads. The compiler automatically handles mapping these operations to hardware, including optimized access to tensor cores.
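To make the tile model concrete, here is a minimal sketch of a vector-addition kernel in this style. The names `bid`, `tile_load`, and `tile_store` are illustrative assumptions for "current tile index", "load a tile", and "store a tile", not cuTile.jl's confirmed API; the point is that the kernel describes one tile's worth of work and never mentions threads.

```julia
# Hypothetical sketch of a tile-style vector-add kernel.
# `bid`, `tile_load`, and `tile_store` are assumed names, not the
# package's confirmed API.
function vadd_kernel(a, b, c, tile_len)
    i = bid()                          # which tile this kernel instance owns
    ta = tile_load(a, i, tile_len)     # load the i-th tile of `a`
    tb = tile_load(b, i, tile_len)     # load the i-th tile of `b`
    tile_store!(c, i, ta .+ tb)        # element-wise add via broadcasting
    return
end
```

The compiler, not the programmer, decides how each tile maps onto threads, shared memory, and (where applicable) tensor cores.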
Syntax and Usability
cuTile.jl maintains deliberate syntax parity with the existing cuTile Python implementation, making it easy to port kernels between languages and to leverage existing Python documentation. At the same time, the package integrates Julia idioms throughout, including 1-based indexing and broadcast expressions for element-wise operations, patterns familiar to Julia programmers.
A key example is the row-normalization kernel (the core of layer normalization): it uses standard Julia functions such as sum() and sqrt(), extended to operate on tiles, along with broadcasting operators (.^, .-, ./). As a result, kernels read like ordinary Julia array code, making it easier to share and reuse code between CPU and GPU.
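A row-normalization kernel in that vein might look like the following sketch. As above, `bid`, `tile_load`, and `tile_store!` are assumed names rather than the confirmed API; what the source does describe is the use of sum(), sqrt(), and the broadcast operators on tile values.

```julia
# Hypothetical sketch of a row-normalization kernel: each kernel
# instance normalizes one row. `bid`, `tile_load`, and `tile_store!`
# are assumed names, not cuTile.jl's confirmed API.
function rownorm_kernel(x, y, ncols)
    row = bid()                           # which row this instance handles
    t = tile_load(x, row, ncols)          # the whole row as one tile
    μ = sum(t) / ncols                    # row mean via a tile reduction
    σ² = sum((t .- μ) .^ 2) / ncols       # variance via broadcasting
    tile_store!(y, row, (t .- μ) ./ sqrt(σ²))
    return
end
```

Apart from the load/store calls, the body is exactly what one would write for a plain Julia vector, which is the CPU/GPU code-sharing point the text makes.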
Performance
cuTile.jl targets the same NVIDIA Tile IR backend as cuTile Python, so both front ends produce identical GPU machine code. On an NVIDIA RTX 5080 (Blackwell architecture), compute-intensive kernels achieve 91-100% of the Python implementation's throughput:
- Vector addition: 838 GB/s (99% vs Python)
- Matrix transpose: 797 GB/s (98% vs Python)
- Matrix multiplication: 50.9 TFLOPS (100% vs Python)
- Batch matrix multiply: 43.0 TFLOPS (91% vs Python)
Some kernels with complex control flow (layer normalization, FFT) still lag slightly as Julia compiler support matures; these gaps are tracked as known issues under active development.
Getting Started
Developers can access cuTile.jl via the JuliaGPU GitHub repository. The package offers a high-level abstraction for GPU programming while generating code efficient enough for production workloads.
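If the package is not yet in Julia's General registry, installing it directly from GitHub is the likely path. The repository URL below is an assumption based on the package name and the JuliaGPU organization mentioned above; adjust it if the actual repository differs or if the package has since been registered.

```julia
using Pkg

# Assumed repository location (JuliaGPU org + package name); if cuTile.jl
# is registered in the General registry, `Pkg.add("cuTile")` may work instead.
Pkg.add(url="https://github.com/JuliaGPU/cuTile.jl")

using cuTile
```

Running kernels additionally requires a CUDA-capable NVIDIA GPU and a driver recent enough for the Tile IR backend.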