NVIDIA
cuTile.jl brings NVIDIA tile-based GPU programming to Julia with near-parity performance
Tags: feature, sdk, api, performance, open-source · Source: developer.nvidia.com

Overview

cuTile.jl is a new Julia package that brings NVIDIA's tile-based GPU programming model to the Julia ecosystem. This abstraction simplifies GPU kernel development by allowing developers to reason about operations on tiles of data rather than managing individual threads, warps, and memory hierarchies directly.

Key Improvements Over Traditional GPU Programming

Kernels written in the traditional CUDA.jl style require explicit thread management and index calculations. With cuTile.jl, developers work at a higher level of abstraction:

  • Automatic thread management: The compiler handles mapping tile operations to GPU hardware
  • Cleaner syntax: Kernels read like standard Julia array code with broadcasting operators (.^, .-, ./)
  • Less boilerplate: No manual out-of-bounds checks or complex index arithmetic needed
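
The contrast in the list above can be sketched on the CPU with plain Julia arrays. This is only an illustration of the two styles and does not use CUDA.jl or cuTile.jl; the function names, `threads_per_block`, and `tile_size` are hypothetical, and a real tile compiler would handle the partitioning that `Iterators.partition` stands in for here.

```julia
# CPU-only sketch: plain Julia arrays stand in for GPU memory.
# Neither function calls CUDA.jl or cuTile.jl.

# Thread style: one logical "thread" per element, with manual index
# arithmetic and an explicit bounds check, as in a classic CUDA kernel.
function add_thread_style!(c, a, b; threads_per_block = 4)
    n = length(c)
    nblocks = cld(n, threads_per_block)
    for block in 1:nblocks, thread in 1:threads_per_block
        i = (block - 1) * threads_per_block + thread  # global element index
        if i <= n                                     # guard the ragged last block
            c[i] = a[i] + b[i]
        end
    end
    return c
end

# Tile style: each step handles a whole tile with one broadcast expression.
# The partition iterator shortens the last tile, so no per-element guard is needed.
function add_tile_style!(c, a, b; tile_size = 4)
    for r in Iterators.partition(eachindex(c), tile_size)
        c[r] .= a[r] .+ b[r]
    end
    return c
end

a = collect(1.0:10.0)
b = fill(2.0, 10)
@assert add_thread_style!(similar(a), a, b) == a .+ b
@assert add_tile_style!(similar(a), a, b) == a .+ b
```

Both functions produce the same result; the difference is that the tile version contains no per-element index or guard, which is the boilerplate the tile model is meant to remove.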

Design Philosophy

cuTile.jl maintains close syntax and abstraction parity with the existing cuTile Python implementation, making it easy to port code between languages and reference Python documentation. Simultaneously, it adopts Julia idioms including 1-based indexing and standard Julia functions like sum, size, and sqrt that work seamlessly on tiles.

Example: A row-normalization kernel uses Julia's standard broadcasting syntax (tile .- mean) rather than explicit loops, making the code intuitive for Julia programmers.
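
To make the broadcasting idiom concrete, here is a minimal row-normalization sketch on an ordinary CPU matrix. It is not the kernel from the original article and uses no cuTile.jl API; `normalize_rows` and the `epsilon` stabilizer are assumptions for illustration. It relies only on `sum`, `size`, and `sqrt` from Base, the same functions the text says work on tiles.

```julia
# Normalize each row to zero mean and (approximately) unit variance.
# Plain Base Julia on a CPU matrix; `epsilon` is a hypothetical stabilizer.
function normalize_rows(tile::AbstractMatrix; epsilon = 1e-5)
    ncols = size(tile, 2)
    mean = sum(tile; dims = 2) ./ ncols               # n×1 column of row means
    centered = tile .- mean                           # broadcast across columns
    variance = sum(centered .^ 2; dims = 2) ./ ncols  # per-row population variance
    return centered ./ sqrt.(variance .+ epsilon)
end

x = [1.0 2.0 3.0; 10.0 20.0 30.0]
y = normalize_rows(x)

# Every row of the result sums to (numerically) zero.
@assert all(abs.(sum(y; dims = 2)) .< 1e-9)
```

The whole computation is expressed with `.-`, `.^`, and `./` on whole arrays, with no explicit loop over rows or columns.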

Performance Results

Benchmarks on an NVIDIA GeForce RTX 5080 (Blackwell architecture) put cuTile.jl within about 10% of cuTile Python on simple kernels, both memory-bound (measured in GB/s) and compute-bound (measured in TFLOPS):

Kernel                   cuTile.jl      cuTile Python   Parity
Vector addition          838 GB/s       843 GB/s        99%
Matrix transpose         797 GB/s       812 GB/s        98%
Matrix multiplication    50.9 TFLOPS    50.5 TFLOPS     100%
Batch matrix multiply    43.0 TFLOPS    47.5 TFLOPS     91%
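
The parity column appears to be the ratio of cuTile.jl throughput to cuTile Python throughput. The sketch below recomputes it from the figures in the table. It assumes the ratio is rounded to the nearest percent and capped at 100%; the cap is inferred from the matrix multiplication row, where cuTile.jl is slightly faster, and is not stated in the source.

```julia
# (kernel, cuTile.jl, cuTile Python) taken from the table above.
# Units (GB/s or TFLOPS) cancel in the ratio.
results = [
    ("Vector addition",       838.0, 843.0),
    ("Matrix transpose",      797.0, 812.0),
    ("Matrix multiplication",  50.9,  50.5),
    ("Batch matrix multiply",  43.0,  47.5),
]

# Assumption: parity = rounded throughput ratio, capped at 100%.
parity(jl, py) = min(100, round(Int, 100 * jl / py))

for (kernel, jl, py) in results
    println(rpad(kernel, 24), parity(jl, py), "%")
end
```

Under these assumptions the computed values match the parity column in all four rows.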

More complex kernels with intricate control flow, such as layer normalization and FFT, still trail cuTile Python while the compiler matures.

Availability

The package is available at github.com/JuliaGPU/cuTile.jl. It targets the same NVIDIA Tile IR backend as cuTile Python, so it should run on the same NVIDIA hardware that cuTile Python supports.