cuTile.jl brings NVIDIA CUDA tile-based programming to Julia with near-parity performance
Source: developer.nvidia.com

What is cuTile.jl?

cuTile.jl extends NVIDIA's CUDA Tile programming model to the Julia ecosystem. The package lets developers write high-performance GPU kernels by describing operations on tiles of data rather than manually managing individual threads and memory hierarchies; the compiler handles mapping those tiles onto hardware resources, including tensor cores.
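To picture the tile model, a vector-add kernel might look something like the sketch below. This is purely illustrative: the function names (`bid`, `load`, `store!`) are assumptions modeled on tile-programming conventions, not API names confirmed by this article.

```julia
# Hypothetical tile-style vector addition. Each kernel instance
# handles one whole tile of data, not one element per thread.
function vadd_kernel(a, b, c)
    pid = bid(1)               # assumed: 1-based index of this tile block
    ta  = load(a, pid)         # assumed: fetch the pid-th tile of `a`
    tb  = load(b, pid)         # assumed: fetch the pid-th tile of `b`
    store!(c, pid, ta .+ tb)   # element-wise add via Julia broadcasting
end
```

The point of the model is visible even in this sketch: the kernel body talks about tiles and broadcast arithmetic, and the scheduling of threads and memory movement is left to the compiler.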

Key Features and Design

The package maintains deliberate syntax and abstraction parity with cuTile Python, making code porting straightforward and allowing developers to leverage existing Python documentation. At the same time, cuTile.jl embraces Julia idioms wherever possible:

  • Julia-native syntax: Uses 1-based indexing and broadcast expressions (.^, .-, ./) for element-wise operations
  • Standard library integration: Extends common Julia functions such as sum, size, and sqrt to operate on tiles
  • Readable kernels: Complex operations like row normalization read as idiomatic Julia array code
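The row-normalization example the article mentions could be sketched as follows. The kernel scaffolding (`bid`, `load`, `store!`) is again a hypothetical assumption, but the tile arithmetic itself uses only the standard Julia functions the article says are extended to tiles: sum, sqrt, and broadcast operators.

```julia
# Hypothetical row normalization: scale each row tile by its L2 norm.
function rownorm_kernel(x, y)
    row = bid(1)                # assumed: 1-based tile-block index
    t   = load(x, row)          # assumed: load one row as a tile
    nrm = sqrt(sum(t .^ 2))     # standard Julia functions applied to a tile
    store!(y, row, t ./ nrm)    # broadcast divide, then store the result
end
```

Note how the body reads as ordinary Julia array code, which is the readability claim the article is making.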

Performance Results

cuTile.jl targets the same NVIDIA Tile IR backend as the Python version, producing identical GPU machine code. On NVIDIA GeForce RTX 5080 (Blackwell architecture), compute-intensive kernels achieve strong performance parity:

  • Vector addition: 838 GB/s (99% vs. Python)
  • Matrix transpose: 797 GB/s (98% vs. Python)
  • Matrix multiplication: 50.9 TFLOPS (100% vs. Python)
  • Batch matrix multiply: 43.0 TFLOPS (91% vs. Python)

Kernels with more complex control flow (such as layer normalization and FFT) currently lag slightly behind their Python counterparts; the team is actively addressing these known gaps as the compiler matures.

Developer Impact

Julia programmers can now write GPU kernels that read much like ordinary Julia array code while still reaching tensor cores and other specialized hardware. This narrows the gap between CPU and GPU code, making it easier to share and reuse logic across platforms.