NVIDIA
CUDA 13.2 Expands Tile Support to Ampere, Ada, and Blackwell with Enhanced Python Features
release · feature · api · sdk · performance · developer.nvidia.com ↗

CUDA Tile Python Enhancements

CUDA Tile, NVIDIA's tensor programming model, is now fully supported on compute capability 8.X (Ampere and Ada) and 10.X, 11.X, and 12.X (Blackwell) architectures. The cuTile Python DSL has received substantial feature enhancements, enabling developers to write more flexible GPU code:

  • Recursive functions and closures with capture (including lambda functions and nested functions)
  • Custom reduction and scan functions for specialized data processing
  • Type-annotated assignments for improved code clarity
  • Enhanced array slicing with Array.slice to create views on subarrays

Installation is reduced to a single pip command: pip install cuda-tile[tileiras], which pulls in all dependencies without requiring a separate system-wide CUDA Toolkit installation.

Core Runtime and Memory Improvements

CUDA 13.2 introduces several critical runtime enhancements for memory management and device interaction:

  • New cudaMemcpyWithAttributesAsync and cudaMemcpy3DWithAttributesAsync APIs simplify single memory transfers with attribute control, avoiding the need for batched API calls
  • Windows WDDM local memory footprint reduction significantly decreases LMEM usage for register spilling and stack variables, benefiting memory-constrained vGPU environments
  • Memory pool property querying via cudaMemPoolGetAttribute allows developers to inspect and replicate memory pool configurations
  • Default Windows GPU driver mode shift from TCC to MCDM improves compatibility and expands feature access

Math Libraries and Developer Tooling

The math library ecosystem expands with experimental Grouped GEMM with MXFP8 support in cuBLAS for Blackwell GPUs, and FP64-emulated cuSOLVER APIs that deliver significant performance gains for QR, LU, and Cholesky factorizations on platforms where INT8 throughput far exceeds native FP64 throughput.
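The core idea behind FP64 emulation is to decompose a high-precision value into pieces whose pairwise products are exact in lower precision, then recombine them. The actual cuSOLVER emulation uses integer slices on tensor cores; as a minimal sketch of the same principle, here is the classic Veltkamp/Dekker splitting in plain Python:

```python
SPLIT = 2.0**27 + 1  # Veltkamp splitting constant for IEEE binary64

def split(a):
    """Split a double into hi + lo, each with at most 27 mantissa bits."""
    c = SPLIT * a
    hi = c - (c - a)
    lo = a - hi
    return hi, lo  # hi + lo == a, exactly

def two_product(a, b):
    """Return (p, err) such that a*b == p + err exactly (Dekker)."""
    p = a * b
    ahi, alo = split(a)
    bhi, blo = split(b)
    # Each partial product below is exact in FP64 because the factors
    # have few enough mantissa bits; summing them recovers the rounding
    # error of the single FP64 multiply.
    err = ((ahi * bhi - p) + ahi * blo + alo * bhi) + alo * blo
    return p, err

p, err = two_product(1.1, 2.3)
```

The emulated factorizations apply the same decompose-multiply-recombine pattern at matrix scale, which is why they pay off on hardware whose low-precision throughput dwarfs native FP64.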

Developer productivity tools see major updates: NVIDIA Nsight Python enables integrated kernel profiling directly in Python, Numba-CUDA gains initial kernel-debugging support, and Nsight Compute adds report clustering and register dependency visualization. CCCL 3.2 introduces cub::DeviceTopK for efficient Top-K selection, plus new segmented scan and binary search primitives, along with improved interoperability with CUDA Python and CuPy.
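cub::DeviceTopK selects the K largest (or smallest) items from an unsorted input without fully sorting it. For intuition, here is the host-side equivalent in plain Python using heapq (this illustrates the operation, not the CUB API):

```python
import heapq

data = [17, 3, 99, 42, 8, 56, 23, 91, 4, 77]
k = 3

# Top-K largest values: a bounded heap keeps this O(n log k),
# versus O(n log n) for sorting the entire input.
top_vals = heapq.nlargest(k, data)

# Top-K can also return the indices of the selected items,
# useful when the values are keys into other arrays.
top_idx = heapq.nlargest(k, range(len(data)), key=data.__getitem__)
```

On the GPU, the same selection runs as a single device-wide primitive, avoiding the full sort that Top-K is often implemented with.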

Getting Started

Developers using Ampere, Ada, or Blackwell architectures should consult the cuTile Python Quickstart guide to begin leveraging the new programming capabilities.