NVIDIA
CUDA 13.2 Expands Tile Support to Ampere, Ada, and Blackwell with Enhanced Python Features
release · feature · api · sdk · performance · developer.nvidia.com ↗

CUDA Tile Python Enhancements

CUDA Tile, NVIDIA's tensor programming model, is now fully supported on compute capability 8.X (Ampere and Ada) and 10.X, 11.X, and 12.X (Blackwell) architectures. The cuTile Python DSL has received substantial feature enhancements, enabling developers to write more flexible GPU code:

  • Recursive functions and closures with capture (including lambda functions and nested functions)
  • Custom reduction and scan functions for specialized data processing
  • Type-annotated assignments for improved code clarity
  • Enhanced array slicing with Array.slice to create views on subarrays

Installation is reduced to a single pip command: pip install cuda-tile[tileiras], which pulls in all dependencies without requiring a separate system-wide CUDA Toolkit installation.

Core Runtime and Memory Improvements

CUDA 13.2 introduces several critical runtime enhancements for memory management and device interaction:

  • New cudaMemcpyWithAttributesAsync and cudaMemcpy3DWithAttributesAsync APIs simplify single memory transfers with attribute control, avoiding the need for batched API calls
  • Windows WDDM local memory footprint reduction significantly decreases LMEM usage for register spilling and stack variables, benefiting memory-constrained vGPU environments
  • Memory pool property querying via cudaMemPoolGetAttribute allows developers to inspect and replicate memory pool configurations
  • Default Windows GPU driver mode shift from TCC to MCDM improves compatibility and expands feature access

Math Libraries and Developer Tooling

The math library ecosystem expands with experimental Grouped GEMM with MXFP8 support in cuBLAS for Blackwell GPUs, and FP64-emulated cuSOLVER APIs that deliver significant performance gains for QR, LU, and Cholesky factorizations on platforms where INT8 throughput far exceeds native FP64 throughput.
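The core idea behind FP64 emulation is to decompose a high-precision value into pieces whose pairwise products are exact in lower precision, then recombine them. The actual cuSOLVER emulation uses integer slices on tensor cores; as a minimal sketch of the same principle, here is the classic Veltkamp/Dekker splitting in plain Python:

```python
SPLIT = 2.0**27 + 1  # Veltkamp splitting constant for IEEE binary64

def split(a):
    """Split a double into hi + lo, each with at most 27 mantissa bits."""
    c = SPLIT * a
    hi = c - (c - a)
    lo = a - hi
    return hi, lo  # hi + lo == a, exactly

def two_product(a, b):
    """Return (p, err) such that a*b == p + err exactly (Dekker)."""
    p = a * b
    ahi, alo = split(a)
    bhi, blo = split(b)
    # Each partial product below is exact in FP64 because the factors
    # have few enough mantissa bits; summing them recovers the rounding
    # error of the single FP64 multiply.
    err = ((ahi * bhi - p) + ahi * blo + alo * bhi) + alo * blo
    return p, err

p, err = two_product(1.1, 2.3)
```

The emulated factorizations apply the same decompose-multiply-recombine pattern at matrix scale, which is why they pay off on hardware whose low-precision throughput dwarfs native FP64.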

Developer productivity tools see major updates: NVIDIA Nsight Python enables integrated kernel profiling directly in Python, Numba-CUDA gains initial kernel-debugging support, and Nsight Compute adds report clustering and register dependency visualization. CCCL 3.2 introduces cub::DeviceTopK for efficient Top-K selection, plus new segmented scan and binary search primitives, along with improved interoperability with CUDA Python and CuPy.
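cub::DeviceTopK selects the K largest (or smallest) items from an unsorted input without fully sorting it. For intuition, here is the host-side equivalent in plain Python using heapq (this illustrates the operation, not the CUB API):

```python
import heapq

data = [17, 3, 99, 42, 8, 56, 23, 91, 4, 77]
k = 3

# Top-K largest values: a bounded heap keeps this O(n log k),
# versus O(n log n) for sorting the entire input.
top_vals = heapq.nlargest(k, data)

# Top-K can also return the indices of the selected items,
# useful when the values are keys into other arrays.
top_idx = heapq.nlargest(k, range(len(data)), key=data.__getitem__)
```

On the GPU, the same selection runs as a single device-wide primitive, avoiding the full sort that Top-K is often implemented with.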

Getting Started

Developers using Ampere, Ada, or Blackwell architectures should consult the cuTile Python Quickstart guide to begin leveraging the new programming capabilities.