NVIDIA
NVIDIA releases CUDA 13.2 with expanded Tile support, Python enhancements, and new developer tools
release · feature · api · performance · sdk · developer.nvidia.com

CUDA Tile and Python Expansion

CUDA 13.2 delivers full support for NVIDIA Tile on compute capability 8.X (Ampere, Ada), 10.X, and 12.X (Blackwell) architectures. The cuTile Python domain-specific language gains support for recursive functions, closures with capture, custom reduction and scan functions, and type-annotated assignments, along with improved array slicing through Array.slice, which creates subarray views. Installation takes a single pip command: pip install cuda-tile[tileiras].
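To make "custom reduction and scan functions" concrete, here is a minimal pure-Python sketch of what such operations compute when the combine step is user-supplied. It does not use the cuTile API; the names argmax_combine, custom_reduce, and custom_scan are hypothetical and chosen only for illustration. The combine function is kept associative, which is what a parallel implementation would generally need.

```python
from functools import reduce
from itertools import accumulate


def argmax_combine(left, right):
    """Combine two (value, index) pairs, keeping the larger value.

    On ties the left operand (the lower index) wins, which keeps the
    operation associative when elements are combined in order.
    """
    if right[0] > left[0]:
        return right
    return left


def custom_reduce(values, combine):
    """Fold all elements into one result using the supplied combine function."""
    return reduce(combine, values)


def custom_scan(values, combine):
    """Inclusive scan: element i holds the combination of elements 0..i."""
    return list(accumulate(values, combine))


# Hypothetical tile contents, paired with their positions.
tile = [3.0, 7.5, 1.0, 7.5, 2.0]
pairs = list(zip(tile, range(len(tile))))

best = custom_reduce(pairs, argmax_combine)      # location of the maximum
running = custom_scan(pairs, argmax_combine)     # running maximum so far
```

The same combine function drives both the reduction and the scan, which is the appeal of user-defined operators: one small function yields an argmax, a running maximum, or any other associative fold.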

Core Runtime and Memory Improvements

Two new async APIs, cudaMemcpyWithAttributesAsync and cudaMemcpy3DWithAttributesAsync, enable memory transfers with attributes without going through the batched API interface. Existing cudaMemcpyAsync calls now support attribute overloading for backward-compatible migration. In Windows WDDM mode, the per-context local memory (LMEM) footprint is reduced when used with CUDA Driver R595, which particularly benefits memory-constrained vGPU environments. cudaMemPoolGetAttribute now allows querying memory pool properties, which helps when creating and managing identically configured pools.

Math Libraries and Performance

cuBLAS introduces experimental Grouped GEMM with MXFP8 support for Blackwell GPUs, enabling efficient mixed-precision operations. cuSOLVER adds FP64-emulated APIs that deliver substantial performance gains on INT8-dominant platforms, particularly for QR, LU, and Cholesky factorizations.
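The release summary does not describe how the FP64 emulation works. As an illustration of the general idea of rebuilding high-precision results from INT8-sized pieces, the sketch below uses fixed-point slicing in the spirit of Ozaki-style splitting. This is an assumption for illustration, not the library's documented method, and every name and constant in it (SLICE_BITS, NUM_SLICES, to_int8_slices, emulated_dot) is hypothetical.

```python
import math

SLICE_BITS = 7                             # signed digits in [-64, 63] fit in an int8
NUM_SLICES = 8                             # number of int8-sized slices per value
FRAC_BITS = SLICE_BITS * NUM_SLICES - 2    # fixed-point fraction bits, with headroom


def to_int8_slices(x):
    """Quantize x (|x| <= 1) to fixed point and split it into signed base-128 digits."""
    q = round(x * (1 << FRAC_BITS))
    digits = []
    for _ in range(NUM_SLICES):
        d = ((q + 64) % 128) - 64          # signed digit in [-64, 63]
        digits.append(d)
        q = (q - d) // 128                 # exact: q - d is a multiple of 128
    assert q == 0, "value out of range for the chosen slice count"
    return digits


def emulated_dot(a, b):
    """Dot product assembled only from products of int8-sized digits."""
    a_slices = [to_int8_slices(x) for x in a]
    b_slices = [to_int8_slices(y) for y in b]
    total = 0
    for i in range(NUM_SLICES):
        for j in range(NUM_SLICES):
            # Each partial sum is an integer dot product of small digits;
            # this is the part that would map to low-precision integer units.
            partial = sum(sa[i] * sb[j] for sa, sb in zip(a_slices, b_slices))
            total += partial << (SLICE_BITS * (i + j))
    return total / (1 << (2 * FRAC_BITS))


# Hypothetical inputs, all within [-1, 1].
a = [0.1, -0.25, 0.5, 0.75]
b = [0.3, 0.6, -0.9, 0.2]

emulated = emulated_dot(a, b)
reference = math.fsum(x * y for x, y in zip(a, b))
```

All arithmetic between slicing and the final division is exact integer work, so the only error comes from the initial fixed-point quantization. Raising NUM_SLICES trades more small-integer products for more precision.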

Developer Tools

NVIDIA Nsight Python adds kernel profiling directly within Python workflows, and initial support for Numba-CUDA kernel debugging lets developers debug GPU kernels written in Numba. Nsight Compute gains report clustering and register dependency visualization. CCCL 3.2 provides modern C++ runtime APIs and adds cub::DeviceTopK for efficient Top-K selection, fixed-size segmented reduction variants, segmented scan, and binary search primitives. It also improves interoperability with CUDA Python and CuPy.
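For readers unfamiliar with the primitives named above, the following pure-Python sketch shows what Top-K selection, a fixed-size segmented reduction, and a segmented scan compute. It is a CPU reference for the semantics only and does not call CCCL. The function names are hypothetical, and the sketch's choice to return the Top-K results sorted is its own, since the source does not state what ordering cub::DeviceTopK guarantees.

```python
import heapq
from itertools import accumulate


def top_k(values, k):
    """Return the k largest values, largest first."""
    return heapq.nlargest(k, values)


def fixed_size_segmented_sum(values, segment_size):
    """Sum each consecutive run of segment_size elements independently."""
    if len(values) % segment_size != 0:
        raise ValueError("input length must be a multiple of segment_size")
    return [
        sum(values[start:start + segment_size])
        for start in range(0, len(values), segment_size)
    ]


def fixed_size_segmented_scan(values, segment_size):
    """Inclusive running sum that restarts at every segment boundary."""
    if len(values) % segment_size != 0:
        raise ValueError("input length must be a multiple of segment_size")
    result = []
    for start in range(0, len(values), segment_size):
        result.extend(accumulate(values[start:start + segment_size]))
    return result


# Hypothetical input: six scores, treated as two segments of three.
scores = [4, 9, 1, 7, 3, 8]

largest = top_k(scores, 3)
segment_sums = fixed_size_segmented_sum(scores, 3)
segment_scan = fixed_size_segmented_scan(scores, 3)
```

Because every segment has the same length, no offsets array is needed to mark segment boundaries, which is the main simplification a fixed-size variant offers over a general segmented reduction.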