CUDA Tile Expansion and Python Enhancements
CUDA 13.2 significantly expands the CUDA Tile programming model, which is now fully supported on compute capability 8.x (Ampere, Ada) and 10.x, 11.x, and 12.x (Blackwell) architectures. The cuTile Python DSL gains support for recursive functions, closures with variable capture, custom reduction and scan functions, type-annotated assignments, and improved array slicing. Installation is simplified to a single pip command (pip install cuda-tile[tileiras]) that pulls in all dependencies automatically, with no separate CUDA Toolkit installation required.
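To make the custom reduction and scan feature concrete, the sketch below shows the semantics such user-supplied operators express. This is plain host-side NumPy, not the cuTile DSL itself: real cuTile kernels use the DSL's tile types and run on the GPU, and the function names here are illustrative only.

```python
# Host-side illustration of what a user-defined reduction and scan compute.
# NOT cuTile syntax; names and structure here are illustrative assumptions.
import numpy as np

def custom_reduce(op, tile, init):
    """Fold a user-supplied binary op over a tile, like a custom reduction."""
    acc = init
    for x in tile:
        acc = op(acc, x)
    return acc

def inclusive_scan(op, tile, init):
    """Running reduction (inclusive scan) using the same user-supplied op."""
    out, acc = [], init
    for x in tile:
        acc = op(acc, x)
        out.append(acc)
    return np.array(out)

tile = np.array([3, 1, 4, 1, 5], dtype=np.int64)
# A custom operator (running max of absolute values) that a built-in
# sum/min/max reduction could not express directly.
absmax = lambda a, b: max(a, abs(b))
print(custom_reduce(absmax, tile, 0))   # 5
print(inclusive_scan(absmax, tile, 0))  # [3 3 4 4 5]
```

The point of the feature is that the operator (`absmax` here) is user code rather than a fixed built-in, while the library still handles the parallel combination pattern.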
Core Runtime and Memory Management Improvements
The core CUDA runtime introduces two new APIs, cudaMemcpyWithAttributesAsync and cudaMemcpy3DWithAttributesAsync, which enable memory transfers with attribute control without going through the batched interface. Windows users benefit from a significant reduction in the local memory (LMEM) footprint under the WDDM driver mode, which is especially valuable in memory-constrained vGPU environments. A new cudaMemPoolGetAttribute API lets developers query memory pool properties, enabling programmatic pool creation and management. Finally, the default Windows GPU driver mode shifts from TCC to MCDM for improved compatibility and feature access.
Math Libraries and Compute Performance
The math libraries receive substantial upgrades, including experimental Grouped GEMM support with the MXFP8 data type in cuBLAS for Blackwell GPUs. The release also introduces FP64-emulated cuSOLVER APIs, which deliver significant performance gains on platforms where low-precision (INT8) throughput dominates, with notable improvements in QR, LU, and Cholesky factorizations. These additions expand the toolkit's capabilities for mixed-precision and specialized compute workloads.
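The idea behind FP64 emulation is to rebuild double-precision results from products computed in lower precision. cuSOLVER's implementation targets INT8 tensor cores (an Ozaki-type integer-slicing scheme); the NumPy sketch below illustrates the same splitting principle with a simpler two-way FP32 split, which is an assumption for illustration, not cuSOLVER's actual algorithm.

```python
# Precision-splitting sketch: approximate an FP64 GEMM from FP32-representable
# factors. Illustrates the principle behind FP64 emulation, not cuSOLVER's
# actual INT8-based scheme.
import numpy as np

def emulated_matmul(A, B):
    """Approximate float64 A @ B using two-way float32 splits of A and B."""
    A_hi = A.astype(np.float32)           # top ~24 bits of each entry
    A_lo = (A - A_hi).astype(np.float32)  # next ~24 bits (the residual)
    B_hi = B.astype(np.float32)
    B_lo = (B - B_hi).astype(np.float32)
    f64 = np.float64
    # Products of float32 values are exact in float64, mimicking a
    # low-precision multiplier feeding a wide accumulator. The negligible
    # lo * lo cross term is dropped.
    return (A_hi.astype(f64) @ B_hi.astype(f64)
            + A_hi.astype(f64) @ B_lo.astype(f64)
            + A_lo.astype(f64) @ B_hi.astype(f64))

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))
B = rng.standard_normal((64, 64))
exact = A @ B
err_emulated = np.abs(emulated_matmul(A, B) - exact).max()
err_fp32 = np.abs(
    (A.astype(np.float32) @ B.astype(np.float32)).astype(np.float64) - exact
).max()
# The emulated result is orders of magnitude closer to FP64 than plain FP32.
print(err_emulated, err_fp32)
```

Each split doubles the effective mantissa width, so three low-precision products recover near-FP64 accuracy; the hardware win comes from low-precision multipliers being far faster than native FP64 units.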
Developer Tooling and Profiling Ecosystem
Developer productivity improvements include NVIDIA Nsight Python for kernel profiling directly within Python environments, initial support for debugging Numba-CUDA kernels, and enhanced Nsight Compute features such as report clustering and register dependency visualization. Nsight Cloud and the new Nsight Copilot, an AI assistant for CUDA programming, provide advanced profiling and development support. CUDA Core Compute Libraries (CCCL) 3.2 delivers modern C++ runtime APIs with new primitives including cub::DeviceTopK for efficient Top-K selection, segmented reduction and scan operations, and improved interoperability with CUDA Python and CuPy.
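For readers unfamiliar with the primitive, the result cub::DeviceTopK produces can be illustrated on the host: select the K largest items and their indices without fully sorting the input. The NumPy sketch below shows the same semantics via `np.argpartition` (the GPU primitive itself is a C++ device-wide algorithm).

```python
# Host-side illustration of Top-K selection semantics (what cub::DeviceTopK
# computes on the GPU): the k largest values and their indices, without
# sorting the entire input.
import numpy as np

def top_k(values, k):
    """Return the k largest values and their indices, largest first."""
    idx = np.argpartition(values, -k)[-k:]  # unordered top-k indices, O(n)
    order = np.argsort(values[idx])[::-1]   # sort only those k items
    idx = idx[order]
    return values[idx], idx

scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
vals, idx = top_k(scores, 3)
print(vals)  # [0.9 0.7 0.4]
print(idx)   # [1 3 2]
```

Avoiding a full sort is what makes a dedicated Top-K primitive attractive: partial selection is asymptotically cheaper than sorting when k is much smaller than the input size.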