NVIDIA
CUDA 13.2 brings CUDA Tile to Ampere/Ada/Blackwell GPUs, adds Python profiling and math library improvements
release, feature, api, sdk, platform, performance · developer.nvidia.com

CUDA Tile Expansion

CUDA 13.2 extends CUDA Tile support to compute capability 8.X (Ampere and Ada), 10.X, and 12.X (Blackwell) architectures. The cuTile Python DSL gains recursive functions, closures with capture, custom reduction and scan functions, type-annotated assignments, and improved array slicing. Installation takes a single pip command: pip install cuda-tile[tileiras] (quote the package name in shells such as zsh, which treat square brackets as glob patterns).
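The architecture list above can be turned into a quick pre-install check. The following is a minimal sketch, not part of cuTile or the CUDA toolkit: the helper names are hypothetical, and it assumes the "X" in 8.X, 10.X, and 12.X means any minor revision of that major version.

```python
# Major compute-capability versions this summary lists for CUDA Tile in CUDA 13.2.
# Assumption: any minor revision of these majors qualifies.
TILE_SUPPORTED_MAJORS = frozenset({8, 10, 12})


def parse_compute_capability(text):
    """Parse a string such as '8.6' into a (major, minor) tuple of ints."""
    major, _, minor = text.strip().partition(".")
    return int(major), int(minor or 0)


def supports_cuda_tile(compute_capability):
    """Return True if the compute capability falls in the listed families."""
    major, _minor = parse_compute_capability(compute_capability)
    return major in TILE_SUPPORTED_MAJORS


if __name__ == "__main__":
    for cc in ("7.5", "8.6", "8.9", "9.0", "10.0", "12.0"):
        status = "listed" if supports_cuda_tile(cc) else "not listed"
        print(f"compute capability {cc}: {status}")
```

The compute capability string itself can typically be read with nvidia-smi --query-gpu=compute_cap --format=csv,noheader on recent drivers. The check only mirrors the list in this summary; the release notes remain the authority on supported devices.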

Core Runtime Improvements

The release introduces new memory transfer APIs (cudaMemcpyWithAttributesAsync and cudaMemcpy3DWithAttributesAsync) that give finer control over individual memory transfers without requiring the batched interface. Windows WDDM mode now has a reduced per-context local memory footprint, which helps in memory-constrained environments. New cudaMemPoolGetAttribute APIs allow querying memory pool properties, facilitating programmatic pool creation and management.

Math and Compiler Libraries

cuBLAS adds experimental grouped GEMM support with MXFP8 precision for Blackwell GPUs. cuSOLVER introduces FP64-emulated APIs with significant performance gains on INT8-dominant platforms for QR, LU, and Cholesky factorizations. CCCL 3.2 delivers modern C++ runtime APIs, introduces cub::DeviceTopK for efficient top-K selection, adds fixed-size segmented reduction, and provides new segmented scan and binary search primitives.
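For readers unfamiliar with the CCCL primitives named above, here is a CPU-side reference of what top-K selection and fixed-size segmented reduction compute. It is a sketch of the semantics only, written in plain Python; it does not call or mirror the cub::DeviceTopK interface, and the function names are hypothetical. This summary does not say whether the GPU primitive returns its results in sorted order, so the sorted output here is a property of this reference alone.

```python
import heapq


def top_k(values, k, largest=True):
    """Return the k largest (or smallest) items, sorted best-first."""
    if k < 0:
        raise ValueError("k must be non-negative")
    pick = heapq.nlargest if largest else heapq.nsmallest
    return pick(k, values)


def fixed_size_segmented_sum(values, segment_size):
    """Sum consecutive, equally sized segments of a flat sequence."""
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    if len(values) % segment_size != 0:
        raise ValueError("length must be a multiple of segment_size")
    return [
        sum(values[start:start + segment_size])
        for start in range(0, len(values), segment_size)
    ]


if __name__ == "__main__":
    scores = [5, 1, 9, 3, 7]
    print(top_k(scores, 2))                  # two largest values
    print(top_k(scores, 2, largest=False))   # two smallest values
    print(fixed_size_segmented_sum([1, 2, 3, 4, 5, 6], 3))
```

A reference like this is mainly useful for validating GPU results on small inputs. Because every segment has the same length, the segmented sum needs no offsets array, which is the simplification a fixed-size variant offers over a general segmented reduction.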

Developer Tooling

NVIDIA Nsight Python enables integrated kernel profiling directly in Python environments. Initial support for Numba-CUDA kernel debugging is available. Nsight Compute receives report clustering and register dependency visualization. Nsight Copilot launches as an AI-powered CUDA assistant, while Nsight Systems gains PyTorch profiling improvements and GPUDirect Storage updates for enhanced performance analysis across frameworks.

Getting Started

Developers using Ampere, Ada, or Blackwell GPUs should consult the cuTile Python Quickstart guide. The toolkit is available for download from the CUDA Downloads page.