CUDA Tile Expansion and Python Enhancements
CUDA Tile is now fully supported on compute capability 8.X (Ampere, Ada), 10.X, and 12.X (Blackwell) GPU architectures. cuTile Python, the Python domain-specific language for CUDA Tile, adds several language features: recursive functions, closures with capture, custom reduction and scan functions, type-annotated assignments, and improved array slicing. Installation is simplified to pip install cuda-tile[tileiras], which removes the need for a separate CUDA Toolkit installation.
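The supported-architecture list above can be turned into a small pre-flight check. The following is a minimal sketch in plain Python, not part of cuTile: the helper name is_cuda_tile_supported and the constant SUPPORTED_MAJORS are hypothetical, and the set of major versions is copied directly from the paragraph above rather than queried from any library.

```python
# Major compute-capability versions listed in the text as fully supported.
# Assumption: this set is transcribed from the release summary, not read
# from cuTile itself; both names below are hypothetical helpers.
SUPPORTED_MAJORS = frozenset({8, 10, 12})


def is_cuda_tile_supported(major: int, minor: int = 0) -> bool:
    """Return True if compute capability major.minor falls in a listed family.

    Only the major version matters here because the text lists whole
    families (8.X, 10.X, 12.X).
    """
    if major < 0 or minor < 0:
        raise ValueError("compute capability components must be non-negative")
    return major in SUPPORTED_MAJORS


if __name__ == "__main__":
    # (9, 0) is included only to show a family that the text does not list.
    for capability in [(8, 0), (8, 9), (9, 0), (10, 0), (12, 0)]:
        print(capability, is_cuda_tile_supported(*capability))
```

In practice the (major, minor) pair would come from whatever device query the surrounding framework already provides; the check itself stays independent of that choice.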
Core Runtime and Memory Improvements
Two new API functions, cudaMemcpyWithAttributesAsync and cudaMemcpy3DWithAttributesAsync, let a single memory transfer carry attributes directly, so individual transfers no longer have to go through the batched APIs. On Windows in WDDM driver mode, the local memory (LMEM) footprint has been significantly reduced, which particularly benefits memory-constrained vGPU environments. New cudaMemPoolGetAttribute APIs let developers query memory pool properties, supporting use cases such as creating a new memory pool that is identical to an existing one.
Math Libraries and Performance Advances
Math library updates include experimental Grouped GEMM with MXFP8 support in cuBLAS for Blackwell GPUs, and FP64-emulated cuSOLVERDn APIs delivering significant performance gains on INT8-dominant platforms for QR, LU, and Cholesky factorizations. These additions provide developers with optimized paths for modern workloads.
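For readers unfamiliar with the term, a grouped GEMM is a set of independent matrix multiplications whose shapes may differ from group to group, submitted together. The sketch below is a pure-Python CPU reference of that semantics only. It does not call cuBLAS, does not model MXFP8, and the function names matmul and grouped_gemm are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive dense product of an (m x k) matrix and a (k x n) matrix."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def grouped_gemm(a_list: Sequence[Matrix], b_list: Sequence[Matrix]) -> List[Matrix]:
    """Compute C_i = A_i @ B_i for every group; shapes may vary per group."""
    if len(a_list) != len(b_list):
        raise ValueError("each A matrix needs a matching B matrix")
    return [matmul(a, b) for a, b in zip(a_list, b_list)]


if __name__ == "__main__":
    # Group 0 is (1 x 2) @ (2 x 1); group 1 is (2 x 2) @ (2 x 2).
    a_groups = [[[1.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]]]
    b_groups = [[[3.0], [4.0]], [[5.0, 6.0], [7.0, 8.0]]]
    print(grouped_gemm(a_groups, b_groups))
```

A reference like this is mainly useful for checking results from an accelerated path on small inputs, since each group can be verified independently.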
Developer Tooling Enhancements
Developer productivity tools have been significantly expanded. NVIDIA Nsight Python adds integrated kernel profiling directly within Python environments, while initial support for Numba-CUDA kernel debugging is now available. Nsight Compute gains report clustering and register dependency visualization, Nsight Systems improves PyTorch profiling and GPUDirect Storage support, and Nsight Copilot provides an AI-powered CUDA assistant.
CCCL 3.2 Updates
CUDA Core Compute Libraries (CCCL) 3.2 delivers modern, idiomatic CUDA C++ runtime APIs and introduces cub::DeviceTopK for efficient Top-K selection, fixed-size segmented reduction variants, and new segmented scan and binary search primitives. Integration with updated CUDA Python and CuPy supports interoperability across frameworks and more streamlined development workflows.
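Two of the primitives named above have simple reference semantics that can be written on the CPU. The sketch below uses only the Python standard library and is an illustration of what Top-K selection and a fixed-size segmented reduction compute. It is not the CUB interface, and the names top_k and segmented_reduce_fixed are hypothetical. The descending order of the Top-K result is a property of heapq.nlargest, not a statement about the ordering cub::DeviceTopK produces.

```python
import heapq
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def top_k(values: Sequence[float], k: int) -> List[float]:
    """Return the k largest values, largest first (CPU reference only)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return heapq.nlargest(k, values)


def segmented_reduce_fixed(
    values: Sequence[T],
    segment_size: int,
    op: Callable[[T, T], T],
    init: T,
) -> List[T]:
    """Reduce consecutive, equally sized segments of values with op.

    Every segment has exactly segment_size elements, so no per-segment
    offset array is needed; the input length must be a multiple of it.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    if len(values) % segment_size != 0:
        raise ValueError("input length must be a multiple of segment_size")
    results = []
    for start in range(0, len(values), segment_size):
        acc = init
        for item in values[start:start + segment_size]:
            acc = op(acc, item)
        results.append(acc)
    return results


if __name__ == "__main__":
    print(top_k([5, 1, 9, 3, 7], 2))
    print(segmented_reduce_fixed([1, 2, 3, 4, 5, 6], 2, lambda x, y: x + y, 0))
```

The fixed segment size is what distinguishes this variant from a general segmented reduction: segment boundaries follow from the index alone, which is the property a device implementation can exploit.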