CCCL 3.1 Introduces Determinism Controls
NVIDIA's CUDA Core Compute Libraries (CCCL) 3.1 now gives developers explicit control over floating-point determinism in reduction operations through a new single-phase API for CUB (CUDA Unbound), a low-level library of parallel device algorithms. This addresses a fundamental challenge in parallel computing: floating-point addition is not associative, so (a + b) + c may not equal a + (b + c), because each operation rounds its result to finite precision and a parallel reduction may group operands in a different order on every run.
Three Determinism Levels
The new API offers three configurable determinism levels via the execution environment:
- not_guaranteed: Uses atomic operations and a single kernel launch for maximum performance. Results may vary slightly between runs because atomics complete in an unspecified order across threads, making this mode ideal for applications that tolerate minor floating-point differences.
- run_to_run: The default mode. Guarantees bitwise-identical results across repeated runs on the same GPU by using a fixed hierarchical reduction tree, providing reproducibility at performance comparable to not_guaranteed.
- gpu_to_gpu: Provides the highest reproducibility level, guaranteeing identical results across different GPUs using a Reproducible Floating-point Accumulator (RFA). This comes with a 20-30% performance overhead for large datasets but ensures strict reproducibility and tighter error bounds.
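Selecting a level is a one-line requirement on the execution environment. The sketch below assumes the `cuda::execution` namespace and `determinism` level names as described in this release; the exact header is an assumption:

```cpp
#include <cuda/execution>  // assumed header for execution environments

namespace ex = cuda::execution;

// Each call builds an execution environment requiring one determinism level.
auto env_fast    = ex::require(ex::determinism::not_guaranteed); // atomics, fastest
auto env_default = ex::require(ex::determinism::run_to_run);     // same GPU, same bits
auto env_strict  = ex::require(ex::determinism::gpu_to_gpu);     // RFA, portable bits
```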
Implementation
Developers specify determinism by using cuda::execution::require() to construct an execution environment object. The single-phase API accepts this environment as a parameter; the legacy two-phase API does not support determinism configuration. Usage is straightforward: construct input/output vectors, create an execution environment with the desired determinism level, and pass it to cub::DeviceReduce::Sum() or another reduction function.
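An end-to-end sketch of that flow might look like the following. It requires the CUDA toolkit; thrust::device_vector for storage and the exact single-phase DeviceReduce::Sum overload shown are assumptions based on the API as described:

```cpp
#include <cub/device/device_reduce.cuh>
#include <cuda/execution>             // assumed header for cuda::execution
#include <thrust/device_vector.h>

int main() {
    thrust::device_vector<float> in(1 << 20, 1.0f);  // 2^20 ones
    thrust::device_vector<float> out(1);

    // Require bitwise-identical results across runs on this GPU.
    auto env = cuda::execution::require(
        cuda::execution::determinism::run_to_run);

    // Single-phase API: no separate temp-storage query; the environment
    // carries the determinism requirement into the reduction.
    cub::DeviceReduce::Sum(in.begin(), out.begin(), in.size(), env);

    cudaDeviceSynchronize();
    // All inputs and partial sums here are exactly representable,
    // so every level yields 1048576; run_to_run additionally pins
    // the reduction order, so the bits match on every run.
    float result = out[0];
    return result == float(1 << 20) ? 0 : 1;
}
```

Swapping in determinism::gpu_to_gpu would be the only change needed to extend the guarantee across different GPUs, at the performance cost noted above.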