What Changed
NVIDIA CCCL 3.1 added a new single-phase API to CUB's reduce algorithm that accepts an execution environment parameter, enabling explicit control over floating-point determinism. This addresses a core challenge in parallel computing: floating-point addition and multiplication aren't strictly associative due to rounding, so operation ordering directly impacts results.
Three Determinism Levels
Not Guaranteed: Uses atomic operations and a single kernel launch for maximum performance, but permits small run-to-run variations because atomics execute in an unspecified order.
Run-to-Run (Default): Implements a fixed hierarchical reduction tree: elements are combined within threads, then warps, then blocks, and the per-block partials are aggregated last. Guarantees bitwise-identical results across runs on the same GPU.
GPU-to-GPU: Uses the Reproducible Floating-point Accumulator (RFA) to guarantee bitwise-identical results across different GPUs by grouping inputs into exponent bins before accumulating. Costs roughly 20-30% more execution time in exchange for strict cross-GPU reproducibility.
Developer Usage
The new API is straightforward: developers construct a cuda::std::execution::env object via cuda::execution::require() and pass it to the reduction call. The classic two-phase API (a temporary-storage size query followed by a second call) does not accept this environment parameter, so the single-phase API must be used to customize determinism behavior.
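Based on the description above, a single-phase call might look like the sketch below. The exact spelling of the determinism tokens (here assumed to live under cuda::execution::determinism) and the precise overload are inferred, not quoted; consult the CUB/CCCL 3.1 documentation for the authoritative signatures.

```cuda
// Hedged sketch of the CCCL 3.1 single-phase reduce with an execution
// environment. Token names under cuda::execution::determinism are
// assumptions; check the CUB docs for the exact identifiers.
#include <cub/device/device_reduce.cuh>
#include <cuda/std/functional>

void deterministic_sum(const float* d_in, float* d_out, int num_items)
{
    namespace cx = cuda::execution;

    // Build the environment via cuda::execution::require(); here we ask
    // for bitwise-identical results across different GPUs.
    auto env = cx::require(cx::determinism::gpu_to_gpu);

    // Single-phase call: no separate temporary-storage query, and the
    // environment is passed straight to the reduction.
    cub::DeviceReduce::Reduce(d_in, d_out, num_items,
                              cuda::std::plus<float>{}, 0.0f, env);
}
```

Swapping gpu_to_gpu for the run-to-run or not-guaranteed token would select the other two levels described above.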
Impact
This feature matters for scientific computing, machine learning reproducibility, and regulated industries that require deterministic calculations. The three modes let developers balance performance against reproducibility for their specific requirements.