New Determinism Controls for CUB Reductions
NVIDIA CCCL 3.1 adds explicit control over floating-point determinism in the CUB library's reduction algorithms through a new single-phase API. Developers can now specify determinism levels via an execution environment, addressing a critical challenge in parallel computing: ensuring reproducible results across multiple runs.
Three Determinism Modes
The update introduces three configurable determinism levels:
not_guaranteed: Uses atomic operations and single-kernel execution for maximum performance, but may produce slightly different results across runs due to non-deterministic thread execution order.
run_to_run (default): Guarantees bitwise-identical results across multiple invocations on the same GPU using fixed hierarchical reduction trees with shuffle instructions and shared memory operations.
gpu_to_gpu: Ensures reproducible results across different GPUs using the Reproducible Floating-point Accumulator (RFA) technique, with a 20-30% performance penalty for large datasets.
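The difference between an order-dependent accumulation and a fixed reduction tree can be sketched on the host in plain C++. This is only an illustration of the idea, not CUB's implementation: tree_sum and arrival_sum are hypothetical helper names, and the arrival vector stands in for the unpredictable order in which atomic additions might land on a GPU.

```cpp
#include <cstddef>
#include <vector>

// Fixed-shape pairwise reduction (illustrative, not CUB's kernel):
// element i is always combined with element i + stride, so the sequence
// of additions depends only on the input size, never on timing.
float tree_sum(std::vector<float> v) {
    if (v.empty()) return 0.0f;
    for (std::size_t stride = 1; stride < v.size(); stride *= 2) {
        for (std::size_t i = 0; i + stride < v.size(); i += 2 * stride) {
            v[i] += v[i + stride];
        }
    }
    return v[0];
}

// Order-dependent accumulation: adds values in the order listed in
// `arrival`, a stand-in for atomic adds whose order can vary per run.
float arrival_sum(const std::vector<float>& v,
                  const std::vector<std::size_t>& arrival) {
    float total = 0.0f;
    for (std::size_t idx : arrival) {
        total += v[idx];
    }
    return total;
}
```

With the input {1e8f, 1.0f, -1e8f, 1.0f}, arrival_sum returns 1 for the order {0, 1, 2, 3} and 2 for the order {0, 2, 1, 3}, while tree_sum returns the same value on every call because its addition order is fixed by the input size.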
Implementation Details
The new API leverages NVIDIA's execution environment abstraction, allowing developers to specify determinism requirements at the call site by passing an environment object as the final argument:
auto env = cuda::execution::require(cuda::execution::determinism::not_guaranteed);
cub::DeviceReduce::Sum(input.begin(), output.begin(), input.size(), env);
Determinism configuration is available only through the new single-phase API; the legacy two-phase API does not support it. The controls exist because floating-point addition and multiplication are not strictly associative: each operation rounds to finite precision, so the order in which a parallel reduction combines values can change the result.
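A minimal host-side sketch of that non-associativity, using single-precision addition. The helper names sum_left and sum_right are made up for this illustration, and the results noted below assume IEEE 754 single precision with default rounding and no fast-math style reordering by the compiler.

```cpp
// Computes (a + b) + c; the intermediate is stored in a float so it is
// rounded to single precision before the second addition.
float sum_left(float a, float b, float c) {
    float ab = a + b;
    return ab + c;
}

// Computes a + (b + c) with the same rounding of the intermediate.
float sum_right(float a, float b, float c) {
    float bc = b + c;
    return a + bc;
}
```

For a = 1e8f, b = -1e8f, c = 1.0f, sum_left yields 1 while sum_right yields 0: near 1e8 adjacent floats are 8 apart, so adding 1 to -1e8 rounds back to -1e8 and the small term is lost.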
Use Cases and Trade-offs
Applications requiring strict reproducibility, such as scientific computing, financial modeling, or regulatory compliance, can now opt for gpu_to_gpu determinism despite its performance cost. Performance-critical applications can use not_guaranteed mode for maximum throughput, while general-purpose applications default to run_to_run determinism, which gives consistent results on a given GPU without the gpu_to_gpu penalty.