NVIDIA CCCL 3.1 adds floating-point determinism controls to GPU reduction algorithms
· release · api · feature · performance · developer.nvidia.com ↗

New Determinism Controls for CUB Reductions

NVIDIA CCCL 3.1 adds explicit control over floating-point determinism in the CUB library's reduction algorithms through a new single-phase API. Developers can now specify determinism levels via an execution environment, addressing a critical challenge in parallel computing: ensuring reproducible results across multiple runs.

Three Determinism Modes

The update introduces three configurable determinism levels:

  • not_guaranteed: Uses atomic operations and single-kernel execution for maximum performance, but may produce slightly different results across runs due to non-deterministic thread execution order.
  • run_to_run (default): Guarantees bitwise-identical results across multiple invocations on the same GPU using fixed hierarchical reduction trees with shuffle instructions and shared memory operations.
  • gpu_to_gpu: Ensures reproducible results across different GPUs using the Reproducible Floating-point Accumulator (RFA) technique, with a 20-30% performance penalty for large datasets.

Implementation Details

The new API leverages NVIDIA's execution environment abstraction, allowing developers to specify determinism requirements at runtime:

#include <cub/device/device_reduce.cuh>  // single-phase API, CCCL 3.1+

auto env = cuda::execution::require(cuda::execution::determinism::not_guaranteed);
cub::DeviceReduce::Sum(input.begin(), output.begin(), input.size(), env);

This is only available through the new single-phase API; the legacy two-phase API does not support determinism configuration. The implementation addresses the fundamental mathematical challenge that floating-point addition and multiplication are not strictly associative due to rounding errors with finite precision.

Use Cases and Trade-offs

Applications requiring strict reproducibility—such as scientific computing, financial modeling, or regulatory compliance—can now opt for GPU-to-GPU determinism despite performance costs. Performance-critical applications can use not_guaranteed mode for maximum throughput, while general-purpose applications default to run_to_run determinism for consistent results without overhead.