NVIDIA
NVIDIA CCCL 3.1 adds configurable floating-point determinism for GPU reductions
· release · feature · api · sdk · performance · developer.nvidia.com

Floating-Point Determinism Now Configurable

NVIDIA CUDA Core Compute Libraries (CCCL) 3.1 introduces a new execution environment API that gives developers explicit control over reduction determinism. The update targets a fundamental challenge in parallel computing: floating-point arithmetic is not strictly associative, meaning (a + b) + c may not equal a + (b + c) due to rounding. This makes it difficult to guarantee identical results across multiple runs, especially in GPU parallel code.
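The non-associativity problem is easy to demonstrate on the CPU. In the sketch below (illustrative helper names, not part of CCCL), operands of very different magnitude make the small term vanish under one grouping but survive under the other:

```cpp
#include <cstdio>

// Floating-point addition is not associative: grouping changes rounding.
// Here (a + b) cancels exactly, so c survives; but (b + c) rounds c away
// because one float ulp near 1e8 is 8, far larger than 1.
float assoc_left()  { float a = 1.0e8f, b = -1.0e8f, c = 1.0f; return (a + b) + c; }
float assoc_right() { float a = 1.0e8f, b = -1.0e8f, c = 1.0f; return a + (b + c); }
```

Since a parallel reduction's grouping depends on how work is scheduled, two runs over the same data can legitimately produce different sums unless the reduction order is pinned down.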

Three Determinism Levels

The new single-phase API accepts an execution environment parameter that configures the reduce algorithm's determinism property:

  • not_guaranteed: Fastest mode using atomic operations and single kernel launch; results may vary between runs
  • run_to_run (default): Uses fixed hierarchical reduction trees to guarantee identical bitwise results across multiple runs on the same GPU
  • gpu_to_gpu: Employs the Reproducible Floating-point Accumulator (RFA) to ensure bitwise-identical results across different GPUs; 20-30% slower for large datasets

Implementation Details

The run_to_run mode structures reductions as a fixed tree: elements combine within threads, intermediate results reduce across warps using shuffle instructions and then across blocks using shared memory, and finally a second kernel aggregates the per-block results. This predetermined sequence ensures consistent outputs regardless of thread execution timing.
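The key property of such a fixed tree can be sketched on the CPU. This hypothetical helper (not the CCCL implementation, which uses warp shuffles and shared memory) always pairs elements in the same predetermined order, so repeated runs over the same input yield bitwise-identical results:

```cpp
#include <cstddef>
#include <vector>

// CPU sketch of a fixed reduction tree: combine neighbors pairwise in a
// predetermined order, level by level, independent of any scheduling.
// The pairing (v0+v1), (v2+v3), ... never varies, so the rounding
// sequence -- and therefore the result -- is identical on every run.
float tree_sum(std::vector<float> v) {
    if (v.empty()) return 0.0f;
    while (v.size() > 1) {
        std::vector<float> next((v.size() + 1) / 2);
        for (std::size_t i = 0; i < next.size(); ++i) {
            float lhs = v[2 * i];
            float rhs = (2 * i + 1 < v.size()) ? v[2 * i + 1] : 0.0f;
            next[i] = lhs + rhs;  // fixed pairing at every level
        }
        v.swap(next);
    }
    return v[0];
}
```

Because the tree shape depends only on the input size, the result is reproducible run to run on the same hardware; it can still differ across GPUs whose launch geometry changes the tree.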

GPU-to-GPU determinism uses the RFA technique to group inputs by exponent bins, enabling strict reproducibility and tighter error bounds at the cost of increased execution time. Developers can now choose the determinism level that best fits their application's requirements for correctness versus performance.
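The intuition behind exponent binning can be illustrated with a much-simplified toy (this is an assumption-laden sketch, not the actual RFA algorithm): snap each input to a fixed-point grid chosen by its exponent bin, accumulate with exact integer addition, which is associative, then combine the bins in a fixed order. The result is then identical for any permutation of the input, regardless of hardware:

```cpp
#include <cmath>
#include <cstdint>
#include <map>
#include <vector>

// Toy binned fixed-point accumulator (illustration only; the real RFA
// is considerably more elaborate). Each value is rounded once onto a
// grid of spacing 2^(b*W - FRAC) for its exponent bin b, then summed
// with int64 arithmetic, which is exact and order-independent.
double binned_sum(const std::vector<double>& xs) {
    const int W = 16;     // exponent range covered by one bin (assumed)
    const int FRAC = 30;  // fractional bits kept within a bin (assumed)
    std::map<int, std::int64_t> bins;
    for (double x : xs) {
        if (x == 0.0) continue;
        int b = std::ilogb(x) / W;  // choose bin by magnitude
        // represent x as an integer multiple of 2^(b*W - FRAC)
        bins[b] += std::llround(std::scalbn(x, FRAC - b * W));
    }
    double total = 0.0;
    for (const auto& [b, acc] : bins)  // fixed combine order
        total += std::scalbn(static_cast<double>(acc), b * W - FRAC);
    return total;
}
```

The one-time rounding per element costs a little accuracy and throughput, which mirrors the 20-30% slowdown the article reports for gpu_to_gpu mode, in exchange for bitwise reproducibility.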

How to Use

The API is accessed through cuda::execution::require() to construct an execution environment object:

auto env = cuda::execution::require(cuda::execution::determinism::gpu_to_gpu);
cub::DeviceReduce::Sum(input.begin(), output.begin(), input.size(), env);

The determinism level is changed by passing not_guaranteed, run_to_run, or gpu_to_gpu to require(). Note that this feature is only available through the new single-phase API, not the legacy two-phase API.