Source: developer.nvidia.com

Floating-Point Determinism Control in CCCL 3.1

NVIDIA CUDA Core Compute Libraries (CCCL) 3.1 adds a new single-phase API to the CUB library that enables explicit control over reduction determinism. Developers can now specify their reproducibility requirements at runtime through an execution environment, choosing among three determinism levels based on their application's needs.

Three Determinism Modes

not_guaranteed: The fastest option, using atomic operations and a single kernel launch to maximize performance. Results may differ slightly between runs due to non-deterministic operation ordering, making this mode suitable for applications where minor numerical differences are acceptable.

run_to_run: The default mode, guaranteeing bitwise-identical results across multiple runs on the same GPU with identical input and kernel configuration. This is achieved through a fixed hierarchical reduction tree that eliminates order-dependent variations while maintaining reasonable performance.

gpu_to_gpu: The strictest mode, ensuring bitwise-identical results across different GPUs using NVIDIA's Reproducible Floating-point Accumulator (RFA). This mode groups inputs by exponent bins for reproducibility but incurs a 20-30% performance penalty for large datasets.

Implementation and Usage

The new API is accessed through cuda::execution::require(), which constructs an execution environment with the desired determinism level. This gives developers fine-grained control without changes to existing call sites: the environment is simply passed as an extra argument to cub::DeviceReduce::Sum() or other reduction operations.
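A minimal sketch of that flow is below. The names cuda::execution::require, the determinism levels, and cub::DeviceReduce::Sum come from the article; the exact header set and overload signature of the single-phase call are assumptions, so consult the CCCL 3.1 documentation for the precise forms.

```cuda
#include <cub/device/device_reduce.cuh>

int main() {
    const int n = 1 << 20;
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    // ... fill d_in with input data ...

    // Build an execution environment requesting run-to-run determinism
    // (swap in determinism::gpu_to_gpu or determinism::not_guaranteed as needed).
    auto env = cuda::execution::require(cuda::execution::determinism::run_to_run);

    // Single-phase API: no temporary-storage query pass; the environment is
    // passed directly alongside the problem description.
    cub::DeviceReduce::Sum(d_in, d_out, n, env);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

Because the level is chosen at runtime through the environment, an application can, for example, default to run_to_run and switch to not_guaranteed behind a performance flag.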

Key Action Items:

  • Update to CCCL 3.1 to access determinism controls
  • Choose determinism level based on performance vs. reproducibility trade-offs
  • Use the single-phase API rather than the two-phase API, which doesn't support execution environments
  • Consider gpu_to_gpu mode only when cross-GPU reproducibility is critical due to performance impact