Note on Reproducibility Across Different GPU Nodes
(related to PR Make training deterministic #101 (comment) and issue Runs should be reproducable #12)
Over the past few days, I noticed slight inconsistencies between training runs, even with the deterministic settings from the above PR enabled. Initially, I suspected these mismatches were caused by subsequent changes I had made in my branch of the graph repository. However, after a thorough (and fairly time-consuming) investigation, I found that the discrepancies were not related to any code differences.
Instead, the source of the inconsistency turned out to be the GPU architecture of the underlying node.
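For context, the determinism setup I had enabled looks roughly like the following (a minimal sketch of the usual PyTorch flags, not a verbatim copy of the PR's code):

```python
import os
import random

import numpy as np
import torch


def set_deterministic(seed: int = 42) -> None:
    """Seed all RNGs and force deterministic kernel selection."""
    # Must be set before the first CUDA call for deterministic cuBLAS GEMMs.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```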
I ran the exact same training configuration across three different HPC nodes:
- `hpc3-52` (A100)
- `hpc3-53` (A100)
- `hpc3-54` (H100)
Observations:
- The training runs on `hpc3-52` and `hpc3-53` produced identical results, down to the last decimal.
- The run on `hpc3-54`, however, diverged slightly in both metrics and intermediate outputs.
You can view the runs here:
- A100 node (`hpc3-52`): https://wandb.ai/chebai/chebai/runs/kpjpkvn3/overview
- A100 node (`hpc3-53`): https://wandb.ai/chebai/chebai/runs/9t5oecif/overview
- H100 node (`hpc3-54`): https://wandb.ai/chebai/chebai/runs/wg1c0k8z/overview
Hypothesis:
This divergence is very likely due to hardware-level differences between GPU architectures (A100 vs. H100). Despite enabling PyTorch's deterministic training flags, low-level operations such as matrix multiplications, convolutions, and fused kernels may still behave slightly differently across GPU generations, especially on newer hardware like the H100, where default precision modes (e.g., TensorFloat-32) or kernel fusions may differ.
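One experiment that could help narrow this down (untested on my side, so treat it as a sketch): explicitly disable TF32, pin the fp32 matmul precision, and log the exact hardware alongside each run so results can be grouped by GPU generation afterwards. Even with TF32 off, kernel selection may still differ across architectures, so this may not fully close the gap:

```python
import torch

# TF32 is only available on Ampere-class GPUs and newer (A100, H100);
# disabling it forces full-precision fp32 matmuls and convolutions.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
torch.set_float32_matmul_precision("highest")  # PyTorch >= 1.12

# Log the hardware/software stack alongside each run.
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
    print("CUDA:", torch.version.cuda, "| cuDNN:", torch.backends.cudnn.version())
```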
Proposal:
@sfluegel05
I suggest we add a dedicated note on this in the README or the GitHub Wiki under a section like "Reproducibility Caveats".
This will:
- Warn users that results may differ slightly between GPU types (even when using the same training seed and config).
- Help prevent confusion for future users attempting to replicate experiments across heterogeneous hardware environments.
- Encourage running critical ablation studies or baselines on consistent hardware to ensure fair comparison.
Let me know what you think. I'm happy to help write this section if we decide to include it.
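For instance, a rough first draft (wording entirely up for discussion):

```markdown
## Reproducibility Caveats

Even with a fixed seed and PyTorch's deterministic flags enabled, results may
differ slightly between GPU architectures (e.g., A100 vs. H100): low-level
kernels and default precision modes (such as TensorFloat-32) are not
guaranteed to be bit-identical across hardware generations. For exact
reproducibility, and for fair comparisons in ablation studies or baselines,
run all related experiments on the same GPU model.
```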