Note on Reproducibility Across Different GPU Nodes
(related to PR Make training deterministic #101 (comment) and issue Runs should be reproducable #12)
Over the past few days, I noticed slight inconsistencies between training runs, even with the deterministic settings from the above PR enabled. Initially, I suspected these mismatches were caused by subsequent changes I had made in my branch of the graph repository. However, after a thorough (and fairly time-consuming) investigation, I found that the discrepancies were not related to any code differences.
Instead, the source of the inconsistency turned out to be the GPU architecture of the underlying node.
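For context, the determinism setup I had enabled looks roughly like the following (a minimal sketch of the usual PyTorch flags, not a verbatim copy of the PR's code):

```python
import os
import random

import numpy as np
import torch


def set_deterministic(seed: int = 42) -> None:
    """Seed all RNGs and force deterministic kernel selection."""
    # Must be set before the first CUDA call for deterministic cuBLAS GEMMs.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```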
I ran the exact same training configuration across three different HPC nodes:
- `hpc3-52` (A100)
- `hpc3-53` (A100)
- `hpc3-54` (H100)
Observations:
- The training runs on `hpc3-52` and `hpc3-53` produced identical results, down to the last decimal.
- The run on `hpc3-54`, however, diverged slightly in both metrics and intermediate outputs.
You can view the runs here:
- A100 node (`hpc3-52`): https://wandb.ai/chebai/chebai/runs/kpjpkvn3/overview
- A100 node (`hpc3-53`): https://wandb.ai/chebai/chebai/runs/9t5oecif/overview
- H100 node (`hpc3-54`): https://wandb.ai/chebai/chebai/runs/wg1c0k8z/overview
Hypothesis:
This divergence is very likely due to hardware-level differences between GPU architectures (A100 vs. H100). Despite enabling PyTorch's deterministic training flags, low-level operations such as matrix multiplications, convolutions, and fused kernels may still behave slightly differently across GPU generations, especially on newer hardware like the H100, where default precision modes (e.g., TensorFloat-32) or kernel fusions may differ.
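One experiment that could help narrow this down (untested on my side, so treat it as a sketch): explicitly disable TF32, pin the fp32 matmul precision, and log the exact hardware alongside each run so results can be grouped by GPU generation afterwards. Even with TF32 off, kernel selection may still differ across architectures, so this may not fully close the gap:

```python
import torch

# TF32 is only available on Ampere-class GPUs and newer (A100, H100);
# disabling it forces full-precision fp32 matmuls and convolutions.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False
torch.set_float32_matmul_precision("highest")  # PyTorch >= 1.12

# Log the hardware/software stack alongside each run.
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))
    print("CUDA:", torch.version.cuda, "| cuDNN:", torch.backends.cudnn.version())
```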
Proposal:
@sfluegel05
I suggest we add a dedicated note on this in the README or the GitHub Wiki under a section like "Reproducibility Caveats".
This will:
- Warn users that results may differ slightly between GPU types (even when using the same training seed and config).
- Help prevent confusion for future users attempting to replicate experiments across heterogeneous hardware environments.
- Encourage running critical ablation studies or baselines on consistent hardware to ensure fair comparison.
Let me know what you think. I'm happy to help write this section if we decide to include it.
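For instance, a rough first draft (wording entirely up for discussion):

```markdown
## Reproducibility Caveats

Even with a fixed seed and PyTorch's deterministic flags enabled, results may
differ slightly between GPU architectures (e.g., A100 vs. H100): low-level
kernels and default precision modes (such as TensorFloat-32) are not
guaranteed to be bit-identical across hardware generations. For exact
reproducibility, and for fair comparisons in ablation studies or baselines,
run all related experiments on the same GPU model.
```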