
Note on Reproducibility Across Different GPU Nodes #111

@aditya0by0


Over the past few days, I noticed slight inconsistencies between training runs, even with the deterministic settings introduced in the above PR enabled. Initially, I suspected the mismatches came from later changes I made in my branch of the graph repository. After a thorough investigation, however, I found that the discrepancies were not related to any code differences.

Instead, the source of the inconsistency turned out to be the underlying GPU node architecture.

I ran the exact same training configuration across three different HPC nodes:

  • hpc3-52 (A100)
  • hpc3-53 (A100)
  • hpc3-54 (H100)

Observations:

  • The training runs on hpc3-52 and hpc3-53 produced identical results, down to the last decimal.
  • The run on hpc3-54, however, diverged slightly in both metrics and intermediate outputs.
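
A comparison like the one above can be made explicit by checking each logged metric for bit-identical equality first and for closeness within a tolerance second. The following is a minimal sketch, not the tooling used for these runs: the function name `compare_runs`, the metric names, the tolerance, and all numbers are hypothetical and chosen only for illustration.

```python
import math


def compare_runs(run_a, run_b, rel_tol=1e-6):
    """Classify each metric shared by two runs.

    'identical' means the floats are exactly equal,
    'close' means they agree within rel_tol,
    'diverged' means they differ by more than rel_tol.
    """
    report = {}
    for name in sorted(run_a.keys() & run_b.keys()):
        a, b = run_a[name], run_b[name]
        if a == b:
            report[name] = "identical"
        elif math.isclose(a, b, rel_tol=rel_tol):
            report[name] = "close"
        else:
            report[name] = "diverged"
    return report


# Made-up numbers for illustration only, NOT the values from the actual runs.
a100_node_1 = {"val_loss": 0.123456, "val_f1": 0.654321}
a100_node_2 = {"val_loss": 0.123456, "val_f1": 0.654321}
h100_node = {"val_loss": 0.123461, "val_f1": 0.654290}

print(compare_runs(a100_node_1, a100_node_2))
print(compare_runs(a100_node_1, h100_node))
```

Reporting "close" separately from "diverged" helps distinguish small numerical drift from a run that has genuinely gone a different way; which tolerance is appropriate depends on the metric and is an open choice here.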

You can view the runs here:

Hypothesis:
This divergence is very likely due to hardware-level differences in GPU architecture (A100 vs. H100). PyTorch's deterministic training flags are meant to make results repeatable on a given device, but they do not guarantee identical results across different GPU generations. Low-level operations such as matrix multiplications, convolutions, and fused kernels may still behave slightly differently, for example because each architecture may select different kernel implementations, accumulate floating-point sums in a different order, or handle reduced-precision modes (e.g., TensorFloat-32) differently.
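
One mechanism that could produce this kind of divergence is that floating-point addition is not associative, so two kernels that sum the same numbers in a different order can return different results. The sketch below shows the effect in plain Python, with no GPU involved; it illustrates the general principle only and is not a diagnosis of what the A100 and H100 kernels actually do.

```python
import math

# Four numbers whose exact sum is 2.0.
values = [1e16, 1.0, -1e16, 1.0]

# Strict left-to-right accumulation: the first 1.0 is lost when it is
# added to 1e16, because 1e16 + 1 is not representable as a double.
left_to_right = ((values[0] + values[1]) + values[2]) + values[3]

# Same numbers, different grouping: the large terms cancel first.
regrouped = (values[0] + values[2]) + (values[1] + values[3])

print(left_to_right)      # 1.0
print(regrouped)          # 2.0
print(math.fsum(values))  # 2.0 (exactly rounded reference sum)
```

In a neural network the per-operation differences are far smaller than in this contrived case, but they can compound over many training steps.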


Proposal:
@sfluegel05
I suggest we add a dedicated note on this in the README or the GitHub Wiki under a section like "Reproducibility Caveats".

This will:

  • Warn users that results may differ slightly between GPU types (even when using the same training seed and config).
  • Help prevent confusion for future users attempting to replicate experiments across heterogeneous hardware environments.
  • Encourage running critical ablation studies or baselines on consistent hardware to ensure fair comparison.
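
To make such a caveat actionable, each run could store a short hardware description next to its metrics, so that comparisons across GPU types are easy to spot later. The following is a rough sketch under stated assumptions: the helper names `describe_hardware` and `same_gpu` are hypothetical and not part of this repository, and the `torch` import is treated as optional so the snippet also runs on machines without PyTorch.

```python
import json
import platform

try:
    import torch  # optional: only needed to query the GPU
except ImportError:
    torch = None


def describe_hardware():
    """Collect a small, JSON-serializable description of the current node."""
    info = {"hostname": platform.node(), "gpu": None, "torch": None, "cuda": None}
    if torch is not None:
        info["torch"] = torch.__version__
        info["cuda"] = torch.version.cuda  # None on CPU-only builds
        if torch.cuda.is_available():
            info["gpu"] = torch.cuda.get_device_name(0)
    return info


def same_gpu(info_a, info_b):
    """Return True only if both runs recorded the same, known GPU model."""
    return info_a["gpu"] is not None and info_a["gpu"] == info_b["gpu"]


# Example: write this next to the run's metrics or config.
print(json.dumps(describe_hardware(), indent=2))
```

Treating an unknown GPU as "not the same" is a deliberate choice in this sketch: it errs on the side of flagging a comparison as potentially unfair rather than silently accepting it.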

Let me know what you think. I'm happy to help write this section if we decide to include it.
