
Replace short training regression tests with weekly and manual long training runs #705

@PicoCentauri

Description


We currently maintain a set of short regression tests (1–2 epochs) for our architectures to verify model training behavior. These tests are intended to catch unintended changes in training dynamics, e.g., when doing code cleanups or speed improvements. However, due to the inherent non-determinism in training — especially under distributed settings — even fixed seeds cannot fully eliminate variability. As a result, these tests are brittle and prone to spurious failures. For example, @jwa7 ran into this issue while working on fixes for the composition model.

In today’s ML dev meeting, we discussed replacing these short regression tests with longer, more representative training runs. Now that we have access to the CSCS infrastructure thanks to @RMeli, we could feasibly run such longer regression tests.

The idea is for architecture developers to define longer training runs (potentially taking several hours) that are not part of the standard CI but are triggered:

  • Manually, e.g., from a PR when needed
  • Periodically, e.g., through a weekly scheduled job

This setup would give us more meaningful regression signals while avoiding false positives during regular development; a rough sketch of how such a test could look is below. Once these longer tests are deployed, we could remove the current unstable short tests.
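To make this concrete, here is a minimal sketch of what one such long test could look like, assuming we keep them in the pytest suite behind a custom marker. Everything in the sketch is hypothetical and up for discussion: the `long_regression` marker name, the `run_long_training` helper, the reference value, and the tolerance band are placeholders, not an agreed interface.

```python
import pytest

# Placeholders: each architecture developer would calibrate the reference
# metric and tolerance from a few independent long runs, so that normal
# training noise (seeds, distributed non-determinism) stays inside the band
# while real changes in training dynamics fall outside it.
REFERENCE_VAL_LOSS = 0.0123
RELATIVE_TOLERANCE = 0.05  # e.g. a 5 % band instead of bit-for-bit reproducibility


def run_long_training(output_dir):
    """Placeholder for the architecture's full multi-hour training run.

    In the real test this would launch the trainer with a long-run options
    file and return the final validation loss (or another agreed metric).
    """
    raise NotImplementedError


@pytest.mark.long_regression  # custom marker, registered in the pytest config
def test_long_training_run(tmp_path):
    val_loss = run_long_training(output_dir=tmp_path)
    assert val_loss == pytest.approx(REFERENCE_VAL_LOSS, rel=RELATIVE_TOLERANCE)
```

The standard CI would then deselect these tests with `pytest -m "not long_regression"`, while the long jobs on the CSCS runners would select them with `pytest -m long_regression`. On the GitHub Actions side, both proposed triggers exist out of the box: `workflow_dispatch` for the manual case and a cron-based `schedule` for the weekly run.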

What do you think?

Labels

Discussion (Issues to be discussed by the contributors) · Enhancement (Idea or improvement) · Priority: Medium (Important issues to address after high priority)
