We currently maintain a set of short regression tests (1–2 epochs) for our architectures to verify model training behavior. These tests are intended to catch unintended changes in training dynamics, e.g. during code cleanups or speed improvements. However, due to the inherent non-determinism in training, especially under distributed settings, even fixed seeds cannot fully eliminate variability. As a result, these tests are brittle and prone to spurious failures. For example, @jwa7 ran into this issue while working on fixes for the composition model.
In today’s ML dev meeting, we discussed replacing these short regression tests with longer, more representative training runs. Now that we have access to the CSCS infrastructure thanks to @RMeli, we could feasibly run such longer regression tests.
The idea is for architecture developers to define longer training runs (potentially taking several hours) that are not part of the standard CI but are triggered in one of two ways (see the sketch after this list):
- Manually, e.g., from a PR when needed
- Periodically, e.g., through a weekly scheduled job
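
A minimal sketch of what such a trigger setup could look like, assuming we stick with GitHub Actions; the workflow name, runner labels, timeout, and test script path are placeholders, and running on CSCS would presumably require a self-hosted runner registered there:

```yaml
# .github/workflows/long-regression.yml (hypothetical file name)
name: long-regression-tests

on:
  workflow_dispatch:        # manual trigger, e.g. started from the Actions tab when a PR needs it
  schedule:
    - cron: "0 2 * * 1"     # weekly scheduled run, Mondays at 02:00 UTC

jobs:
  regression:
    # assumes a self-hosted runner is available on the CSCS side
    runs-on: [self-hosted, cscs]
    timeout-minutes: 720    # allow multi-hour training runs
    steps:
      - uses: actions/checkout@v4
      - name: Run long regression tests
        # placeholder script; each architecture would define its own long run here
        run: bash tests/regression/run_long_regression.sh
```

Whether this lives in a GitHub Actions workflow or is driven from the CSCS side is open for discussion; the sketch just illustrates the two trigger paths (manual and weekly).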
This setup would give us more meaningful regression signals while avoiding false positives during regular development. If we deployed these tests, we could remove the current, unstable ones.
What do you think?