We currently maintain a set of short regression tests (1–2 epochs) for our architectures to verify model training behavior. These tests are intended to catch unintended changes in training dynamics, e.g. during code cleanups or speed improvements. However, due to the inherent non-determinism in training, especially under distributed settings, even fixed seeds cannot fully eliminate variability. As a result, these tests are brittle and prone to spurious failures. For example, @jwa7 ran into this issue while working on fixes for the composition model.
In today’s ML dev meeting, we discussed replacing these short regression tests with longer, more representative training runs. Now that we have access to the CSCS infrastructure thanks to @RMeli, we could feasibly run such longer regression tests.
The idea is for architecture developers to define longer training runs (potentially taking several hours) that are not part of the standard CI but are triggered in one of two ways (see the sketch after this list):
- Manually, e.g., from a PR when needed
- Periodically, e.g., through a weekly scheduled job
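
A minimal sketch of what such a trigger setup could look like, assuming we stick with GitHub Actions; the workflow name, runner labels, timeout, and test script path are placeholders, and running on CSCS would presumably require a self-hosted runner registered there:

```yaml
# .github/workflows/long-regression.yml (hypothetical file name)
name: long-regression-tests

on:
  workflow_dispatch:        # manual trigger, e.g. started from the Actions tab when a PR needs it
  schedule:
    - cron: "0 2 * * 1"     # weekly scheduled run, Mondays at 02:00 UTC

jobs:
  regression:
    # assumes a self-hosted runner is available on the CSCS side
    runs-on: [self-hosted, cscs]
    timeout-minutes: 720    # allow multi-hour training runs
    steps:
      - uses: actions/checkout@v4
      - name: Run long regression tests
        # placeholder script; each architecture would define its own long run here
        run: bash tests/regression/run_long_regression.sh
```

Whether this lives in a GitHub Actions workflow or is driven from the CSCS side is open for discussion; the sketch just illustrates the two trigger paths (manual and weekly).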
This setup would give us more meaningful regression signals while avoiding false positives during regular development. If we deployed these tests, we could remove the current, unstable ones.
What do you think?