
[tests] Run trainings lazily. #1015

Open

pfebrer wants to merge 8 commits into main from tests_lazytraining

Conversation

@pfebrer (Contributor) commented Jan 21, 2026

Motivation

Whenever I wanted to run an isolated test with tox -e tests, the generate-outputs.sh script was executed first, so I had to wait for all the training runs to finish just to run a test that doesn't need them.

Workarounds until now

  • Run pytest directly, which is fine, but I think it is better if we can run the tests with the true testing environment.
  • Comment out the line that runs the trainings in tox.ini, which is super annoying to do each time and to remember to undo when pushing changes.

Implementation in this PR

The model paths are turned into fixtures, so they are evaluated only when a test needs them. At that point, we run the training for that specific model (see the sketch below).

In this way, trainings are run only if/when needed.
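
As a minimal sketch of the idea, assuming one fixture per model (the fixture name, resources path, and default model.pt output are hypothetical placeholders, not the exact names used in this PR):

```python
import subprocess
from pathlib import Path

import pytest

# Hypothetical location of the options files used by the training runs.
RESOURCES = Path(__file__).parent / "resources"


@pytest.fixture(scope="session")
def soap_bpnn_model_path(tmp_path_factory):
    """Train one test model, but only when a test actually requests it."""
    train_dir = tmp_path_factory.mktemp("soap_bpnn_training")
    subprocess.run(
        ["mtt", "train", str(RESOURCES / "options.yaml")],
        cwd=train_dir,
        check=True,
    )
    # Assuming the trained model is written as model.pt in the working
    # directory, return its path for the requesting test.
    return train_dir / "model.pt"
```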

Complications

We support running tests in parallel, so we have to make sure that if two workers request the same fixture they don't both run the training at the same time. I used a simple lockfile so that the other workers wait for the worker that is doing the training.
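
For illustration, this is the standard pytest-xdist pattern for such a train-once, session-scoped fixture, here using the filelock package; the run_training helper is hypothetical, and the PR may implement the lockfile differently:

```python
from pathlib import Path

import pytest
from filelock import FileLock  # third-party lockfile helper


def run_training(output_dir: Path) -> None:
    """Hypothetical helper that runs `mtt train` and writes model.pt."""
    ...


@pytest.fixture(scope="session")
def model_path(tmp_path_factory, worker_id):
    # `worker_id` is provided by pytest-xdist ("master" when not distributed).
    # The parent of each worker's base temp dir is shared by all workers.
    shared_dir = tmp_path_factory.getbasetemp().parent
    model = shared_dir / "model.pt"

    if worker_id == "master":
        # No parallelism: just train if the model is not there yet.
        if not model.exists():
            run_training(shared_dir)
        return model

    # With pytest-xdist: the first worker to grab the lock runs the
    # training; the others block on the lockfile and then reuse the result.
    with FileLock(str(model) + ".lock"):
        if not model.exists():
            run_training(shared_dir)
    return model
```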

Nice side effects

Since trainings are run per model, in principle different trainings could run in parallel. In practice, however, this often won't happen: different workers will ask for the same training, and all but one of them will just sit idle. Smarter splitting of the tests across workers would ensure that trainings actually run in parallel, but I don't know if that is possible (I haven't looked into it).


📚 Documentation preview 📚: https://metatrain--1015.org.readthedocs.build/en/1015/

@pfebrer (Contributor, Author) commented Jan 21, 2026

There are still some tests that are not using the fixtures, which is why they are failing; I will fix that later.

@pfebrer pfebrer force-pushed the tests_lazytraining branch from 1ae874b to d0e20fb on January 21, 2026 at 18:11
@Luthaf (Member) left a comment

This looks good to me! I like using upper case names for global fixtures.

@Luthaf (Member) commented Jan 22, 2026

cscs-ci run

@pfebrer (Contributor, Author) commented Jan 22, 2026

Ok, I tried to run the tests locally with a GPU and they pass, so I don't know exactly what is going on here; I will try to investigate. The training script fails when running mtt train. I will try to make it print the output of mtt train, which is something I wanted to do anyway, because otherwise I have made the failures more opaque 😅
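
As a sketch of one way to do that (not necessarily what this PR ends up doing), the training helper could capture the subprocess output and re-emit it so pytest shows it on failure:

```python
import subprocess


def run_mtt_train(options_file, cwd):
    # Capture stdout/stderr so pytest can display them when the training
    # fails, instead of swallowing the output of `mtt train`.
    result = subprocess.run(
        ["mtt", "train", str(options_file)],
        cwd=cwd,
        capture_output=True,
        text=True,
    )
    print(result.stdout)
    print(result.stderr)
    result.check_returncode()  # raises CalledProcessError if training failed
```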

@Luthaf (Member) commented Feb 5, 2026

cscs-ci run

@PicoCentauri (Contributor) commented:

> This looks good to me! I like using upper case names for global fixtures.

We could write this down somewhere, both for the LLM and for us to remember.

@Luthaf (Member) commented Feb 6, 2026

CI failure seems relevant!

@pfebrer (Contributor, Author) commented Feb 6, 2026

Yes yes, I have to understand what is going on 😅
