Reproducibility Caveats
To guarantee that our experiments and results can be reproduced reliably, we take several measures. Exact bit-level reproducibility depends on multiple factors, each of which is addressed in our pipeline:
Reproducibility begins with ensuring that all experiments use the same ChEBI data version. Use:
--data.chebi_version=<version_number>
This explicitly fixes the dataset version and prevents discrepancies caused by upstream changes.
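As an illustration only (the class name `ChebiDataModule` and the Lightning 2.x import path are assumptions, not the actual chebai code), LightningCLI derives such a flag from a constructor argument on the data module:

```python
import lightning.pytorch as pl

class ChebiDataModule(pl.LightningDataModule):  # hypothetical class name, for illustration only
    def __init__(self, chebi_version: int):
        super().__init__()
        # LightningCLI exposes this constructor argument as --data.chebi_version,
        # pinning the ChEBI release used for preprocessing.
        self.chebi_version = chebi_version
```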
To reproduce identical train/validation/test splits, we fix the split seed at 42 by default. Override via:
--data.dynamic_data_split_seed=<seed>
This ensures consistent sampling across different runs or environments.
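A minimal, generic PyTorch sketch of how a fixed seed yields identical splits across runs (not the chebai splitting logic itself, which applies the seed to its own dynamic split):

```python
import torch
from torch.utils.data import random_split

dataset = list(range(1000))  # placeholder stand-in for the real dataset
generator = torch.Generator().manual_seed(42)  # the default split seed

# The same seed always yields the same train/val/test partition
# (fractional lengths require torch >= 1.13).
train_set, val_set, test_set = random_split(dataset, [0.8, 0.1, 0.1], generator=generator)
```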
Model initialization randomness is controlled by PyTorch Lightning’s seed_everything, which defaults to 0:
--seed_everything=<seed>
This guarantees that model parameters start from the same initial values for every run.
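Outside the CLI, the same effect can be obtained programmatically with the standard Lightning API (Lightning 2.x import path assumed):

```python
from lightning.pytorch import seed_everything

# Seeds Python's `random`, NumPy and torch (CPU and CUDA) in a single call;
# workers=True additionally seeds dataloader worker processes.
seed_everything(0, workers=True)
```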
Achieving true reproducibility also depends on making the underlying computational operations deterministic. For reference, see:
- Randomness flags: https://gist.github.com/ihoromi4/b681a9088f348942b01711f251e5f964
- PyTorch Randomness Documentation: https://pytorch.org/docs/stable/notes/randomness.html
Lightning wraps these low-level settings in a single Trainer argument:
--trainer.deterministic=true
We already enable this by default (see PR #101: https://github.com/ChEB-AI/python-chebai/pull/101).
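For reference, a sketch of what this flag roughly maps to at the PyTorch level (the exact set of settings depends on the Lightning version; see the links above):

```python
import os
import torch
from lightning.pytorch import Trainer

# The Lightning flag (enabled by default, see PR #101):
trainer = Trainer(deterministic=True)

# Roughly equivalent low-level PyTorch settings:
torch.use_deterministic_algorithms(True)
torch.backends.cudnn.benchmark = False
# Some CUDA/cuBLAS kernels additionally require this environment variable,
# set before CUDA is initialized:
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```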
Even with deterministic settings, results may differ if different numerical precisions are used (e.g., float16 vs. float32). We use 32-bit precision (float32), which is Lightning’s default. See the Lightning precision docs: https://lightning.ai/docs/pytorch/stable/common/precision_basic.html
To change precision:
--trainer.precision=16-mixed
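Equivalently, in Python (Lightning 2.x precision strings; the exact accepted values depend on the Lightning version):

```python
from lightning.pytorch import Trainer

# Default: full 32-bit floating point
trainer = Trainer(precision="32-true")

# Mixed 16-bit precision: faster and lighter on memory, but results can
# deviate numerically from float32 runs.
trainer = Trainer(precision="16-mixed")
```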
Even after controlling all other factors, exact numerical equality may still not be achievable if experiments run on different GPU types or hardware architectures. See the discussion: https://github.com/ChEB-AI/python-chebai/issues/111
Certain other factors must also be kept constant between any two experiments one wants to compare. These relate to the model, data, and training configuration. An abstract list of such factors:
- Model Hyperparameters
- Data Hyperparameters
- Training Hyperparameters
- Vocabulary: the number of tokens / vocabulary size and the token index assignments must remain unchanged.