
Reproducibility Caveats


To make our experiments and results reliably reproducible, we take several measures. Exact bit-level reproducibility depends on multiple factors, each of which is addressed in our pipeline:

1. Data Version

Reproducibility begins with ensuring that all experiments use the same ChEBI data version. Use:

--data.chebi_version=<version_number>

This explicitly fixes the dataset version and prevents discrepancies caused by upstream changes.

2. Data Splits

To reproduce identical train/validation/test splits, we fix the split seed at 42 by default. Override via:

--data.dynamic_data_split_seed=<seed>

This ensures consistent sampling across different runs or environments.
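For reference, steps 1 and 2 can also be fixed programmatically when constructing the data module directly instead of via the CLI. This is a minimal sketch only: the import path and class name are assumptions for illustration, and the version number is just an example value; the keyword arguments mirror the two flags above.

```python
# Illustrative sketch only: pin the ChEBI release and the split seed in code.
# The import path/class name are assumptions; use the data module from your
# own experiment config.
from chebai.preprocessing.datasets.chebi import ChEBIOver100  # assumed class

data_module = ChEBIOver100(
    chebi_version=231,           # same effect as --data.chebi_version=231
    dynamic_data_split_seed=42,  # same effect as --data.dynamic_data_split_seed=42
)
```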

3. Model Weight Initialization

Model initialization randomness is controlled by PyTorch Lightning’s seed_everything, which defaults to 0:

--seed_everything=<seed>

This guarantees that model parameters start from the same initial values for every run.
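When launching training from a Python script rather than the CLI, the same seed can be set with Lightning's seed_everything. A minimal sketch, assuming PyTorch Lightning 2.x (imported as lightning.pytorch):

```python
# Minimal sketch: seed Python, NumPy and PyTorch RNGs before building the model,
# matching --seed_everything=0 on the CLI.
from lightning.pytorch import seed_everything

seed_everything(0, workers=True)  # workers=True also seeds DataLoader worker processes
```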

4. Deterministic Model Training

Achieving true reproducibility also depends on making the underlying computational operations deterministic (see PyTorch’s notes on reproducibility for the low-level details).

Lightning wraps these low-level settings under a single argument:

Trainer(deterministic=True)

We already enable this by default (see PR #101: https://github.com/ChEB-AI/python-chebai/pull/101).
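For script-based runs, the same setting is passed to the Trainer constructor. A minimal sketch, assuming PyTorch Lightning 2.x; the commented torch call shows roughly the low-level PyTorch switch this wraps:

```python
# Minimal sketch: request deterministic algorithms for the whole run
# (the CLI equivalent is enabling the trainer's `deterministic` option).
from lightning.pytorch import Trainer

trainer = Trainer(deterministic=True)

# Roughly the underlying PyTorch switch, for reference:
# torch.use_deterministic_algorithms(True)
```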

5. Numerical Precision

Even with deterministic settings, results may differ if different numerical precisions are used (e.g., float16 vs. float32). We use 32-bit precision (float32), which is Lightning’s default. See the Lightning precision docs: https://lightning.ai/docs/pytorch/stable/common/precision_basic.html

To change precision:

--trainer.precision=16-mixed
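Programmatically, precision is also a Trainer argument. A minimal sketch, assuming PyTorch Lightning 2.x precision identifiers:

```python
# Minimal sketch: keep full 32-bit precision for reproducible comparisons
# (Lightning's default), or opt into mixed precision explicitly.
from lightning.pytorch import Trainer

trainer = Trainer(precision="32-true")     # default full precision
# trainer = Trainer(precision="16-mixed")  # faster, but results may drift
```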

6. Hardware Differences

Even after controlling all other factors, bit-for-bit identical results may still not be achievable if experiments run on different GPU models or hardware architectures. See discussion: https://github.com/ChEB-AI/python-chebai/issues/111

7. Other Factors

There are certain other factors that need to be kept constant between any two experiments one wants to compare. These relate to the model, the data configuration, and the training configuration. An abstract list of such factors:

  • Model Hyperparameters
  • Data Hyperparameters
  • Training Hyperparameters
  • Vocabulary: the number of tokens, the vocabulary size, and the token-to-index mapping must not change between runs.
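A simple way to check that two runs agree on these factors is to diff their resolved configuration files. A minimal sketch, assuming both runs saved a Lightning CLI config.yaml (the file paths below are placeholders):

```python
# Minimal sketch (not part of python-chebai): report top-level config keys that
# differ between two runs. Assumes each run saved its resolved config.yaml.
import yaml

with open("logs/run_a/config.yaml") as fa, open("logs/run_b/config.yaml") as fb:
    cfg_a, cfg_b = yaml.safe_load(fa), yaml.safe_load(fb)

for key in sorted(set(cfg_a) | set(cfg_b)):
    if cfg_a.get(key) != cfg_b.get(key):
        print(f"Config mismatch in '{key}'")
```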
