
[EXPERIMENT]: Validate checkpoint reproducibility across identical training runs #62

@ryoari

Description


Context

During a recent discussion on Discord, a question was raised about whether
checkpoint hashing alone is sufficient for verifying the training process.

Before building more complex verification layers, it would be useful to
validate a basic assumption:

If two identical training runs are executed with the same dataset,
configuration, and seed, do they produce identical model checkpoints?

Proposed Experiment

Create a small reproducible experiment that:

  1. Uses a tiny dataset (e.g., small Wikipedia subset or synthetic data)
  2. Runs a deterministic training loop twice
  3. Saves checkpoints from both runs
  4. Computes SHA-256 hashes of both checkpoints
  5. Verifies whether the hashes match
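The steps above can be sketched end to end in plain Python. This is a minimal illustration, not the actual experiment: the `train` function below is a hypothetical stand-in (a seeded pseudo-random walk over a small weight vector) for a real deterministic training loop, and `pickle` stands in for whatever checkpoint format the real script uses.

```python
import hashlib
import pickle
import random


def train(seed: int, steps: int = 100) -> list:
    """Stand-in for a deterministic training loop: a seeded
    pseudo-random walk over a small weight vector."""
    rng = random.Random(seed)
    weights = [0.0] * 8
    for _ in range(steps):
        i = rng.randrange(len(weights))
        weights[i] += rng.uniform(-0.01, 0.01)  # mock "gradient update"
    return weights


def checkpoint_hash(weights: list) -> str:
    """Serialize the checkpoint and return its SHA-256 digest."""
    return hashlib.sha256(pickle.dumps(weights)).hexdigest()


# Two identical runs: same seed, same configuration.
run_a = checkpoint_hash(train(seed=42))
run_b = checkpoint_hash(train(seed=42))
print(run_a == run_b)  # True: identical runs yield identical hashes
```

In a real setting the interesting question is whether the same property survives GPU kernels, data-loader workers, and framework internals, which is exactly what the experiment would probe.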

Expected Outcome

If the hashes match, checkpoint hashing may be sufficient for verifying
training determinism under controlled conditions.

If the hashes differ, there are hidden sources of entropy in the
training pipeline (e.g., nondeterministic GPU kernels, unseeded data
shuffling, or metadata such as timestamps embedded in the checkpoint file).
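As a toy illustration of how a single hidden entropy source breaks hash equality, consider an otherwise seeded loop that makes one call to the global, unseeded `random` module (the function names here are hypothetical):

```python
import hashlib
import pickle
import random


def leaky_train(seed: int) -> list:
    """Seeded loop with one hidden entropy source: a draw from the
    global, unseeded RNG (analogous to, say, unseeded dropout)."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(8)]
    weights[0] += random.uniform(-1e-9, 1e-9)  # hidden entropy leaks in here
    return weights


h1 = hashlib.sha256(pickle.dumps(leaky_train(0))).hexdigest()
h2 = hashlib.sha256(pickle.dumps(leaky_train(0))).hexdigest()
print(h1 == h2)  # False: the tiny unseeded perturbation changes the hash
```

Note that the perturbation is far below any tolerance a loss-based comparison would notice, yet the cryptographic hash flags it immediately. That sensitivity is both the strength and the brittleness of hash-based verification.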

Implementation

The experiment would live under:

experiments/checkpoint_reproducibility/

and would include:

  • a minimal deterministic training script
  • checkpoint hashing
  • a simple reproducibility report
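The checkpoint-hashing component could be as simple as streaming the saved file through SHA-256. A minimal sketch (the function name and chunk size are arbitrary choices, not an existing API):

```python
import hashlib
from pathlib import Path


def hash_checkpoint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a checkpoint file through SHA-256 in fixed-size chunks,
    so large checkpoints never need to fit in memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Hashing the raw bytes is deliberate: it catches any difference at all, including serialization metadata, which is part of what the experiment is meant to surface.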

This experiment can serve as the first validation for the
training determinism layer of the project.
