
[EXPERIMENT]: Validate checkpoint reproducibility across identical training runs #62

@ryoari

Description


Context

During a recent discussion on Discord, a question was raised about whether
checkpoint hashing alone is sufficient for verifying the training process.

Before building more complex verification layers, it would be useful to
validate a basic assumption:

If two identical training runs are executed with the same dataset,
configuration, and seed, do they produce identical model checkpoints?

Proposed Experiment

Create a small reproducible experiment that:

  1. Uses a tiny dataset (e.g., small Wikipedia subset or synthetic data)
  2. Runs a deterministic training loop twice
  3. Saves checkpoints from both runs
  4. Computes SHA-256 hashes of both checkpoints
  5. Verifies whether the hashes match
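The steps above can be sketched end to end in plain Python. This is a minimal illustration, not the actual experiment: the `train` function below is a hypothetical stand-in (a seeded pseudo-random walk over a small weight vector) for a real deterministic training loop, and `pickle` stands in for whatever checkpoint format the real script uses.

```python
import hashlib
import pickle
import random


def train(seed: int, steps: int = 100) -> list:
    """Stand-in for a deterministic training loop: a seeded
    pseudo-random walk over a small weight vector."""
    rng = random.Random(seed)
    weights = [0.0] * 8
    for _ in range(steps):
        i = rng.randrange(len(weights))
        weights[i] += rng.uniform(-0.01, 0.01)  # mock "gradient update"
    return weights


def checkpoint_hash(weights: list) -> str:
    """Serialize the checkpoint and return its SHA-256 digest."""
    return hashlib.sha256(pickle.dumps(weights)).hexdigest()


# Two identical runs: same seed, same configuration.
run_a = checkpoint_hash(train(seed=42))
run_b = checkpoint_hash(train(seed=42))
print(run_a == run_b)  # True: identical runs yield identical hashes
```

In a real setting the interesting question is whether the same property survives GPU kernels, data-loader workers, and framework internals, which is exactly what the experiment would probe.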

Expected Outcome

If the hashes match, checkpoint hashing may be sufficient for verifying
training determinism under controlled conditions.

If the hashes differ, there are hidden sources of entropy in the
training pipeline (e.g., nondeterministic GPU kernels, unseeded data
shuffling, or metadata such as timestamps embedded in the checkpoint file).
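As a toy illustration of how a single hidden entropy source breaks hash equality, consider an otherwise seeded loop that makes one call to the global, unseeded `random` module (the function names here are hypothetical):

```python
import hashlib
import pickle
import random


def leaky_train(seed: int) -> list:
    """Seeded loop with one hidden entropy source: a draw from the
    global, unseeded RNG (analogous to, say, unseeded dropout)."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(8)]
    weights[0] += random.uniform(-1e-9, 1e-9)  # hidden entropy leaks in here
    return weights


h1 = hashlib.sha256(pickle.dumps(leaky_train(0))).hexdigest()
h2 = hashlib.sha256(pickle.dumps(leaky_train(0))).hexdigest()
print(h1 == h2)  # False: the tiny unseeded perturbation changes the hash
```

Note that the perturbation is far below any tolerance a loss-based comparison would notice, yet the cryptographic hash flags it immediately. That sensitivity is both the strength and the brittleness of hash-based verification.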

Implementation

The experiment would live under:

experiments/checkpoint_reproducibility/

and would include:

  • a minimal deterministic training script
  • checkpoint hashing
  • a simple reproducibility report
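The checkpoint-hashing component could be as simple as streaming the saved file through SHA-256. A minimal sketch (the function name and chunk size are arbitrary choices, not an existing API):

```python
import hashlib
from pathlib import Path


def hash_checkpoint(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a checkpoint file through SHA-256 in fixed-size chunks,
    so large checkpoints never need to fit in memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()
```

Hashing the raw bytes is deliberate: it catches any difference at all, including serialization metadata, which is part of what the experiment is meant to surface.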

This experiment can serve as the first validation for the
training determinism layer of the project.
