
fix(reproducibility): add opt-in strict determinism across trainers#61

Open
SoheylM wants to merge 2 commits into main from codex/soh-14-strict-determinism-hardening

Conversation


SoheylM commented Mar 5, 2026

Description

Adds an opt-in strict reproducibility path for all training entrypoints while preserving current default behavior.

  • Introduces engiopt/reproducibility.py with shared helpers:
    • seed_training(seed)
    • enable_strict_determinism(warn_only=True)
    • make_dataloader_generator(seed)
  • Adds strict_determinism: bool = False to all targeted training Args dataclasses.
  • Replaces ad-hoc seeding with seed_training(args.seed) in each training script.
  • Enables strict deterministic controls only when --strict-determinism is passed.
  • Hardens DataLoader reproducibility by supplying seeded generators for shuffle=True loaders.
  • Updates README with optional reproducibility usage (--strict-determinism).
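The helpers above can be sketched as follows. The function names come from the PR description; the bodies are a plausible implementation of the stated behavior, not necessarily the exact code in this branch.

```python
import os
import random

import numpy as np
import torch


def seed_training(seed: int) -> None:
    """Seed every RNG the training scripts touch (Python, NumPy, PyTorch)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)  # seeds CPU and all CUDA devices


def enable_strict_determinism(warn_only: bool = True) -> None:
    """Opt-in strict mode: force deterministic kernels and disable TF32."""
    # cuBLAS requires this env var for deterministic matmuls on CUDA >= 10.2.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    torch.use_deterministic_algorithms(True, warn_only=warn_only)
    torch.backends.cudnn.benchmark = False
    if torch.cuda.is_available():
        torch.backends.cuda.matmul.allow_tf32 = False
        torch.backends.cudnn.allow_tf32 = False


def make_dataloader_generator(seed: int) -> torch.Generator:
    """Seeded generator to pass to shuffle=True DataLoaders."""
    g = torch.Generator()
    g.manual_seed(seed)
    return g
```

Passing the generator to `DataLoader(..., shuffle=True, generator=make_dataloader_generator(args.seed))` makes the shuffle order reproducible across runs.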

Fixes SOH-14 (Linear)

Type of change

  • Documentation only change (no code changed)
  • Bug fix (non-breaking change which fixes an issue)
  • New algorithm (non-breaking change which adds a new model/algorithm)
  • Improvement to existing algorithm (non-breaking change which improves functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Screenshots

N/A

Checklist:

Code Quality

  • I have run the pre-commit checks with pre-commit run --all-files
  • I have run ruff check . and ruff format
  • I have run mypy .
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings

CleanRL Philosophy (for new/modified algorithms)

  • The implementation follows the CleanRL single-file philosophy: all training logic is contained in one file
  • The code is reproducible: random seeds are set, PyTorch determinism is enabled
  • Hyperparameters are configurable via command-line arguments using tyro
  • WandB logging is integrated with --track flag support
  • The model can be saved and restored via WandB artifacts (--save-model flag)

Algorithm Completeness (for new algorithms)

  • Both training script (algorithm.py) and evaluation script (evaluate_algorithm.py) are provided
  • The algorithm works with EngiBench's Problem interface
  • The algorithm is added to the README table with correct metadata

Documentation

  • I have made corresponding changes to the documentation
  • New algorithms include docstrings explaining the approach and any paper references

Validation

  • Determinism smoke check (cgan_cnn_2d) run twice on same machine with strict mode and fixed seed; resulting checkpoint tensor hashes matched for both generator and discriminator.
  • Non-strict short run completed successfully (default behavior path unchanged).
  • Strict mode is configured with warn_only=True, so nondeterministic ops emit a warning and continue rather than raising.
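The smoke check above can be reproduced with a small state-dict hashing helper like the one below. The hashing scheme here is illustrative; the PR does not specify the exact method used to compare checkpoint tensors.

```python
import hashlib

import torch


def state_dict_hash(state_dict: dict) -> str:
    """Deterministic SHA-256 digest over all tensors, iterated in key order."""
    h = hashlib.sha256()
    for key in sorted(state_dict):
        h.update(key.encode())
        h.update(state_dict[key].detach().cpu().numpy().tobytes())
    return h.hexdigest()


# Two identically seeded initializations should hash the same.
torch.manual_seed(0)
net_a = torch.nn.Linear(4, 2)
torch.manual_seed(0)
net_b = torch.nn.Linear(4, 2)
assert state_dict_hash(net_a.state_dict()) == state_dict_hash(net_b.state_dict())
```

The same comparison applied to the generator and discriminator checkpoints of two strict-mode runs is what the validation step above asserts.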


linear bot commented Mar 5, 2026

SoheylM requested a review from g-braeunlich on March 5, 2026 at 09:45

@g-braeunlich left a comment


Just 2 small suggestions

Comment on lines +30 to +31
if torch.cuda.is_available():
    torch.backends.cuda.matmul.allow_tf32 = False


Not exactly the same logic, but maybe safer?

Suggested change
if torch.cuda.is_available():
    torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cuda.matmul.allow_tf32 = not torch.cuda.is_available()

Contributor Author


I kept the current logic intentionally for strict-mode semantics. In strict mode we should avoid enabling TF32; not torch.cuda.is_available() would set this flag to True on CPU/MPS runs. The current behavior only disables CUDA matmul TF32 when CUDA exists, leaving CPU/MPS unaffected. If we simplify, I'd still keep the assigned value False and only guard on backend availability.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up: I agree the cleaner framing is backend-capability based. For strict mode, we should keep TF32 disabled (False) whenever the CUDA matmul backend is present, rather than deriving the value from torch.cuda.is_available() (which can be False on CPU/MPS runs and would imply True). Happy to switch to that form for clarity if you prefer.
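The two variants under discussion can be placed side by side; this is purely illustrative of the thread above, and the final form is still open.

```python
import torch

# Original PR logic: only touch the flag when CUDA exists; CPU/MPS runs
# are left entirely unaffected.
if torch.cuda.is_available():
    torch.backends.cuda.matmul.allow_tf32 = False

# Reviewer's one-liner (commented out): on a CPU/MPS machine this would
# assign True to the flag, which is why it was not adopted for strict mode.
# torch.backends.cuda.matmul.allow_tf32 = not torch.cuda.is_available()
```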
