Add optional checkpoint saving to Trainer class  #48

@blkdmr

Description

The Trainer class in fenn/nn/trainers currently provides a clean training loop abstraction, but it does not offer a built-in way to save model checkpoints during training.

Checkpointing is a common requirement for:

  • long training runs
  • recovering from interruptions
  • inspecting intermediate models
  • selecting the best-performing epoch

This issue proposes adding a simple, optional checkpoint saving mechanism to the Trainer, without changing default behavior.

Goal

Allow users to optionally enable checkpoint saving by specifying:

  • which epochs should be saved
  • a base name for the checkpoint files

If not enabled, the Trainer should behave exactly as it does now.

Proposed behavior

Extend the Trainer configuration with optional parameters such as:

  • checkpoint_epochs: list[int] | None
    • epochs at which a checkpoint is saved (e.g. [5, 10, 20])
  • checkpoint_name: str | None
    • base filename used for checkpoints

When enabled:

  • at the end of a specified epoch, the trainer saves:
    • model state (state_dict)
    • optimizer state (if available)
    • current epoch index

Example filenames:

  • checkpoint_name_epoch_10.pt
  • checkpoint_name_epoch_20.pt
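The saving step described above could be factored into a helper like the following. This is a minimal sketch, not the final implementation; `save_checkpoint` is a hypothetical name, and the dict keys are one reasonable layout:

```python
import torch


def save_checkpoint(model, epoch, base_name, optimizer=None):
    """Hypothetical helper: persist model/optimizer state at the end of an epoch."""
    state = {
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
    }
    # Optimizer state is included only if an optimizer is available.
    if optimizer is not None:
        state["optimizer_state_dict"] = optimizer.state_dict()
    path = f"{base_name}_epoch_{epoch}.pt"
    torch.save(state, path)
    return path
```

Inside the training loop the helper would only be called when checkpointing is enabled and the current epoch is in `checkpoint_epochs`, e.g. `if epoch in checkpoint_epochs: save_checkpoint(...)`.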

Tasks

  1. Inspect the current Trainer implementation in fenn/nn/trainers.
  2. Add checkpoint-related arguments (constructor or config.yaml-based).
  3. Implement checkpoint saving at the end of training epochs.
  4. Ensure:
    • no checkpoints are written unless explicitly enabled
    • existing training workflows remain unaffected
  5. Add minimal documentation or docstring explaining usage.

Acceptance criteria

  • Checkpoint saving is fully optional.
  • Users can control:
    • which epochs are saved
    • the checkpoint file base name
  • Trainer behavior is unchanged when checkpointing is disabled.
  • Saved checkpoints can be reloaded with standard PyTorch APIs.
  • Code remains simple and readable.

Optional (but nice to have)

  • resume-from-checkpoint logic
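Resuming could build directly on standard PyTorch APIs. A round-trip sketch, assuming the checkpoint dict holds `epoch`, `model_state_dict`, and `optimizer_state_dict` (the exact keys are an assumption, not the implemented format):

```python
import os
import tempfile

import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters())

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_epoch_3.pt")
    # Save in the assumed layout: epoch index plus model/optimizer state dicts.
    torch.save({
        "epoch": 3,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, path)

    # Reload with standard PyTorch APIs and pick up where training stopped.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model_state_dict"])
    optimizer.load_state_dict(ckpt["optimizer_state_dict"])
    start_epoch = ckpt["epoch"] + 1  # resume training from the next epoch
```

Because the checkpoint is a plain dict of state dicts, no fenn-specific loading code is required, which satisfies the "reloadable with standard PyTorch APIs" criterion.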

How to contribute

Comment on this issue to claim it, then open a PR with:

  • the implementation
  • a short usage example (code snippet or docstring)
