
Conversation

@jpdunc23 (Member) commented on Jan 23, 2026

Add TrainStepper, which implements train_on_batch for training Stepper modules. This intermediate PR makes no changes to training configuration. In future PRs, we plan to remove all training-specific stepper config attributes from StepperConfig and use TrainStepperConfig at the top level of TrainConfig.

Changes:

  • Added StepperConfig.get_train_stepper_config, a temporary helper to create a TrainStepperConfig from the training-specific config attributes on StepperConfig (see the sketch after this list).

  • Updated tests.
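
For reference, a minimal sketch of what this helper could look like. The field names (loss, n_ensemble, parameter_init) come from attributes discussed later in this review; the types and layout are assumptions for illustration, not the merged fme implementation.

from dataclasses import dataclass

@dataclass
class TrainStepperConfig:
    # Training-specific options being split out of StepperConfig.
    # Field types are placeholders for the real fme config classes.
    loss: "StepLossConfig"
    n_ensemble: int
    parameter_init: "ParameterInitializationConfig"

@dataclass
class StepperConfig:
    # ... other, non-training stepper attributes elided ...
    loss: "StepLossConfig"
    n_ensemble: int
    parameter_init: "ParameterInitializationConfig"

    def get_train_stepper_config(self) -> TrainStepperConfig:
        # Temporary bridge: forward the training-specific attributes
        # until they move to the top level of TrainConfig in a later PR.
        return TrainStepperConfig(
            loss=self.loss,
            n_ensemble=self.n_ensemble,
            parameter_init=self.parameter_init,
        )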

@mcgibbon (Contributor) left a comment:

In this PR let's put back the existing StepperConfig and have it construct these new smaller configs, so that the coupled code doesn't have to change. Then you can remove it in the next PR updating the coupled code.

n_ensemble: The number of ensemble members evaluated for each training
batch member. Default is 2 if the loss type is EnsembleLoss, otherwise
the default is 1. Must be 2 for EnsembleLoss to be valid.
parameter_init: The parameter initialization configuration.

Contributor:

parameter_init is something we only do at the start of training, so we should be able to move it.

Member Author:

Agreed; I plan to handle this in a future PR. I think it will require some significant refactoring in fme.coupled.

Contributor:

I don't mind this being in another PR, but doesn't it "just work" in fme.coupled? We're keeping the APIs the coupled code is exposed to (the StepperConfig for example) stable in this PR.

If we do this refactor after this PR, let's make sure we do it before we change the config yaml API.
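
Aside: the n_ensemble defaulting rule quoted at the top of this thread boils down to the following check. This is an illustrative sketch; the function and the string-based loss-type test are assumptions, not the fme implementation.

from typing import Optional

def resolve_n_ensemble(loss_type: str, n_ensemble: Optional[int]) -> int:
    # Default is 2 if the loss type is EnsembleLoss, otherwise 1.
    if n_ensemble is None:
        n_ensemble = 2 if loss_type == "EnsembleLoss" else 1
    # EnsembleLoss is only valid with exactly 2 ensemble members.
    if loss_type == "EnsembleLoss" and n_ensemble != 2:
        raise ValueError("n_ensemble must be 2 for EnsembleLoss to be valid.")
    return n_ensemble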

)


class TrainStepper(

Member Author:

I originally had TrainStepperConfig and TrainStepper in a separate file, but decided to put them here to reduce the diff somewhat.

Contributor:

Could move them in a follow-on PR if you like.

@jpdunc23 marked this pull request as ready for review on January 26, 2026, 16:47

@mcgibbon (Contributor) left a comment:

Could you please do a once-over of the refactored objects, and make sure any methods or attributes that aren't used are deleted or made private?


Comment on lines +1406 to +1412
def set_eval(self) -> None:
    for module in self.modules:
        module.eval()

def set_train(self) -> None:
    for module in self.modules:
        module.train()

Member Author:

These were previously implemented on TrainStepperABC. They are required on Stepper because CoupledStepper uses them.

Contributor:

nit: if you removed these from the Stepper, CoupledStepper and TrainStepper would still work, because their existing implementations of set_train and set_eval (from the generic class) access the .modules attribute on Stepper (which is what's really needed) through their own implementations of .modules (which is what the TrainStepperABC code was accessing).

However, that version of the code is also a little confusing (which, for example, led to the impression this wouldn't work, or is giving me the false impression that it would work, if I'm just wrong). I'm not sure which is better, since this one does expose more API, which might be confusing for different reasons.

@jpdunc23 (Member Author) commented on Feb 2, 2026:
I agree this should work, but I personally prefer this more verbose version. Let me know if you'd prefer I bring back the TrainStepperABC method, in which case I will make it @final (previously it was overridden by CoupledStepper).
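
For context, the generic-base-class version described in the nit above would look roughly like the sketch below. The class and method names follow the thread; the body is an assumption about, not a copy of, the removed TrainStepperABC code.

from abc import ABC, abstractmethod
from torch import nn

class TrainStepperABC(ABC):
    @property
    @abstractmethod
    def modules(self) -> list[nn.Module]:
        # Subclasses (e.g. a stepper that wraps another stepper) expose
        # their torch modules here; set_eval/set_train need nothing else.
        ...

    def set_eval(self) -> None:
        for module in self.modules:
            module.eval()

    def set_train(self) -> None:
        for module in self.modules:
            module.train()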

@jpdunc23 (Member Author):

> Could you please do a once-over of the refactored objects, and make sure any methods or attributes that aren't used are deleted or made private?

I've done so. As mentioned in #767, Stepper.predict could be made private, but we should discuss it at the next ACE technical sync.


return self._loss_normalizer

@property
def loss_obj(self) -> StepLoss:

Contributor:

Issue: loss_obj is training-specific and should be on the training stepper; we don't need it for inference. The coupled code uses TrainStepper, so it should be agnostic to this choice.

Same with effective_loss_scaling.

If for some reason we really did want to keep this information with the inference stepper, it would go on the training history, since multiple losses can get used.

@jpdunc23 (Member Author) commented on Jan 30, 2026:

This is planned for a future PR. The coupled code uses Stepper, not TrainStepper. This is intentional, since CoupledStepper doesn't use TrainStepper.train_on_batch. For the time being I'm leaving it on Stepper to avoid breaking the current CoupledStepper implementation.

Tasks I have in mind for future PR(s) are:

  1. Add CoupledTrainStepper, analogous to TrainStepper (started by you in #754, "Decouple coupled stepper training and inference")
  2. Update the coupled TrainConfig to support direct configuration of each component's loss and parameter_init, rather than relying on these attributes from StepperConfig, as CoupledStepper currently does.
  3. With this done, I plan to simultaneously remove loss_obj from Stepper and loss from StepperConfig.

There is a way to do 3 without first doing 1 and 2 by building each component's StepLoss directly from the loss: StepLossConfig attribute on StepperConfig, but I'd prefer to avoid this temporary workaround since it mostly won't survive in later stages of refactoring.

That said, task 2 will also involve building the component loss objects in the coupled stepper code, so if you feel strongly then I'm willing to give it a shot and hopefully some of the effort will be worthwhile later.

Contributor:

> This is planned for a future PR. The coupled code uses Stepper, not TrainStepper.

I think this is the correct final state, but it’s not necessary for this PR and is getting in the way of making the changes needed in ace. The coupled code currently depends on the stepper in ace that supports training. I don’t think that should be changed in this PR. We should make that change after more of 1-3 is complete.

Member Author:

I'm trying to scope this out but meeting a lot of resistance. I don't see a simple way to have CoupledStepper use TrainStepper without significant refactors in fme.coupled that I think are better left to the next PR.

  • While it's true that CoupledStepper was using a "training" Stepper before, TrainStepper is a very different object from what Stepper was. On the other hand, aside from no longer having a train_on_batch method, Stepper is basically unchanged so I think CoupledStepper should still use it for the time being.
  • To do what you want, CoupledStepperConfig would already have to use TrainStepperConfig in this PR. This is planned but involves significant refactoring in fme.coupled. I think we should wait until a later PR to do this refactoring.

Contributor:

After looking at the code a bit I'm realizing this boils down to the choice to update the load_stepper function to return the more minimal inference-stepping class within this PR. I don't think at this point we should revert that change, so we can move forward with the way you had planned to do it.

You would have had more freedom to define the separation between the two classes in this PR if the load_stepper update weren't combined with it in the same PR (i.e. if load_stepper returned an object with the same API as it does in main), but we can move forward with the path you're currently on in this PR.
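
For readers following along: the load_stepper change being referenced is roughly the shape sketched below. The signature, checkpoint key, and from_state constructor are all assumptions for illustration, not the actual fme API.

import torch

def load_stepper(checkpoint_path: str) -> "Stepper":
    # After this PR, loading a checkpoint yields the minimal
    # inference-stepping Stepper; training concerns (optimizer, loss
    # configuration) live on TrainStepper instead.
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    # Hypothetical constructor; the real checkpoint layout may differ.
    return Stepper.from_state(checkpoint["stepper"])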


@mcgibbon (Contributor) left a comment:

LGTM!

@jpdunc23 enabled auto-merge (squash) on February 2, 2026, 23:47
@jpdunc23 merged commit faabca9 into main on Feb 3, 2026
7 checks passed
@jpdunc23 deleted the refactor-stepper branch on February 3, 2026, 00:02