
Moved calculation of mse to common_step and added logging of mean rmse #5

Open

mfroelund wants to merge 7 commits into dmidk:dev/gefion from mfroelund:feature/log-rmse-during-train-and-val

Conversation

@mfroelund

Describe your changes

Logging RMSE during training, validation and testing.

Issue Link

< Link to the relevant issue or task. > (e.g. closes #00 or solves #00)

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe their purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog (and designated the change as added, changed or fixed)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@matschreiner
Copy link

LGTM.
Just added comments about comments.
Maybe it's a question of personal style, but still, here's a nice recap of the chapter about comments from the famous book "Clean Code" by Uncle Bob :)
https://medium.com/codex/clean-code-comments-833e11a706dc

@mfroelund
Author

Nice, thanks for the link. Have you added comments about the comments in this PR? In that case, I don't see them :(

@matschreiner

matschreiner commented Mar 5, 2025

@mafdmi Would it be possible to also log the learning rate during training?
I think you can find it at

model.trainer.optimizer.param_groups[0]["lr"]
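
For illustration, a minimal sketch of how that could be logged from a Lightning training_step (the single-optimizer assumption and the _compute_loss helper are placeholders, not code from this branch):

def training_step(self, batch, batch_idx):
    loss = self._compute_loss(batch)  # hypothetical helper, stands in for the real loss computation
    # Lightning exposes the configured optimizers on the trainer; assumes a single optimizer
    lr = self.trainer.optimizers[0].param_groups[0]["lr"]
    self.log("lr", lr, on_step=True)
    self.log("train_loss", loss)
    return loss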

@matschreiner

Nice, thanks for the link. Have you added comments about the comments in this PR? In that case, I don't see them :(

I did, but I don't know why you can't see them. Anyway, I just wrote that the comments were maybe redundant. :)

@mfroelund
Author

I've added logging of the learning rate, so I think this is ready for a final review. You can see the results of my test run at https://localhost:4433/#/experiments/7/runs/6f7eacff595a4515b1892029c368052b/model-metrics

@matschreiner

LGTM! :)

@mfroelund
Author

@matschreiner I've fixed the tests, but for some reason two of the workflow jobs haven't been run - they have been waiting for a runner to pick them up since yesterday. Tried to re-run them today: https://github.com/mafdmi/neural-lam/actions/runs/13700752255

@matschreiner

matschreiner commented Mar 10, 2025

Sorry @mafdmi I added a lot of comments.

Nice that we are logging all the metrics now. I had hoped that we could find a common metric to compare models across runs with different variables.

@mfroelund
Author

Sorry @mafdmi I added a lot of comments.

Nice that we are logging all the metrics now. I had hoped that we could find a common metric to compare models across runs with different variables.

I don't get it, but I still don't see your comments!

Train on single batch
"""
prediction, target, pred_std, _ = self.common_step(batch)
prediction, target, pred_std, _, entry_mses = self.common_step(batch)


I think the common step should have a more descriptive name - it should reflect its function rather than the fact that it is being shared.
Also, it has two responsibilities - prediction with the model and processing of the prediction - which could maybe be factored into separate steps?

Author


Agree, maybe somebody else can contribute here, since I don't know enough to properly phrase it.


matschreiner commented Mar 11, 2025


The problem is that this function is actually unpacking the batch and performing the unroll prediction step, then returning some elements of the unpacked batch together with the calculation, and then calculating metrics - an unclear responsibility that is hard to name and hard to write a good docstring for, which is what happened. :D A cleaner implementation, I would say, is to just unpack the batch in the steps and perform the prediction there.

So basically remove the common_step function entirely and replace it by

(init_states, target_states, forcing_features, batch_times) = batch
prediction, pred_std = self.unroll_prediction(init_states, forcing_features, target_states)  
entry_mses = .... 
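
Sketched out a bit further, a training_step along those lines could look roughly like this (the entry-wise MSE reduction and the assumed tensor layout are illustrative, not taken from this branch):

import torch

def training_step(self, batch, batch_idx):
    # Unpack the batch directly in the step instead of in common_step
    init_states, target_states, forcing_features, batch_times = batch
    prediction, pred_std = self.unroll_prediction(
        init_states, forcing_features, target_states
    )
    # Per-entry MSE, averaged over batch and grid dimensions so one value
    # remains per (time step, state variable) - layout assumed here
    entry_mses = torch.mean((prediction - target_states) ** 2, dim=(0, 2))
    batch_loss = entry_mses.mean()  # placeholder for the actual (weighted) loss
    return batch_loss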

# Logging
train_log_dict = {"train_loss": batch_loss}
state_var_names = self._datastore.get_vars_names(category="state")
train_log_dict |= {


This code is hard to read. Maybe wrap it in a function with a descriptive name? I would imagine that the train_log_dict should be defined in one line, something like

train_log_dict = {"train_loss": batch_loss, "lr": ..., **rmse_dict}

Author


Not sure if it is more readable now, but I made it into one dict. I haven't put it into its own function, since I thought the logging is part of the training_step or validation_step. But we can do that if you think it'd be better.

f"train_rmse_{v}": mean_rmse_ar_step_1[i]
for (i, v) in enumerate(state_var_names)
}
train_log_dict["train_lr"] = self.trainer.optimizers[0].param_groups[0][


Isn't the learning rate only related to training anyway? Then the "train_" prefix in "train_lr" seems redundant.

Author


I'm not sure what you mean... Actually, I'm unsure whether we have a learning rate for the test and validation steps. If not, then yes, let's simply call it "lr".

@mfroelund
Author

@matschreiner I've fixed the tests, but for some reason two of the workflow jobs haven't been run - they have been waiting for a runner to pick them up since yesterday. Tried to re-run them today: https://github.com/mafdmi/neural-lam/actions/runs/13700752255

@leifdenby Do you know why the GPU tests fail in the above action? Is it okay to merge without those tests passing?

@matschreiner

@mafdmi I had one last comment; otherwise LGTM.

@mfroelund
Author

How does it look to you now? I've removed the common_step function. Makes sense, I think :)

observingClouds added a commit to observingClouds/neural-lam that referenced this pull request Oct 8, 2025