
Moved calculation of mse to common_step and added logging of mean rmse #5

Open

mfroelund wants to merge 7 commits into dmidk:dev/gefion from mfroelund:feature/log-rmse-during-train-and-val

Conversation

@mfroelund

Describe your changes

Logging RMSE during training, validation and testing.

Issue Link

< Link to the relevant issue or task. > (e.g. closes #00 or solves #00)

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe their purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog (and designated the change as added, changed or fixed)
  • Once the PR is ready to be merged, squash commits and merge the PR.

@matschreiner
Copy link

LGTM.
Just added comments about comments.
Maybe it's a question of personal style, but still, here's a nice recap of the chapter about comments from the famous book "Clean Code" by Uncle Bob :)
https://medium.com/codex/clean-code-comments-833e11a706dc

@mfroelund
Author

Nice, thanks for the link. Have you added comments about the comments in this PR? In that case, I don't see them :(

@matschreiner

matschreiner commented Mar 5, 2025

@mafdmi Would it be possible to also log the learning rate during training?
I think you can find it at

model.trainer.optimizer.param_groups[0]["lr"]
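
For illustration, a minimal sketch of how that could be logged from a Lightning training_step (the single-optimizer assumption and the _compute_loss helper are placeholders, not code from this branch):

def training_step(self, batch, batch_idx):
    loss = self._compute_loss(batch)  # hypothetical helper, stands in for the real loss computation
    # Lightning exposes the configured optimizers on the trainer; assumes a single optimizer
    lr = self.trainer.optimizers[0].param_groups[0]["lr"]
    self.log("lr", lr, on_step=True)
    self.log("train_loss", loss)
    return loss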

@matschreiner

Nice, thanks for the link. Have you added comments about the comments in this PR? In that case, I don't see them :(

I did, but I don't know why you can't see them. Anyway, I just wrote that the comments were maybe redundant. :)

@mfroelund
Author

I've added logging of the learning rate, so I think this is ready for a final review. You can see the results of my test run at https://localhost:4433/#/experiments/7/runs/6f7eacff595a4515b1892029c368052b/model-metrics

@matschreiner

LGTM! :)

@mfroelund
Author

@matschreiner I've fixed the tests, but for some reason two of the workflow jobs haven't been run - they have been waiting for a runner to pick them up since yesterday. Tried to re-run them today: https://github.com/mafdmi/neural-lam/actions/runs/13700752255

@matschreiner

matschreiner commented Mar 10, 2025

Sorry @mafdmi I added a lot of comments.

Nice that we are logging all the metrics now. I had hoped that we could find a common metric to compare models across runs with different variables.

@mfroelund
Author

Sorry @mafdmi I added a lot of comments.

Nice that we are logging all the metrics now. I had hoped that we could find a common metric to compare models across runs with different variables.

I don't get it, but I still don't see your comments!

Train on single batch
"""
prediction, target, pred_std, _ = self.common_step(batch)
prediction, target, pred_std, _, entry_mses = self.common_step(batch)


I think the common step should have a more descriptive name - it should reflect its function rather than the fact that it is being shared.
Also, it has two responsibilities - prediction with the model and processing of the prediction - which could maybe be factored into separate steps?

Author


Agree, maybe somebody else can contribute here, since I don't know enough to properly phrase it.


matschreiner commented Mar 11, 2025


The problem is that this function is actually unpacking the batch and performing the unroll prediction step, then returning some elements of the unpacked batch together with the calculation, and then calculating metrics - an unclear responsibility that is hard to name and hard to write a good docstring for, which is what happened. :D A cleaner implementation, I would say, is to just unpack the batch in the steps and perform the prediction there.

So basically remove the common_step function entirely and replace it by

(init_states, target_states, forcing_features, batch_times) = batch
prediction, pred_std = self.unroll_prediction(init_states, forcing_features, target_states)  
entry_mses = .... 
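
Sketched out a bit further, a training_step along those lines could look roughly like this (the entry-wise MSE reduction and the assumed tensor layout are illustrative, not taken from this branch):

import torch

def training_step(self, batch, batch_idx):
    # Unpack the batch directly in the step instead of in common_step
    init_states, target_states, forcing_features, batch_times = batch
    prediction, pred_std = self.unroll_prediction(
        init_states, forcing_features, target_states
    )
    # Per-entry MSE, averaged over batch and grid dimensions so one value
    # remains per (time step, state variable) - layout assumed here
    entry_mses = torch.mean((prediction - target_states) ** 2, dim=(0, 2))
    batch_loss = entry_mses.mean()  # placeholder for the actual (weighted) loss
    return batch_loss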

# Logging
train_log_dict = {"train_loss": batch_loss}
state_var_names = self._datastore.get_vars_names(category="state")
train_log_dict |= {


This code is hard to read. Maybe wrap it in a function with a descriptive name? I would imagine that the train_log_dict should be defined in one line, something like

train_log_dict = {"train_loss": batch_loss, "lr": ..., **rmse_dict}

Author


Not sure if it is more readable now, but I made it into one dict. I haven't put it into its own function, since I thought the logging is part of the training_step or validation_step. But we can do that if you think it'd be better.

f"train_rmse_{v}": mean_rmse_ar_step_1[i]
for (i, v) in enumerate(state_var_names)
}
train_log_dict["train_lr"] = self.trainer.optimizers[0].param_groups[0][


Isn't the learning rate only related to training anyway? Then the "train_" prefix in "train_lr" seems redundant.

Author


I'm not sure what you mean... Actually, I'm unsure whether we have a learning rate for the test and validation steps. If not, then yes, let's simply call it "lr".

@mfroelund
Author

@matschreiner I've fixed the tests, but for some reason two of the workflow jobs haven't been run - they have been waiting for a runner to pick them up since yesterday. Tried to re-run them today: https://github.com/mafdmi/neural-lam/actions/runs/13700752255

@leifdenby Do you know why the GPU tests fail in the above action? Is it okay to merge without those tests passing?

@matschreiner

@mafdmi I had one last comment; otherwise LGTM.

@mfroelund
Author

How does it look to you now? I've removed the common_step function. Makes sense, I think :)

observingClouds added a commit to observingClouds/neural-lam that referenced this pull request Oct 8, 2025