Moved calculation of mse to common_step and added logging of mean rmse #5
mfroelund wants to merge 7 commits into dmidk:dev/gefion from
Conversation
LGTM.

Nice, thanks for the link. Have you added replies to the comments in this PR? I don't see the comments then :(

@mafdmi Would it be possible to also log the learning rate during training?

I did, but I don't know why you don't see them. Anyway, I just wrote that the comments were maybe redundant. :)

I've added logging of the learning rate, so I think this is ready for final review. You can see the results of my test run at https://localhost:4433/#/experiments/7/runs/6f7eacff595a4515b1892029c368052b/model-metrics

LGTM! :)

@matschreiner I've fixed the tests, but for some reason two of the workflow jobs haven't been run - they have been waiting for a runner to pick them up since yesterday. I tried to re-run them today: https://github.com/mafdmi/neural-lam/actions/runs/13700752255
Sorry @mafdmi I added a lot of comments. Nice that we are logging all the metrics now. I had hoped that we could find a common metric to compare models across runs with different variables. |
I don't get it, but I still don't see your comments! |
neural_lam/models/ar_model.py (outdated)

```diff
         Train on single batch
         """
-        prediction, target, pred_std, _ = self.common_step(batch)
+        prediction, target, pred_std, _, entry_mses = self.common_step(batch)
```
I think common_step should have a more descriptive name - it should reflect its function rather than the fact that it is being shared. Also, it has two responsibilities - prediction with the model and processing of the prediction - could those maybe be factored into separate steps?
Agree, but maybe somebody else can contribute here, since I don't know enough to properly phrase it.
The problem is that this function actually unpacks the batch, performs the unroll prediction step, returns some elements of the unpacked batch along with the prediction, and then calculates metrics - an unclear responsibility that is hard to name and write a good docstring for, and that is what happened. :D A cleaner implementation, I would say, is to just unpack the batch in the steps themselves and perform the prediction there.

So basically, remove the common_step function entirely and replace it by

```python
(init_states, target_states, forcing_features, batch_times) = batch
prediction, pred_std = self.unroll_prediction(
    init_states, forcing_features, target_states
)
entry_mses = ....
```
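A self-contained sketch of that suggested refactor, with a dummy persistence "model" standing in for the real network (all class and variable names here are illustrative stand-ins, not the actual neural-lam code):

```python
class TinyARModel:
    """Stand-in illustrating the suggested refactor: unpack the batch
    in the step itself instead of going through a shared common_step."""

    def unroll_prediction(self, init_states, forcing, target_states):
        # Dummy "model": predict persistence of the initial state.
        prediction = [list(init_states) for _ in target_states]
        pred_std = [1.0 for _ in target_states]
        return prediction, pred_std

    def training_step(self, batch):
        # Unpack the batch directly in the step, as proposed in review.
        init_states, target_states, forcing_features, batch_times = batch
        prediction, pred_std = self.unroll_prediction(
            init_states, forcing_features, target_states
        )
        # Per-entry mean squared errors (the entry_mses of the PR).
        entry_mses = [
            sum((p - t) ** 2 for p, t in zip(pred, tgt)) / len(tgt)
            for pred, tgt in zip(prediction, target_states)
        ]
        return entry_mses
```

With a persistence prediction, the MSE for the first target (equal to the initial state) is 0, and it grows as the target drifts away from it.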
neural_lam/models/ar_model.py (outdated)

```python
        # Logging
        train_log_dict = {"train_loss": batch_loss}
        state_var_names = self._datastore.get_vars_names(category="state")
        train_log_dict |= {
```
This code is hard to read. Maybe wrap it in a function with a descriptive name? I would imagine that train_log_dict should be defined in one line, something like

```python
train_log_dict = {"train_loss": batch_loss, "lr": ..., **rmse_dict}
```
Not sure if it is more readable now, but I made it into one dict. I haven't put it into its own function, since I thought logging was part of the training_step or validation_step. But we can do that if you think it'd be better.
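For reference, a minimal standalone version of the one-dict approach could look like the following (the helper name `build_train_log_dict` and the `rmse_per_var` argument are hypothetical, following the review suggestion, not the actual code):

```python
def build_train_log_dict(batch_loss, lr, rmse_per_var):
    # Assemble everything to log in a single dict expression,
    # merging per-variable RMSE entries under "train_rmse_<var>" keys.
    return {
        "train_loss": batch_loss,
        "train_lr": lr,
        **{f"train_rmse_{v}": r for v, r in rmse_per_var.items()},
    }
```

This keeps the logging payload in one place and avoids incremental `|=` updates scattered through the step.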
neural_lam/models/ar_model.py (outdated)

```python
            f"train_rmse_{v}": mean_rmse_ar_step_1[i]
            for (i, v) in enumerate(state_var_names)
        }
        train_log_dict["train_lr"] = self.trainer.optimizers[0].param_groups[0][
```
Isn't the learning rate only related to training, since it's called "train_lr"?
I'm not sure what you mean. Actually, I'm unsure whether we even have a learning rate for the test and validation steps? If not, then yes, let's simply call it "lr".
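For context, `trainer.optimizers` in Lightning is a list of torch optimizers, and each torch optimizer exposes `param_groups`, a list of dicts that includes an `"lr"` key. A plain-data sketch of that lookup (the classes below are stand-ins mimicking the structure, not a real trainer):

```python
class FakeOptimizer:
    """Stand-in mimicking a torch optimizer's param_groups layout."""

    def __init__(self, lr):
        self.param_groups = [{"lr": lr}]


class FakeTrainer:
    """Stand-in mimicking the trainer's optimizers attribute."""

    def __init__(self, lr):
        self.optimizers = [FakeOptimizer(lr)]


trainer = FakeTrainer(lr=1e-3)
# The same lookup chain used in the PR to fetch the current lr:
current_lr = trainer.optimizers[0].param_groups[0]["lr"]
```

With a scheduler attached, `param_groups[0]["lr"]` reflects the current (possibly decayed) value, which is why reading it each step is useful for logging.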
@leifdenby Do you know why the gpu tests fail in the above action? Is it okay to merge without those tests passing?

@mafdmi I had a last comment, otherwise LGTM

How does it look to you now? I've removed the common_step function now. Makes sense, I think :)
cherry-picked from dmidk#5
Describe your changes

Logging rmse during training, validation and testing.

Issue Link

< Link to the relevant issue or task. > (e.g. closes #00 or solves #00)

Type of change

Checklist before requesting a review

(pull with --rebase option if possible).

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

Author checklist after completed review

reflecting type of change (add section where missing):

Checklist for assignee