-
Notifications
You must be signed in to change notification settings - Fork 58
Lfric2lfric multi cpu #201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Lfric2lfric multi cpu #201
Conversation
|
You need to add yourself to the contributors file (alphabetical order) You say that you tested it to test that it works, how did you test it? As I recall you were using a input mesh file that had inter-grid maps. So did it not fail in the partitioning? Looks like you just changed the number of mpis used on existing tasks. The KGOs have changed, though I be suspicious of some of the checksums. There are a lot of differences, some only the last few digits, though some much larger. I think you need to provide some plots comparing 1 cpu vs multiple cpus to show that the fields are not significantly impacted. |
Added myself to the contributor list (added myself in a previous PR, but it has not been merged yet). I also fixed a detail in the new error message added to the code. I modified the existing lfric2lfric rose stem tests, so that three of the jobs run with multiple CPUS: ral_seuk-C32_lam_MG, clim_gal9-C24_C12, and oasis_clim_gal9-C24_C12. That is the reason why there are KGO changes. I am afraid the results change slightly with the partitioning. |
| help=In the lfric2lfric program, this variable indicates the | ||
| =mesh for which the described partitioning is taking place. | ||
| = | ||
| =The only valid values of this variable in lfric2lfric | ||
| =are 'source' to indicate the source grid, and 'destination' | ||
| =to indicate the destination grid. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| help=In the lfric2lfric program, this variable indicates the | |
| =mesh for which the described partitioning is taking place. | |
| = | |
| =The only valid values of this variable in lfric2lfric | |
| =are 'source' to indicate the source grid, and 'destination' | |
| =to indicate the destination grid. | |
| help=The designation of the mesh to which this | |
| =partitioning profile is to be applied. | |
| = | |
| =Recognised designations are: | |
| = * 'source' | |
| = * 'destination' |
| 'Partitioning parameters for the source mesh were not found, '// & | ||
| 'please specify a partitioning namelist with mesh_target=source.' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| 'Partitioning parameters for the source mesh were not found, '// & | |
| 'please specify a partitioning namelist with mesh_target=source.' | |
| 'Source mesh partitioning namelist (partitioning:source) not found.' |
If the code cannot find the required partitioning profile namelist, then that is all it should state, the code doesn't know why it can't be found, so it cannot suggest a solution. For example, if this error message was given for Marks problem, he would have checked the namelist file which was not at fault. This message also would have to be updated if a different key instance member was used.
The Rose metadata is enough for the user to be able to check their configuration settings.
|
|
||
| if (.not. configuration%namelist_exists('partitioning', 'destination')) then | ||
| write( log_scratch_space, '(A)' ) & | ||
| 'Partitioning parameters for the source mesh were not found, '// & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| 'Partitioning parameters for the source mesh were not found, '// & | |
| 'Destination mesh partitioning namelist (partitioning:destination) not found.' |
As before, the code doesn't know why the namelist is not found. Also the error message still referred to the source mesh.
I think you should edit the contributors list anyway, as it will block this change going on in the event that this PR is ready to commit first. Please refer to my comments in the Scitech review. Please provide some evidence that there is not significant changes in the output (plots) |
|
I wouldn't be concerned about speed at this juncture, My comments where based of the KGO differences and attribution of them. You haven't shown what the 1 CPU Oasis regridded method produces or is that not possible? If the differences are in the long/lats please provide example magnitudes? or the differences? |
We looked in the past that the map and the oasis method gave the same results with one CPU, and this is still true although I did not include the plot. We find the same here with 4 CPUs. I obtain lon/lat differences if I use ncdiff to generate the differences, but when I look at the actual longitude and latitude values in the lfric2lfric output using 1 or multiple CPUs, they are actually the same. I do not know why ncdiff behaves like this. |
ncdiff appears to checks differences at a lower tolerance than just dumping the output. Again, please post the differences in the lat lon values (a representative sample) |
There are no differences in lat lon values when comparing the values directly. There are some differences when comparing using ncdiff. I am confident that the lat lon values is not the problem. I suspect the problem is how lfric reads different types of data, I am preparing a test to check this theory. |
|
As requested, please post the differences in the values you have from ncdiff. |
Taken directly from ncdiff: Mesh2d_face_x = -0.7626276, -0.6977054, -0.6327863, -0.5678701, -0.502957, Mesh2d_face_y = 51.98213, 51.98115, 51.98013, 51.97908, 51.97799, 51.97687, |
|
No, that's not it, have you used nccmp? I want to see the differences in the lat lon values in the output. and similarly in the global mesh files you are using as input. |
So far I only used ncdiff. Differences with nccmp are, for the regional model: Variable Group Count Sum AbsSum Min Max Range Mean StdDev and for the global model: Variable Group Count Sum AbsSum Min Max Range Mean StdDev |
Sorry Juan, |
|
The output for 1 and multiple CPUs are on VDI: global to global 1 CPU 6 CPUs regional to regional 1 CPU 4 CPUs |





PR Summary
Sci/Tech Reviewer: @mo-rickywong
Code Reviewer: @TeranIvy
Change the type of the partitioning key instance member for lfric2lfric, so that lfric2lfric can run with more than one CPU
Code Quality Checklist
Testing
No other tests where performed, just modified existing tests so that the new functionality could be tested. Therefore, there were KGO changes that were addressed.
trac.log
Test Suite Results - lfric_apps - test_lfric2lfric_multi_CPU/run1
Suite Information
Task Information
✅ succeeded tasks - 1456
Security Considerations
Performance Impact
AI Assistance and Attribution
Documentation
PSyclone Approval
Sci/Tech Review
Please detail how you tested the functionality? i.e. what modifications did you make to test it? doesn't seem like the rose-stem test should pass given the partitioning checks with the presence of inter-grid maps in the input mesh file.
KGOs changes are numerous and not all insignificant, please provide plots/evidence that output from multiple cpus is not significantly different from runs with multiple cpus (plots?).
Code Review