Add the ability to save electric and adjoint fields to disk (ESolnManager) #42
MiCurry wants to merge 21 commits into magnetotellurics:main
Conversation
This commit updates UserCtrl to allow a new, final, optional argument: -P. This argument takes a character string, [file_prefix]. When it is passed to ModEM, ModEM sets `ctrl % storeSolnsInFile` to `.true.` and sets `ctrl % prefix` to [file_prefix]. This commit also updates the `search_args` do loop in `parseArgs` so it better handles additional final optional arguments; it should now be easy to expand it to accept more optional arguments. Just ensure you set `optArgSize` correctly: for single, standalone options without an argument, `optArgSize` should be 1; options that take an argument should set `optArgSize` to 2. This commit also updates the user guide with the details of this new optional argument for inversion.
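The `optArgSize` idea can be sketched as follows (a Python illustration, not the Fortran parseArgs code; the dictionary and variable names are assumptions):

```python
# Sketch of the optArgSize pattern: the parse loop advances by each
# option's argument count, so a new trailing optional argument only
# needs a correct size entry.
OPT_ARG_SIZE = {"-P": 2}   # option plus one argument; standalone flags -> 1

def parse_args(argv):
    ctrl = {"storeSolnsInFile": False, "prefix": None}
    i = 0
    while i < len(argv):
        opt = argv[i]
        size = OPT_ARG_SIZE.get(opt, 1)
        if opt == "-P":
            ctrl["storeSolnsInFile"] = True
            ctrl["prefix"] = argv[i + 1]   # the [file_prefix] argument
        i += size                          # skip past option and its argument
    return ctrl
```

Adding another option with an argument would then only require one more entry in `OPT_ARG_SIZE` plus a handler branch.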
This commit creates three new routines in SolnSpace that allow reading and writing of solnVector_t types. The routines read and write the active transmitter and the active polarization of the passed-in solnVector_t. These routines will be used to save partial solutions; thus, their interface is perhaps a bit limited. It may be appropriate to have a routine that takes the desired iTx and polarization and allocates a solnVector_t, rather than requiring that it already be allocated and created. I'll think about that, but for now this should be sufficient.
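The read/write pattern these routines follow can be sketched like this (Python with `pickle` standing in for the Fortran binary I/O; the names are hypothetical, not the SolnSpace signatures):

```python
import pickle

def write_soln(path, i_tx, pol, field):
    """Write only the active transmitter/polarization of a solution vector,
    mirroring the idea that these routines save partial solutions."""
    with open(path, "wb") as f:
        pickle.dump({"i_tx": i_tx, "pol": pol, "field": field}, f)

def read_soln(path):
    """Read a partial solution back; the caller supplies the container,
    as the PR's routines currently require a created solnVector_t."""
    with open(path, "rb") as f:
        return pickle.load(f)
```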
This commit creates a ModEM_mpi_context_t type which will be used to store and pass around MPI communicators to the ESolnManager. If it works well, perhaps in future merges it can be used in other areas of ModEM. As of this commit, the ModEM_mpi_context_t is not used, but it will be in a future commit.
This commit allows `solnVectors_t` and `cvectors_t` to be created as 'placeholders'. These 'placeholder' `solnVectors_t` and `cvectors_t` will only contain the metadata describing the electric solution and won't contain any data. In conjunction with the `ESolnManager`, this will allow the main task to *not* allocate eAll, saving a bit of unnecessary memory.
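The placeholder idea can be sketched as follows (a Python illustration with hypothetical names; the real types are Fortran derived types):

```python
class SolnVectorPlaceholder:
    """Metadata-only stand-in for a solution vector.

    Mirrors the idea in this commit: the main task keeps the transmitter
    index, polarization count, and grid shape, but never allocates the
    (potentially large) field array itself.
    """
    def __init__(self, i_tx, n_pol, grid_shape):
        self.i_tx = i_tx
        self.n_pol = n_pol
        self.grid_shape = grid_shape
        self.data = None          # no allocation: this is the memory saving

    def nbytes_if_allocated(self):
        """Bytes a full allocation would have cost (16 per double complex)."""
        n = self.n_pol
        for d in self.grid_shape:
            n *= d
        return 16 * n
```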
This commit simply adds the ESolnManager to the ModEM code base and adds it to the CMakeLists.txt. The ESolnManager will serve as a manager of electric and adjoint fields for the main task and the worker tasks. Depending on the arguments passed in, it will either have the worker tasks communicate the electric and adjoint fields by writing and reading them to disk, or it will send the fields to and from the main task using MPI (the current ModEM implementation). Perhaps in a future commit the ESolnManager will be responsible for allocating and deallocating the solnVectors_t.
This commit adds error checking when reading and writing cvectors. Now that these routines will be used more often, it will be beneficial to ensure that these routines complete successfully.
This commit adds a 'trial' logical to the define_worker_job type. This logical will be used in future commits to signal to worker tasks that a trial is being calculated. This will give the ESolnManager the ability to save files with an appropriate filename (e.g. labeling them as 'trial').
This commit adds the PQMult and DataResp master and worker jobs. These routines will be used with the ESolnManager implementation. It also updates create_data_vec_place_holder and associated routines so that they can take a range of data instead of just one.
This commit initializes the ModEM context in Constructor_MPI.
This commit adds a new flag to func and to gradient. The flag in func can be used to signal to the worker tasks that a trial forward is being calculated; the worker_job will later use this to write out its electric field to a file with the appropriate prefix. When we use the initial guess in the line search, we will also need to tell the JMultT calculations to read the trial electric field, rather than the fields from the line search computations. Thus, we need to tell gradient, and JmultT, to use the starting guess. Worker jobs will then know whether they should read from the trial file or not.
This commit finally integrates the ESolnManager into ModEM. As of this commit, it generates bit-for-bit results when running with or without saving the electric field solutions.
|
This branch works with the SP and SP2 forward formulations; however, it does not work if you use the Fine-Grained (FG) version of SP2. I'll look into resolving that. |
This commit updates the Main_MPI code that was added in previous commits so that the new MPI code associated with the ESolnManager works with SP2 FG. To do this, it updates DataResp and PQMult to post their MPI sends and receives not on MPI_COMM_WORLD, but on either comm_world or comm_leader, depending on the para_method. For both DATARESP and PQMULT, the work is sent to the leader tasks; because these tasks are rather simple and fast, the leader tasks do not send the worker tasks anything. Running with FG on equal processor counts, with or without -P, gives bit-for-bit results. It appears that running with FG at different processor counts does not give bit-for-bit results; this may be an issue that predates this work. Running without FG at different processor counts gives the same results, but there are differences when running with and without FG. I'll open an issue and investigate that further.
Force-pushed from 2ec752e to 593076d
|
I think this PR is ready for a review! It now works with MF, SP, SP2, and SP2 Fine-Grained! See the note in the original comment about the FG method not producing bit-for-bit results; I'll open an issue about that. I still have not tested this when running with GPU, but I will do that and add any updates. Since this works with all the forward formulations and the FG branch, I think it's ready for a review from others! @akelbert and @dong-hao feel free to give it a review if you have the time! |
|
Also, it's worth noting that I still haven't done much evaluation of the impact of the IO when all the tasks are writing out. I have done some preliminary memory usage comparisons, but I want to do some more formal memory consumption evaluations. I'll evaluate the runtime impact of the IO when I do that evaluation. |
f90/MPI/Main_MPI.f90 (outdated)
```fortran
remainder = modulo(nTx, size_leader)
iTx_max = 0

do dest = 1, size_leader
```
I think this should be size_leader - 1?
indeed, but we should think about how we can simplify the comm-related parameters.
f90/MPI/Main_MPI.f90 (outdated)
```fortran
cycle
end if

do dest = 1, size_leader
```
I think this should be size_leader - 1?
indeed, but we should think about how we can simplify the comm-related parameters.
f90/MPI/Main_MPI.f90 (outdated)
```fortran
comm_current = comm_leader
end if

modem_ctx % comm_current = comm
```
I think modem_ctx % comm_current should be set to comm_current
f90/MPI/Main_MPI.f90 (outdated)
```diff
 !----------------------------------------------------------------------------
 !########################## Master_job_DataResp ############################
-subroutine Master_job_DataResp(nTx, sigma, d, trial)
+subroutine Master_job_DataResp(nTx, sigma, d, trial, comm)
```
This looks strange here: we are passing in comm, but perhaps not using it correctly. See the comment below. Do we even need to pass in comm here?
Thanks @MiCurry for the update!
Hmmm, normally we don't need an MPI_COMM input, as it is a module parameter that can be accessed by anyone in the module. I think it would generally be safer to just use comm_current from the module.
There are rare exceptions, of course, when doing fine-grained parallel calculations, as the worker processes need to coordinate with each other outside the Main MPI module.
I can take a look at this, but it may take some time, as I need to finish a report tomorrow.
Thanks for the comment @dong-hao and info!
It will be great to have your review, but don't feel rushed. We're not working under any deadlines, so there is no rush.
The last commit to allow SP2 to work broke the DataResp when using 2, 3, and 4 procs, and potentially other counts. The culprit was using size_leader instead of size_leader - 1 in places.
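The off-by-one can be illustrated with a sketch (Python, with a hypothetical helper, not the actual Main_MPI code): with the master on rank 0 of a communicator of size `size_leader`, the valid destination ranks are 1 through `size_leader - 1`:

```python
def distribute_tx(n_tx: int, size_leader: int):
    """Assign transmitters round-robin to worker ranks 1..size_leader-1.

    Rank 0 is the master, so looping `dest` up to size_leader inclusive
    (as Fortran's `do dest = 1, size_leader` does) would address a rank
    that does not exist -- the off-by-one fixed in this PR.
    """
    n_workers = size_leader - 1          # the master does not take a job
    assignments = {dest: [] for dest in range(1, size_leader)}
    for i_tx in range(n_tx):
        dest = 1 + (i_tx % n_workers)    # valid ranks are 1..size_leader-1
        assignments[dest].append(i_tx)
    return assignments
```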
Force-pushed from 06dd68a to 5dfdd81
|
I have confirmed that this branch works when building and running with CUDA. |
dong-hao left a comment:
Thanks @MiCurry for the update - I finally got some time to test the code.
I can confirm that the -P mode runs without problem for both the SP and the MF, at least for the "optimal parallel setup", i.e. using N+1 processes for N jobs. For the Cascadia case that is 21/20.
However, in some (not all that extreme) edge cases, it threw errors when doing the adjoint calculation:
```
node[001]: total time elapsed (sec): 17.302900
FORWARD: Receive Per # 9 and Pol 2 from 1
FORWARD: TX finished # 20 out of 20
node[001]: MPI TASK [ DATARESP] received from 0
node[003]: MPI TASK [ DATARESP] received from 0
node[002]: MPI TASK [ DATARESP] received from 0
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
node[002]: Waiting for a message from Master
node[002]: total time elapsed (sec): 15.374959
node[003]: Waiting for a message from Master
node[003]: total time elapsed (sec): 17.953275
node[001]: Waiting for a message from Master
node[001]: total time elapsed (sec): 17.302900
[Haos-Mac-mini:00000] *** An error occurred in MPI_Recv
[Haos-Mac-mini:00000] *** reported by process [2488401921,0]
[Haos-Mac-mini:00000] *** on communicator MPI_COMM_WORLD
[Haos-Mac-mini:00000] *** MPI_ERR_RANK: invalid rank
[Haos-Mac-mini:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[Haos-Mac-mini:00000] *** and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
prterun has exited due to process rank 0 with PID 0 on node Haos-Mac-mini calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
--------------------------------------------------------------------------
```
I was using 4 processes on my Mac mini for 20 jobs in this case. Apparently, as @MiCurry pointed out, size_leader - 1 should be used (instead of size_leader) when the master sends out the DataResp and PQmult jobs to the leaders/workers. The adjoint calculation runs without problem after this correction.
I guess the 21/20 setup worked by sheer luck as the default size_leader = size_world ...
On the other hand, we are now mixing the usage of different communicators. We need to think about the logic and clarify the usage of those (comm_world, comm_leader, comm_local, etc.) across the related modules.
Also, we need to follow the current logic of the job system for the new jobs (DataResp and PQmult). That said, we may want to reconsider this framework; a more consistent and flexible job system would be preferable. In the current logic there are quite a few compromises here and there.
Hao
```fortran
type(define_worker_job), save :: worker_job_task
!********************************************************************

type :: ModEM_mpi_context_t
```
If we want to unify the MPI-related parameters with the datatype MODEM_MPI_CONTEXT_T here, we probably also want to remove the scattered comm_world, comm_local, etc. in the declaration. We also need to include the comm_leader-related handles in the data type.
We can leave that to another PR, though.
```fortran
nTx = d1%nTx
starttime = MPI_Wtime()
! over-ride the default communicator, if needed
if (present(comm)) then ! given communicator
```
The optional "comm" here (and in other forward/adjoint-related subroutines) is relevant to the fine-grained methods (FG and PETSc). Essentially it just tells the subroutine "who calls you" and "whom you should communicate with"; it is not easy to determine that without the input.
I hope we can remove the comm-related trouble (which also causes some extra logic in the EMsolve3D module). But we need to think carefully about that (and leave it to another PR)!
f90/MPI/Main_MPI.f90 (outdated)
```fortran
remainder = modulo(nTx, size_leader - 1)
iTx_max = 0

do dest = 1, size_leader - 1
```
I think it is probably better to use comm_current/size_current to avoid any mismatches: the master can talk with all workers (comm_world) or only the leaders (comm_leader).
The MPI_SEND below is already using comm_current.
f90/MPI/Declaration_MPI.f90 (outdated)
```fortran
logical :: keep_E_soln=.false.
logical :: several_Tx=.false.
logical :: create_your_own_e0=.false.
logical :: trial=.false.
```
"trial" doesn't sound very informative...
Agreed, and it's used twice, to signal the task to perform two different, but related, events:

- Write out your electric field to a file with 'trial' appended to its filename.
  - This indicates that this forward computation is the trial line search calculation.
- Read the 'trial'-appended filename for your JmultT calculation.

The Stable-NCI branch does things a little differently to handle this:

- It has a `Master_job_update_prefix` to tell worker jobs to update the prefix written out for the electric field solution.
  - It does this when calculating the line search's trial, so we can backtrack if the trial is better than the line search.
- If the trial is a better choice in the line search, then the master job renames the 'trial.prefix.*.esoln' files to `prefix.*.esoln`.
- The worker tasks always read in `prefix.*.esoln`.

Perhaps that is a better way? Any thoughts @dong-hao?
Thanks @MiCurry for the thoughts. Either is fine, as those solutions are no longer used after the current iteration. The good part of the NCI approach is the (slightly) lower disk usage, plus saving the final model response for the last iteration as a bonus.
However, this ultimately also depends on which inversion algorithm/search method you are using, i.e. we may need to deal with solns from more than two forward (FWD) and one adjoint (JmultT) evaluations. I know that you have just implemented the NLCG part (that was the only inversion algorithm by the time the NCI people got involved), but I guess it is better to fit more "general" cases.
For Anna's quadratic line search method (default for NLCG), we need two FWD and one JmultT solutions - the "trial" convention would be good enough ("trial" is called "startup" in Anna's convention). This is also OK for my BFGS Wolfe line search (same two FWD and one JmultT solutions) and Wolfe2 line search (normally only one FWD and one JmultT solutions) for most of the time.
However, it is possible that the quadratic line search fails to satisfy the Armijo condition in NLCG, and falls back to cubic line search, which requires three FWD calculations. And for the Wolfe2 line search, it is also possible (~5 % chance) that the f_1 (FWD) and g_1 (JmultT) do not satisfy the Wolfe condition, leading to two FWD and two JmultT solutions.
In other words, our naming convention and the write/load mechanism will need to deal with three (or more) forward solutions and two JmultT solutions! NOTE that we have not considered DCG inversion (which needs even more JmultTs). We can leave that later...
So I think a straightforward idea is to just use numbers...
say, something like:
esoln.FWD01.iTx.0001.Ex.cvec
esoln.FWD02.iTx.0001.Ex.cvec
esoln.JmultT01.iTx.0001.Ex.cvec
esoln.JmultT02.iTx.0001.Ex.cvec
My grain of salt, of course. But we do need to think about that carefully - Not sure if @akelbert has a better idea.
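A minimal formatter for the numbered naming idea above might look like this (a Python sketch; the function name and field widths are assumptions, not an agreed convention):

```python
def esoln_filename(kind: str, n_eval: int, i_tx: int, component: str) -> str:
    """Build a numbered solution filename, e.g. esoln.FWD01.iTx.0001.Ex.cvec.

    `kind` is the evaluation type (FWD, JmultT), `n_eval` the evaluation
    counter within the iteration, `i_tx` the transmitter index, and
    `component` the field component (Ex, Ey, Ez).
    """
    return f"esoln.{kind}{n_eval:02d}.iTx.{i_tx:04d}.{component}.cvec"
```

Numbering the evaluations this way would cover the three-FWD/two-JmultT cases described above without any renaming step.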
For a detailed info of the Wolfe2 line search, see my notes in the BFGS algorithm:
```
! Note: inexact line searches ultimately fit into two catalogs -
! i.e. Armijo–Goldstein (back-tracking) and Wolfe conditions
! the latter also contains a few different criteria: (1) "Armijo
! rule", and (2) "curvature condition", the latter requires an extra
! gradient calculation at x_k + alpha_k*d_k
! (1) and modified (2) will form the "strong Wolfe" condition which is
! the line search "convergence criteria" here
! see lineSearchWolfe subroutine for details of Anna's line search method
! (Armijo)
!
! the really good part for LBFGS is, its search directions are sort of
! "normalized". so there is a pretty good chance that the first guess (one)
! is already a good search step that satisfies Wolfe's rule.
! For most of the time, it is not even necessary to do a line "search".
! Also, if one examines the LBFGS line search (as in lineSearchWolfe), it
! is quite often that the f at quadratic interpolation does not improve
! too much when compared with the first guess f_1.
!
! So here is my idea:
!
! when we have evaluated the penalty function for the initial guess (f_1),
! we immediately calculate the gradient (g_1) at the initial guess.
! 1) if f_1 and g_1 satisfy the Wolfe's rule, then we just proceed to the
! next iteration. The calculation cost is 1 penalty function evaluation
! and 1 gradient calculation. Everyone is happy(?)
!
! 2) if f_1 and g_1 do not satisfy Wolfe's rule, we use a quadratic interp
! to find a minimum (f), and calculate the gradient (g) at the point.
! if f and g satisfy the rule, then we proceed to the next iteration.
! The cost is 2 penalty function evaluations and 2 gradient calculations.
! this is actually worse than Anna's scheme (two f and one grad).
! but don't worry yet, normally this should be a rare case (<5 %)
!
! 3) if neither of the 1) nor 2) is satisfied, the scheme falls back to the
! bracketing and sectioning with my variant of More-Thuente
! scheme
!
! following Anna's idea, the major intention here is to use a cheap line
! search scheme (with merely 2 forward-like-calculations) to quickly skip
! to a small overall penalty function level and only to start bracketing
! when the quadratic interpolation doesn't work, in which case cubic
! probably won't work either...
!
! in practice, wolfe2 should be slower (as the line search is not exact)
! in convergence (but still faster in time as the line search scheme is
! 1/3 faster).
! this may lead to quicker stop (or the update of the damping factor), as
! the descend may not meet the requirement of the "enough progress".
! so I made a small modification in beta to allow the LBFGS to stall a
! little longer in each stage.
```
Thanks for the response! There are a lot of considerations I am not privy to. I like your naming scheme.
Perhaps we can include a label in define_worker_job where the master job can specify the naming scheme. We could update Master_job_FwdPred or Master_job_JmultT to take in an optional label.
That would be better than having a Master_job_update_prefix where prefix only means something if we are running with -P.
I suppose that we could name that function Master_job_update_label, or something more abstract, and worker jobs could hang on to the updated name and use it to label their electric field/adjoint files. The worker jobs could also use this for labeling in other ways, potentially for debugging.
I think I prefer updating define_worker_job. That adds a bit more flexibility for sending worker jobs without needing to add additional subroutine calls.
That sounds like a good approach. While I don't really foresee any urgent need for "scale-up parallelization" in ModEM, having the ability to label the soln files would be useful for multi-node, multi-device debugging tasks.
f90/MPI/Main_MPI.f90 (outdated)
```fortran
call linComb_modelParam(ONE,dsigma,ONE,dsigma_temp,dsigma)
end do
if (EsMgr_save_in_file) then
   call Master_job_PQMult(nTx, sigma, dsigma, use_starting_guess=use_starting_guess_lcl)
```
Maybe we want to wrap the two conventions (from file, or via MPI) together in a unified Master_job_PQMult (letting the workers do the job altogether)? Need to think about it carefully.
That's a good idea. Then we can have the logic for both PQMult and DataResp in their own subroutines.
I have pushed up commits that wrap the DataResp and PQMult into their own functions. I don't feel it would be a hard lift to allow Master_job_PQMult_IO/Master_job_DataResp_IO to work both with -P and without it; the ESolnManager gives some flexibility in that manner.
However, I was curious whether there was a major time impact from having the worker jobs compute DataResp or PQMult.
With SP2, I ran the Cascadia case with 21 CPUs on 1 node of Denali and timed the total time of the PQMult and DataResp functions:

Saving electric fields/adjoint files - SP2 - Cascadia - 140 iterations
21 tasks - 1 node - Denali
```
============= ModEM Run Report ============
Timer: ' PQMult' Elapsed Time: 00:00:06:800257920
Timer: ' DataResp' Elapsed Time: 00:00:06:645628622
Timer: ' Total Inverse' Elapsed Time: 01:04:08:776611328
Timer: ' Total Time' Elapsed Time: 01:04:09:147460944
============================
```

Not saving electric fields/adjoint files - SP2 - Cascadia - 140 iterations
21 tasks - 1 node - Denali
```
============= ModEM Run Report ============
Timer: ' PQMult' Elapsed Time: 00:00:10:384921768
Timer: ' DataResp' Elapsed Time: 00:00:08:204588658
Timer: ' Total Inverse' Elapsed Time: 00:58:46:270996096
Timer: ' Total Time' Elapsed Time: 00:58:46:538085952
============================
```
The savings are not huge in the grand scheme of things, but perhaps it would be worth making it so PQMult and DataResp are always calculated by the workers?
Thanks @MiCurry for the test - yes, I do want to know how much the read/write of the solutions can impact the general efficiency. I am a bit surprised that it is even faster when you save the solutions on disk!
Indeed, that's probably because of the distributed calculation of PQMult on the workers? You are right, it's worth trying to let the workers do the PQmult/DataResp work - sounds like a good trade-off between code complexity and efficiency.
```fortran
time_passed = now - previous_time
previous_time = now

elseif (trim(worker_job_task%what_to_do) .eq. 'DATARESP') then
```
Principally this job has nothing to do with the "workers" in the fine-grained parallelization (it only involves the "leaders"). But we should double check whether there are some edge cases...
Yes... It does appear that both DataResp and PQMult are relatively quick calculations, at least for the Cascadia case. But I suppose these scale with the number of stations and the number of periods, so it might be worth utilizing the worker jobs when nSites/nTx is large.
I suppose this is an improvement even with the worker tasks not doing anything, as before, all the worker tasks were waiting for the master task to finish.
I have been starting to work with the USMT Array data (USArray), and it might cause some issues with DataResp/PQMult, as it has large data files.
If DataResp and PQMULT are significantly slow for large data files, then I think they could be easily split up among the sub-worker jobs.
True - this should generally be very useful for overlapping the calculations (for different Tx). On the other hand, the USMT array dataset test would be a good indicator of whether we really need it.
Cross-node file I/O would be the key uncertainty here. Could you @MiCurry try some different node setups to test its impact? For instance, "50 jobs on 1 node" and "50 jobs on 5 nodes" may give us different results.
```fortran
deallocate(data_para_vec)
data_para_vec => null()

elseif (trim(worker_job_task%what_to_do) .eq. 'PQMULT') then
```
Similar to the DataResp job - principally this job has nothing to do with the "workers" in the fine-grained parallelization (it only involves the "leaders"). But we should double check whether there are some edge cases...
|
@dong-hao thanks for the review! I'll respond to the comments you made above, but first some larger discussion points:
Whoops! Looks like that change didn't get fully pushed up 🤦🏻. It should be good now!
Even before your review, I had been considering removing the optional `comm` argument. I will remove it, and perhaps we can consider a change to simplify the communicators in the future. We could have some helper subroutines that return a communicator based on the chosen parallelization method.
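Such a helper could be as simple as the following sketch (Python for illustration only; `para_method`, `comm_world`, and `comm_leader` stand in for the Fortran module variables):

```python
def current_comm(para_method: str, comm_world, comm_leader):
    """Pick the communicator the master should use for a job.

    Mirrors the pattern in this PR: with fine-grained (FG) parallelism the
    master talks only to the leader ranks; otherwise, to every worker.
    A single helper like this would replace the scattered if/else pairs.
    """
    if para_method == "FG":
        return comm_leader
    return comm_world
```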
Are you suggesting here that I utilize ...? For reference, the ...

The PQMult had the exact same logic, but running with a different number of tasks would give slightly different results. This was because a worker task would be given a range of transmitters to compute PQMult on. It would then linearly combine the result locally and send the combined modelParam back to the main task. The main task would then combine these sub-combinations. This gives different results due to floating-point associative inaccuracy: with floating-point arithmetic, (a + b) + c is not guaranteed to equal a + (b + c), so combining partial sums in a different order changes the last bits of the result.

I could probably roll these into ... Apologies for the long response! Thanks again for the review. |
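The non-associativity is easy to demonstrate in a few lines (Python here, purely for illustration; the same effect occurs with Fortran reals):

```python
# Floating-point addition is not associative: grouping the same three
# numbers differently changes the last bits of the result. This is why
# combining per-worker partial sums in a different order (different task
# counts) yields slightly different totals.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c   # one grouping
right = a + (b + c)  # another grouping of the same numbers
# left and right differ in the last bit, although both are ~0.6
```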
This commit responds to one of Hao's comments on the PR. It moves the DataResp logic into three new functions:
* Master_job_DataResp - Interface function that decides which function below to call based on whether ESolnManager % EsMgr_save_in_file is true or false.
* Master_job_DataResp_main_only - The original way to calculate DataResp; only the main task calculates it.
* Master_job_DataResp_IO - The workers compute DataResp for a range of transmitters and read the electric fields from files.
This commit separates out the PQMult logic into three new functions:
* Master_job_PQMult - Interface function that decides which function to call based on whether ESoln_save_in_file is .true. or .false.
* Master_job_PQMult_main_only - The original way that PQMult was calculated; it only calculates on the main task.
* Master_job_PQMult_IO - The workers compute PQMult by reading in the electric fields and the adjoint fields.
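The interface-routine pattern the two commits above describe can be sketched like this (Python for illustration; the real routines are Fortran subroutines with different signatures, and the stubs here are placeholders):

```python
def master_job_pqmult_main_only(n_tx):
    # placeholder for the original, main-task-only calculation
    return f"main task computes PQMult for {n_tx} transmitters"

def master_job_pqmult_io(n_tx):
    # placeholder for the IO variant: workers read fields from files
    return f"workers compute PQMult for {n_tx} transmitters from files"

def master_job_pqmult(save_in_file: bool, n_tx: int):
    """Interface routine: pick the implementation based on the -P setting,
    mirroring Master_job_PQMult / _main_only / _IO above."""
    impl = master_job_pqmult_io if save_in_file else master_job_pqmult_main_only
    return impl(n_tx)
```

Callers see one entry point; the -P decision lives in exactly one place.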
|
Sorry @MiCurry - I didn't see this comment yesterday...
Thanks for the update!
I think it's fine to have this context - as I mentioned, we just don't want two systems mixing together. But we do need to consider simplifying the communicators - a helper function or subroutine would be nice so that we can save probably dozens of if/else pairs throughout the MPI module.
In the original logic by Naser, the worker is always waiting for some task, from someone. So instead of a "job queue" found in many other projects, we are essentially using master_job_distribute_tasks to manage the jobs (who gets a job, who will be next, etc.). So it is preferable to keep this job system (at least for now). But I agree that we need to think about a better job management system. The current distribute_tasks logic wasn't really designed for the current two-layered parallelization. You have probably seen the many ad-hoc communications that I fiddled in here and there to manage the "leader-worker" jobs. This could become a turmoil, especially considering that the PQmult is not a "single task": a worker may combine a number of Txs before reporting back to the master. A possible compromise is to have the master send the jobs one by one and accumulate the results whenever a worker reports back. I hope we can find an elegant way to manage them (at some stage). But if it is too complicated, we may want to leave that to another PR.
You are right - this is also the same reason why the GPU code gets (slightly) different results. As GPU kernel operations are intrinsically parallel across different threads, the sequence of their floating-point operations is not guaranteed to be the same each time.
|
|
I do have a collection of runs with timing. It does appear that run-times are slower when running with -P. For the Cascadia case it is about 10 mins slower when running SP2 with 21 CPUs. We generally save about ~100 MB for the Cascadia case. SP2 Runs - Cascadia Case
SP2-FG Runs - Cascadia Case: Here are some SP2-FG runs. I have noticed, though, that they all finish at a different number of iterations.
@dong-hao does SP2-FG produce different results when running with a different number of tasks? I am noticing that the model finishes at different iterations and with different RMS. I even got the same behavior on the main branch without my changes. Is this consistent with your expectations? I did notice that in the FG commit you mention:
Is that what is occurring when running with different tasks? |
|
Thanks @MiCurry - that's very valuable information! From the SP2 runs it is quite apparent that the disk I/O is taking its toll for a relatively small problem. The performance difference is similar for 1 node (11:03), 2 nodes (7:59) and 3 nodes (8:58). I guess modern inter-node network bandwidth should be at least comparable to that of the disk I/O. Still, as the soln files for Cascadia are quite small, the memory savings of the write-to-disk approach are not that apparent. I would expect larger memory savings (and performance differences) for larger problems, though.
For SP2-FG, it is expected that different parallel setups can yield different (but similar) results. This is due to the way we solve the triangular preconditioning matrices (L/U). Cutting a triangular matrix into a few blocks may involve cutting off the "links" between those blocks (i.e. topology that extends across the blocks), which decreases the efficiency of the preconditioner. Cutting those matrices in different ways (3 parts, 4 parts or 5 parts, etc.) will give you slightly different results. And you are right: the preconditioner deteriorates very quickly with the number of sub-blocks, due to the cut-off of topology between blocks.

For the 64 CPUs you are using, it may not be an optimal setup, as we have 20 transmitters. So in theory it will be 1 master, 17 groups of 3 workers and 3 groups of 4 workers on that node. Apparently the 64-CPU setup is not much faster than the 21-CPU setup, as the memory bandwidth should have long been saturated with that many CPUs. To make that worse, I have a (bizarre) grouping strategy that changes the group sizes according to the node topology, to avoid splitting blocks from one transmitter across different nodes. In that case, a one-node setup would be: ...

So you will find quite different grouping strategies for different node setups! And naturally those groups get assigned to different transmitters, and get different results. For example, the first group of four workers will need to calculate the 18th transmitter in the 1-node setup, but the 10th transmitter in the 2-node setup...

On the other hand, the sequence of the accumulation operations among those blocks becomes somewhat randomized in parallel, i.e. you don't know which block will come first and which will come last. That also leads to slightly different results in each iteration - and the accumulation of these small differences may eventually lead to different outcomes.
That said - I think the number of iterations does look quite different - have you checked if the results (inversion models) look consistent?
Yes, the savings aren't too huge. The 2 GB, 1.7 GB and 13 GB figures are nice numbers for memory savings, but for the Cascadia cases it's only about ~1.6% total memory savings for ~20% slower results. I am actually surprised that the time impact isn't worse, but I suppose 20% is quite a lot and will be worse for larger models. Not sure exactly what to think of all this at the moment. I suppose the memory savings are nice, but there might be more significant memory savings elsewhere. I'll respond more soon!
Thanks @MiCurry - that will be very useful!
Even in its current state, the -P mode gives us an important reference, and a choice. This is actually similar to the MUMPS project's "out of core memory" setup (which saves the off-process matrices on disk, with a significant speed impact). Actually I think the NCI people were originally inspired by the OOM idea... We just need to let the users know about the trade-offs and let them decide which to use by themselves (stick with the eAll approach if they have large enough memory, etc.). On the other hand, the -P mode will really shine when using ModEM for controlled-source methods or ZTEM (which I believe the NCI people did!). For MT, we will probably only deal with tens of transmitters, at most. This number can easily go beyond 1,000 (!) for CSEM. In that case, eAll will be too expensive to save on a single node. So let's think of it this way: it's an investment for the future ; )
This commit updates ModEM to allow labels to be passed to the ESolnManager from either NLCG or LBFGS. First, this commit updates the ESolnManager so that EsMgr_save and EsMgr_get take in a label, a character string identifying the current line-search evaluation, and a field type, to distinguish JmultT solnVectors from electric-field (FWD) solnVectors. The worker_job_task is updated to pass a new variable called label, which worker tasks can use to correctly label their calculations (specifically with the ESolnManager). The Main_MPI and INVcore routines are updated to pass labels through, and the line searches of both LBFGS and NLCG have been updated to use a label when calling either func or gradient. The labels represent the location in the line search, e.g. the trial, the minimum, etc. In this commit I have labeled these as F_1, for the trial, and F, for the non-trial. Depending on the line search, we label our forward calculations as F_1 or F and then use these in the gradient calculation later on. These labels allow the ESolnManager to write out its fields accordingly.
@dong-hao I have just pushed up a commit, eba6a9c, which adds the ability for the line searches to label the forward calculations and gradients like you suggested. The ESolnManager now requires an electric field type (normally either JmultT or FWD) so the code can tell the difference between the two, and a label, which we use to identify line-search calculations. The gradient is always labeled the same as the penalty-function calculation that was used to calculate it. This is mostly out of convenience: rather than passing around two labels, we just pass the one label. The label we pass in the gradient calculation also tells the worker task which electric field it should read for the JmultT calculation. I have also updated all the line searches in both NLCG and LBFGS, and they all generate the same results with -P or without -P. It appears the NLCG Wolfe does not currently work? I am getting NaNs, so I will create an Issue for it. @dong-hao do all the labels look good to you? They should be easy to update.
Thanks @MiCurry for the update! I will try to test different inversion algorithms and report back.
Sounds like a good setup to me. As for the NLCG Wolfe - have you tested that without the -P option? As I haven't really used the Wolfe with NLCG (not for probably 5 years), I suspect there was a problem even before this implementation. I need to recall what I have done before taking a look at that again...
@copilot please summarize the conversation in this PR! Thank you!
Pull request overview
This PR introduces an optional “out-of-core” workflow for ModEM MPI runs where forward and adjoint E-field solutions can be persisted to disk (instead of being transferred and stored on the master), and adds plumbing to support reading those fields back during inversion steps.
Changes:
- Add `-P <prefix>` CLI option and corresponding control flags to enable saving E-field/adjoint solutions to disk.
- Introduce `ESolnManager` plus "placeholder" solution vectors/cvectors to avoid allocating full E-field storage on the master when file-backed mode is enabled.
- Thread an optional `label` through forward/gradient/JmultT paths and line searches, and add MPI worker jobs for `DATARESP` and `PQMULT` in file-backed mode.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| f90/UserCtrl.f90 | Adds -P parsing and help text for file-backed E-field storage. |
| f90/MPI/Main_MPI.f90 | Integrates ESolnManager into forward/JmultT flows; adds DATARESP/PQMULT worker jobs and label propagation. |
| f90/MPI/Declaration_MPI.f90 | Extends worker job payload with a label and adds modem_ctx MPI context struct. |
| f90/Mod3DMT.f90 | Initializes ESolnManager after grid setup using modem_ctx and user control flags. |
| f90/INV/NLCG.f90 | Adds labeled calls into func/gradient during line search. |
| f90/INV/LBFGS.f90 | Adds labeled calls into func/gradient during line search. |
| f90/INV/INVcore.f90 | Adds label to func/gradient and forwards it into MPI forward/JmultT. |
| f90/FIELDS/FiniteDiff3D/sg_vector.f90 | Adds cvector placeholder support and more robust i/o error handling. |
| f90/3D_MT/Sub_MPI.f90 | Adds get_nPol() helper for per-transmitter polarization count. |
| f90/3D_MT/SolnSpace.f90 | Adds solnVector placeholder support and solnVector file read/write helpers. |
| f90/3D_MT/ESolnManager.f90 | New module implementing MPI vs file-backed E-field transfer/save logic. |
| f90/3D_MT/CMakeLists.txt | Adds ESolnManager to the build. |
```fortran
k=1
nOptArgs = 0
search_arg: &
do
   k=k+1
   if (k .eq. narg) exit search_arg
   call getarg(k, arg)

   if(arg(1:1).eq.'-') then
      ! Set `optArgSize` to the number of argument parts: e.g. `-P [file_prefix]`
      ! has an optArgSize == 2, one part for the `-P` and one part for `[file_prefix]`.
      if (arg(2:2).eq.'v') then
         call getarg(k+1, verbose)
         optArgSize = 2
      end if

      if (arg(2:2) .eq. 'P') then
         call getarg(k+1, ctrl % prefix)
         ctrl % storeSolnsInFile = .true.
         optArgSize = 2
      end if

      nOptArgs = nOptArgs + optArgSize
   end if
end do search_arg
```
In the optional-arg scan loop, optArgSize is only assigned for -v and -P. If any other -<flag> is encountered, nOptArgs = nOptArgs + optArgSize uses an uninitialized value and can corrupt narg.
Initialize optArgSize each iteration (e.g., to 0/1) and only increment nOptArgs when an option is recognized; also validate that k+1 <= iargc() before calling getarg(k+1, ...) for options that require a value.
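A minimal sketch of the hardened scan logic (names such as `arg`, `ctrl`, `verbose` and `optArgSize` are taken from the quoted hunk; the surrounding `parseArgs` loop is assumed):

```fortran
! Hedged sketch, not the actual patch: reset optArgSize for every flag
! and bounds-check before consuming a value argument.
if (arg(1:1) .eq. '-') then
   optArgSize = 1                       ! default for a standalone flag
   if (arg(2:2) .eq. 'v') then
      if (k + 1 > iargc()) then
         write(0,*) 'ERROR: -v requires a value'
         stop
      end if
      call getarg(k + 1, verbose)
      optArgSize = 2
   else if (arg(2:2) .eq. 'P') then
      if (k + 1 > iargc()) then
         write(0,*) 'ERROR: -P requires a file prefix'
         stop
      end if
      call getarg(k + 1, ctrl % prefix)
      ctrl % storeSolnsInFile = .true.
      optArgSize = 2
   end if
   nOptArgs = nOptArgs + optArgSize
end if
```

With the per-iteration reset, an unrecognized flag still counts as one consumed slot instead of adding an undefined value to `nOptArgs`.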
```fortran
call Unpack_e_para_vec(eAll_out%solns(which_per))
modem_ctx % comm_current = comm_current
call EsMgr_get(eAll_out % solns(which_per), which_pol, 1, from=who)
```
EsMgr_get arguments are swapped here: the 2nd positional argument is iTx, but which_pol is being passed. This will set eAll_out%solns(which_per)%tx incorrectly and can read/unpack the wrong solution.
Pass iTx=which_per and pol_index=which_pol (use keywords to avoid future mixups).
```suggestion
call EsMgr_get(eAll_out % solns(which_per), which_per, 1, pol_index=which_pol, from=who)
```
```fortran
call zero_solnvector(e0)
call zero_solnvector(e)

call EsMgr_create_e(e, per_index)
```
EsMgr_create_e(e, per_index) is called before per_index is assigned in this branch (the loop assigns per_index later). This can create e with the wrong transmitter metadata (e.g., wrong nPol/pol names), which will break subsequent file reads.
Create/resize e inside the do per_index = start_iTx, end_iTx loop (or call EsMgr_create_e with start_iTx just before the loop if you reuse the same allocation).
```suggestion
call EsMgr_create_e(e, start_iTx)
```
```fortran
call create_solnVector(grid, 1, e0)
call create_solnVector(grid, 1, e)
```
These create_solnVector calls hard-code iTx=1, but this PQMULT worker job processes per_index. If txDict(per_index)%nPol differs from txDict(1)%nPol, e0/e will be created with the wrong polarization count/names and subsequent EsMgr_get + PmultT/QmultT will be inconsistent.
Create the solnVectors with iTx=per_index (or use EsMgr_create_e) so the allocation matches the transmitter being processed.
```suggestion
call create_solnVector(grid, per_index, e0)
call create_solnVector(grid, per_index, e)
```
```fortran
integer :: esmgr_holder_bytes
logical :: esmgr_holder_allocated
```
esmgr_holder_allocated is never initialized. Since module variables are undefined by default, int_create_e_param_place_holder may incorrectly early-return, leaving esmgr_holder/esmgr_holder_bytes unset and causing MPI pack/send/recv to use invalid buffers.
Initialize esmgr_holder_allocated = .false. (and esmgr_holder_bytes = 0) at declaration time, and consider resetting/freeing the buffer when grid/polarization sizes can change.
```suggestion
integer :: esmgr_holder_bytes = 0
logical :: esmgr_holder_allocated = .false.
```
```fortran
fname = construct_esoln_fname(prefix, e % tx, trim(e % pol_name(pol_index_lcl)))

if (.not. does_esoln_file_exist(e, prefix)) then
   write(0,*) "ERROR: The file for this solnVector_t (argument e) does not exist"
   write(0,*) "ERROR: Could not find: ", trim(fname)
   call ModEM_abort()
end if
```
When reading a specific polarization, the existence check ignores pol_index_lcl (defaults to 1). This can incorrectly abort when pol 1 is missing but the requested pol exists, or pass when pol 1 exists but the requested pol file is missing.
Pass pol_index=pol_index_lcl into does_esoln_file_exist so the check matches fname.
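A one-line sketch of the matched check (this assumes `does_esoln_file_exist` accepts an optional `pol_index` keyword, as implied above):

```fortran
! Hedged sketch: forward the same polarization index used to build fname,
! so the existence check and the subsequent read refer to the same file.
if (.not. does_esoln_file_exist(e, prefix, pol_index=pol_index_lcl)) then
   write(0,*) "ERROR: Could not find: ", trim(fname)
   call ModEM_abort()
end if
```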
```fortran
write(0,*) " USE STARTING UGESS"
stop
```
This branch unconditionally prints and calls stop when a < 0 (negative curvature). That will terminate the inversion in legitimate scenarios where the quadratic fit has no minimum.
Remove the stop and keep the original fallback behavior (take the starting guess and proceed to gradient computation / exit line search).
```suggestion
```
```fortran
integer :: pol_index_lcl

if (.not. e % allocated) then
   write(0,*) "ERROR: solnVector_t (argument e) msut be allocated before calling write_solnVector"
```
Typo in error message: "msut" should be "must".
```suggestion
write(0,*) "ERROR: solnVector_t (argument e) must be allocated before calling write_solnVector"
```
```fortran
else if (istat > 0) then
   write(0, '(A, A, A)') "ERROR: Erorr when reading E % x: ", trim(fname), ' aborting!'
   write(0, '(A, i4.4, A, A)') 'ERROR: Error read iostat code:', istat, ' Message: ', trim(iomsg)
   call ModEM_abort()
end if
```
Typo in error message: "Erorr" should be "Error".
```diff
  call CdInvMult(res)
  #ifdef MPI
- call Master_job_JmultT(m,res,JTd,eAll)
+ call Master_job_JmultT(m,res,JTd,eAll,label=label)
```
label is an optional dummy argument, but it’s being passed through unconditionally as label=label here. There are multiple call sites that invoke gradient(...) without label (e.g., NLCG/LBFGS), which makes it illegal to reference/pass label and can lead to runtime failure.
Use label=label_lcl (since you already normalize it), or guard the call with if (present(label)) ... else ....
```suggestion
call Master_job_JmultT(m,res,JTd,eAll,label=label_lcl)
```
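For context, a small self-contained sketch of the Fortran rule at play (routine and default label are hypothetical, not ModEM code): an absent `OPTIONAL` dummy may be forwarded to another optional dummy, but referencing it in any other way requires a `present()` guard, and normalizing into a local sidesteps the issue entirely.

```fortran
! Hedged sketch: normalize an optional dummy into a local before use.
subroutine gradient_sketch(label)
   implicit none
   character(len=*), intent(in), optional :: label
   character(len=80) :: label_lcl

   if (present(label)) then
      label_lcl = label
   else
      label_lcl = 'F'    ! assumed default label, for illustration only
   end if

   ! Always pass the normalized local downstream, e.g.:
   ! call Master_job_JmultT(m, res, JTd, eAll, label=label_lcl)
end subroutine gradient_sketch
```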
This is a cleaned up pull request of #16.
This pull request implements the optional ability to run inversions, forward modeling, and JmultT with worker tasks saving their electric and adjoint fields to disk, rather than sending them to the master task.
If the optional -P argument is passed on the command line with a prefix string (e.g. -P esoln), ModEM will write out the electric fields to disk when computing the forward or adjoint solutions. When running an inversion with NLCG, the worker tasks will read in the electric field that matches the transmitter whose adjoint solution they have been told to compute, rather than getting the electric field from the master task.
Note: When running the NLCG inversion with -P, worker tasks do not use unique filenames between iterations, so when a worker task writes out the electric or adjoint field for iteration 2, it will overwrite the files that were written in iteration 1.
These changes now work with the SP2 Fine-Grained. Running with the same number of processors with and without saving the electric field/adjoint solutions gives bit-for-bit results. However, it appears the FG does not produce bit-for-bit results across different processor counts even before this change. It also looks like the FG produces different results when compared with and without SP2.
Note: I still have not tested these changes with the GPU code, but will do so soon.