Add the ability to save electric and adjoint fields to disk (ESolnManager) #42
MiCurry wants to merge 21 commits into magnetotellurics:main
Conversation
This commit updates UserCtrl to allow a new, final, optional argument: -P. This argument takes a character string, [file_prefix]. When it is passed to ModEM, ModEM sets `ctrl % storeSolnsInFile` to `.true.` and sets `ctrl % prefix` to [file_prefix]. This commit also updates the `search_args` do loop in `parseArgs` so it better handles additional final optional arguments; it should now be easy to expand it to accept more optional arguments. Just ensure you set `optArgSize` correctly: for single, standalone options without an argument, `optArgSize` should be 1; options that take an argument should set `optArgSize` to 2. This commit also updates the user guide with the details of this new optional argument for inversion.
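The `optArgSize` idea can be sketched as follows (a Python illustration, not the Fortran parseArgs code; the dictionary and variable names are assumptions):

```python
# Sketch of the optArgSize pattern: the parse loop advances by each
# option's argument count, so a new trailing optional argument only
# needs a correct size entry.
OPT_ARG_SIZE = {"-P": 2}   # option plus one argument; standalone flags -> 1

def parse_args(argv):
    ctrl = {"storeSolnsInFile": False, "prefix": None}
    i = 0
    while i < len(argv):
        opt = argv[i]
        size = OPT_ARG_SIZE.get(opt, 1)
        if opt == "-P":
            ctrl["storeSolnsInFile"] = True
            ctrl["prefix"] = argv[i + 1]   # the [file_prefix] argument
        i += size                          # skip past option and its argument
    return ctrl
```

Adding another option with an argument would then only require one more entry in `OPT_ARG_SIZE` plus a handler branch.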
This commit creates three new routines in SolnSpace that allow reading and writing of solnVector_t types. The routines read and write the active transmitter and the active polarization of the passed-in solnVector_t. These routines will be used to save partial solutions; thus, their interface is perhaps a bit limited. It may be appropriate to have a routine that takes the desired iTx and polarization and allocates a solnVector_t, rather than requiring that it already be allocated and created. I'll think about that, but for now this should be sufficient.
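The read/write pattern these routines follow can be sketched like this (Python with `pickle` standing in for the Fortran binary I/O; the names are hypothetical, not the SolnSpace signatures):

```python
import pickle

def write_soln(path, i_tx, pol, field):
    """Write only the active transmitter/polarization of a solution vector,
    mirroring the idea that these routines save partial solutions."""
    with open(path, "wb") as f:
        pickle.dump({"i_tx": i_tx, "pol": pol, "field": field}, f)

def read_soln(path):
    """Read a partial solution back; the caller supplies the container,
    as the PR's routines currently require a created solnVector_t."""
    with open(path, "rb") as f:
        return pickle.load(f)
```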
This commit creates a ModEM_mpi_context_t type which will be used to store and pass around MPI communicators to the ESolnManager. If it works well, perhaps in future merges it can be used in other areas of ModEM. As of this commit, the ModEM_mpi_context_t is not used, but it will be in a future commit.
This commit allows `solnVectors_t` and `cvectors_t` to be created as 'placeholders'. These 'placeholder' `solnVectors_t` and `cvectors_t` will only contain the metadata describing the electric solution and won't contain any data. In conjunction with the `ESolnManager`, this will allow the main task to *not* allocate eAll, saving a bit of unnecessary memory.
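The placeholder idea can be sketched as follows (a Python illustration with hypothetical names; the real types are Fortran derived types):

```python
class SolnVectorPlaceholder:
    """Metadata-only stand-in for a solution vector.

    Mirrors the idea in this commit: the main task keeps the transmitter
    index, polarization count, and grid shape, but never allocates the
    (potentially large) field array itself.
    """
    def __init__(self, i_tx, n_pol, grid_shape):
        self.i_tx = i_tx
        self.n_pol = n_pol
        self.grid_shape = grid_shape
        self.data = None          # no allocation: this is the memory saving

    def nbytes_if_allocated(self):
        """Bytes a full allocation would have cost (16 per double complex)."""
        n = self.n_pol
        for d in self.grid_shape:
            n *= d
        return 16 * n
```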
This commit simply adds the ESolnManager to the ModEM code base and adds it to the CMakeLists.txt. The ESolnManager will serve as a manager of electric and adjoint fields for the main task and the worker tasks. Depending on the arguments passed in, it will either have the worker tasks communicate the electric and adjoint fields by writing and reading them to disk, or it will send the fields to and from the main task using MPI (the current ModEM implementation). Perhaps in a future commit the ESolnManager will be responsible for allocating and deallocating the solnVectors_t.
This commit adds error checking when reading and writing cvectors. Now that these routines will be used more often, it will be beneficial to ensure that these routines complete successfully.
This commit adds a 'trial' logical to the define_worker_job type. This logical will be used in future commits to signal to worker tasks that a trial is being calculated. This will give the ESolnManager the ability to save files with an appropriate filename (e.g. labeling them as 'trial').
This commit adds the PQMult and DataResp master and worker jobs. These routines will be used with the ESolnManager implementation. It also updates create_data_vec_place_holder and associated routines so that they can take a range of data instead of just one.
This commit initializes the ModEM context in Constructor_MPI.
This commit adds a new flag to func and to gradient. The flag in func can be used to signal to the worker tasks that a trial forward is being calculated; the worker_job will later use this to write out its electric field to a file with the appropriate prefix. When we use the initial guess in the line search, we will also need to tell the JMultT calculations to read the trial electric field, rather than the fields from the line search computations. Thus, we need to tell gradient, and JmultT, to use the starting guess. Worker jobs will then know whether they should read from the trial file or not.
This commit finally integrates the ESolnManager into ModEM. As of this commit, it generates bit-for-bit results when running with or without saving the electric field solutions.
|
This branch works with the SP and SP2 forward formulations; however, it does not work if you use the Fine-Grained (FG) version of SP2. I'll look into resolving that. |
This commit updates the Main_MPI code that was added in previous commits so that the new MPI code associated with the ESolnManager works with SP2 FG. To do this, it updates DataResp and PQMult to post their MPI sends and receives not on MPI_COMM_WORLD, but on either comm_world or comm_leader, depending on the para_method. For both DATARESP and PQMULT, the work is sent to the leader tasks; because these tasks are rather simple and fast, the leader tasks do not send the worker tasks anything. Running with FG on equal processor counts, with or without -P, gives bit-for-bit results. It appears that running with FG at different processor counts does not give bit-for-bit results; this may be an issue that predates this work. Running without FG at different processor counts gives the same results, but there are differences when running with and without FG. I'll open an issue and investigate that further.
Force-pushed from 2ec752e to 593076d
|
I think this PR is ready for a review! It now works with MF, SP, SP2, and SP2 Fine-Grained! See the note in the original comment about the FG method not producing bit-for-bit results; I'll open an issue about that. I still have not tested this when running with GPU, but I will do that and add any updates. Since this works with all the forward formulations and the FG branch, I think it's ready for a review from others! @akelbert and @dong-hao feel free to give it a review if you have the time! |
|
Also, it's worth noting that I still haven't done much evaluation of the impact of the IO when all the tasks are writing out. I have done some preliminary memory usage comparisons, but I want to do some more formal memory consumption evaluations. I'll evaluate the runtime impact of the IO when I do that evaluation. |
f90/MPI/Main_MPI.f90 (outdated)
```fortran
remainder = modulo(nTx, size_leader)
iTx_max = 0

do dest = 1, size_leader
```
I think this should be size_leader - 1?
indeed, but we should think about how we can simplify the comm-related parameters.
f90/MPI/Main_MPI.f90 (outdated)
```fortran
cycle
end if

do dest = 1, size_leader
```
I think this should be size_leader - 1?
indeed, but we should think about how we can simplify the comm-related parameters.
f90/MPI/Main_MPI.f90 (outdated)
```fortran
comm_current = comm_leader
end if

modem_ctx % comm_current = comm
```
I think modem_ctx % comm_current should be set to comm_current
f90/MPI/Main_MPI.f90 (outdated)
```diff
 !----------------------------------------------------------------------------
 !########################## Master_job_DataResp ############################
-subroutine Master_job_DataResp(nTx, sigma, d, trial)
+subroutine Master_job_DataResp(nTx, sigma, d, trial, comm)
```
This looks strange here: we are passing in comm, but perhaps not using it correctly. See the comment below. Do we even need to pass in comm here?
Thanks @MiCurry for the update!
Hmmm, normally we don't need an MPI_COMM input, as it is a module parameter that can be accessed by anyone in the module. I think it would generally be safer to just use comm_current from the module.
There are rare exceptions, of course, when doing fine-grained parallel calculations, as the worker processes need to coordinate with each other outside the Main MPI module.
I can take a look at this, but it may take some time, as I need to finish a report tomorrow.
Thanks for the comment @dong-hao and info!
It will be great to have your review, but don't feel rushed. We're not working under any deadlines, so there is no rush.
The last commit to allow SP2 to work broke the DataResp when using 2, 3, and 4 procs, and potentially other counts. The culprit was using size_leader instead of size_leader - 1 in places.
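The off-by-one can be illustrated with a sketch (Python, with a hypothetical helper, not the actual Main_MPI code): with the master on rank 0 of a communicator of size `size_leader`, the valid destination ranks are 1 through `size_leader - 1`:

```python
def distribute_tx(n_tx: int, size_leader: int):
    """Assign transmitters round-robin to worker ranks 1..size_leader-1.

    Rank 0 is the master, so looping `dest` up to size_leader inclusive
    (as Fortran's `do dest = 1, size_leader` does) would address a rank
    that does not exist -- the off-by-one fixed in this PR.
    """
    n_workers = size_leader - 1          # the master does not take a job
    assignments = {dest: [] for dest in range(1, size_leader)}
    for i_tx in range(n_tx):
        dest = 1 + (i_tx % n_workers)    # valid ranks are 1..size_leader-1
        assignments[dest].append(i_tx)
    return assignments
```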
Force-pushed from 06dd68a to 5dfdd81
|
I have confirmed that this branch works when building and running with CUDA. |
dong-hao left a comment:
Thanks @MiCurry for the update - I finally got some time to test the code.
I can confirm that the -P mode runs without problem for both the SP and the MF, at least for the "optimal parallel setup", i.e. using N+1 processes for N jobs. For the Cascadia case that is 21/20.
However, in some (not all that extreme) edge cases, it threw errors when doing the adjoint calculation:
```
node[001]: total time elapsed (sec): 17.302900
FORWARD: Receive Per # 9 and Pol 2 from 1
FORWARD: TX finished # 20 out of 20
node[001]: MPI TASK [ DATARESP] received from 0
node[003]: MPI TASK [ DATARESP] received from 0
node[002]: MPI TASK [ DATARESP] received from 0
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
Not reading the trial
node[002]: Waiting for a message from Master
node[002]: total time elapsed (sec): 15.374959
node[003]: Waiting for a message from Master
node[003]: total time elapsed (sec): 17.953275
node[001]: Waiting for a message from Master
node[001]: total time elapsed (sec): 17.302900
[Haos-Mac-mini:00000] *** An error occurred in MPI_Recv
[Haos-Mac-mini:00000] *** reported by process [2488401921,0]
[Haos-Mac-mini:00000] *** on communicator MPI_COMM_WORLD
[Haos-Mac-mini:00000] *** MPI_ERR_RANK: invalid rank
[Haos-Mac-mini:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[Haos-Mac-mini:00000] *** and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
prterun has exited due to process rank 0 with PID 0 on node Haos-Mac-mini calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
--------------------------------------------------------------------------
```
I was using 4 processes on my Mac mini for 20 jobs in this case. Apparently, as @MiCurry pointed out, size_leader - 1 should be used (instead of size_leader) when the master sends out the DataResp and PQmult jobs to the leaders/workers. The adjoint calculation runs without problem after this correction.
I guess the 21/20 setup worked by sheer luck as the default size_leader = size_world ...
On the other hand, we are now mixing the usage of different communicators. We need to think about the logic and clarify the usage of those (comm_world, comm_leader, comm_local, etc.) across the related modules.
Also, we need to follow the current logic of the job system for the new jobs (DataResp and PQmult). That said, we may want to reconsider this framework; a more consistent and flexible job system would be preferable. In the current logic there are quite a few compromises here and there.
Hao
```fortran
type(define_worker_job), save :: worker_job_task
!********************************************************************

type :: ModEM_mpi_context_t
```
If we want to unify the MPI-related parameters with the datatype MODEM_MPI_CONTEXT_T here, we probably also want to remove the scattered comm_world, comm_local, etc. in the declaration. We also need to include the comm_leader-related handles in the data type.
We can leave that to another PR, though.
```fortran
nTx = d1%nTx
starttime = MPI_Wtime()
! over-ride the default communicator, if needed
if (present(comm)) then ! given communicator
```
The optional "comm" here (and in other forward/adjoint-related subroutines) is relevant to the fine-grained methods (FG and PETSc). Essentially it just tells the subroutine "who calls you" and "whom you should communicate with"; it is not easy to determine that without the input.
I hope we can remove the comm-related trouble (which also causes some extra logic in the EMsolve3D module). But we need to think carefully about that (and leave it to another PR)!
f90/MPI/Main_MPI.f90 (outdated)
```fortran
remainder = modulo(nTx, size_leader - 1)
iTx_max = 0

do dest = 1, size_leader - 1
```
I think it is probably better to use comm_current/size_current to avoid any mismatches: the master can talk with all workers (comm_world) or only the leaders (comm_leader).
The MPI_SEND below is already using comm_current.
f90/MPI/Declaration_MPI.f90 (outdated)
```fortran
logical :: keep_E_soln=.false.
logical :: several_Tx=.false.
logical :: create_your_own_e0=.false.
logical :: trial=.false.
```
"trial" doesn't sound very informative...
Agreed, and it's used twice, to signal the task to perform two different, but related, events:

- Write out your electric field to a file with 'trial' appended to its filename.
  - This indicates that this forward computation is the trial line search calculation.
- Read the 'trial'-appended filename for your JmultT calculation.

The Stable-NCI branch does things a little differently to handle this:

- It has a `Master_job_update_prefix` to tell worker jobs to update the prefix written out for the electric field solution.
  - It does this when calculating the line search's trial, so we can backtrack if the trial is better than the line search.
- If the trial is a better choice in the line search, then the master job renames the 'trial.prefix.*.esoln' files to `prefix.*.esoln`.
- The worker tasks always read in `prefix.*.esoln`.

Perhaps that is a better way? Any thoughts @dong-hao?
Thanks @MiCurry for the thoughts. Either is fine, as those solutions are no longer used after the current iteration. The good part of the NCI approach is the (slightly) lower disk usage, plus saving the final model response for the last iteration as a bonus.
However, this ultimately also depends on which inversion algorithm/search method you are using, i.e. we may need to deal with solns from more than two forward (FWD) and one adjoint (JmultT) evaluations. I know that you have just implemented the NLCG part (that was the only inversion algorithm by the time the NCI people got involved), but I guess it is better to fit more "general" cases.
For Anna's quadratic line search method (default for NLCG), we need two FWD and one JmultT solutions - the "trial" convention would be good enough ("trial" is called "startup" in Anna's convention). This is also OK for my BFGS Wolfe line search (same two FWD and one JmultT solutions) and Wolfe2 line search (normally only one FWD and one JmultT solutions) for most of the time.
However, it is possible that the quadratic line search fails to satisfy the Armijo condition in NLCG, and falls back to cubic line search, which requires three FWD calculations. And for the Wolfe2 line search, it is also possible (~5 % chance) that the f_1 (FWD) and g_1 (JmultT) do not satisfy the Wolfe condition, leading to two FWD and two JmultT solutions.
In other words, our naming convention and the write/load mechanism will need to deal with three (or more) forward solutions and two JmultT solutions! NOTE that we have not considered DCG inversion (which needs even more JmultTs). We can leave that later...
So I think a straightforward idea is to just use numbers...
say, something like:
esoln.FWD01.iTx.0001.Ex.cvec
esoln.FWD02.iTx.0001.Ex.cvec
esoln.JmultT01.iTx.0001.Ex.cvec
esoln.JmultT02.iTx.0001.Ex.cvec
My grain of salt, of course. But we do need to think about that carefully - Not sure if @akelbert has a better idea.
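A minimal formatter for the numbered naming idea above might look like this (a Python sketch; the function name and field widths are assumptions, not an agreed convention):

```python
def esoln_filename(kind: str, n_eval: int, i_tx: int, component: str) -> str:
    """Build a numbered solution filename, e.g. esoln.FWD01.iTx.0001.Ex.cvec.

    `kind` is the evaluation type (FWD, JmultT), `n_eval` the evaluation
    counter within the iteration, `i_tx` the transmitter index, and
    `component` the field component (Ex, Ey, Ez).
    """
    return f"esoln.{kind}{n_eval:02d}.iTx.{i_tx:04d}.{component}.cvec"
```

Numbering the evaluations this way would cover the three-FWD/two-JmultT cases described above without any renaming step.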
For a detailed info of the Wolfe2 line search, see my notes in the BFGS algorithm:
```
! Note: inexact line searches ultimately fit into two catalogs -
! i.e. Armijo–Goldstein (back-tracking) and Wolfe conditions
! the latter also contains a few different criteria: (1) "Armijo
! rule", and (2) "curvature condition", the latter requires an extra
! gradient calculation at x_k + alpha_k*d_k
! (1) and modified (2) will form the "strong Wolfe" condition which is
! the line search "convergence criteria" here
! see lineSearchWolfe subroutine for details of Anna's line search method
! (Armijo)
!
! the really good part for LBFGS is, its search directions are sort of
! "normalized". so there is a pretty good chance that the first guess (one)
! is already a good search step that satisfies Wolfe's rule.
! For most of the time, it is not even necessary to do a line "search".
! Also, if one examines the LBFGS line search (as in lineSearchWolfe), it
! is quite often that the f at quadratic interpolation does not improve
! too much when compared with the first guess f_1.
!
! So here is my idea:
!
! when we have evaluated the penalty function for the initial guess (f_1),
! we immediately calculate the gradient (g_1) at the initial guess.
! 1) if f_1 and g_1 satisfy the Wolfe's rule, then we just proceed to the
! next iteration. The calculation cost is 1 penalty function evaluation
! and 1 gradient calculation. Everyone is happy(?)
!
! 2) if f_1 and g_1 do not satisfy Wolfe's rule, we use a quadratic interp
! to find a minimum (f), and calculate the gradient (g) at the point.
! if f and g satisfy the rule, then we proceed to the next iteration.
! The cost is 2 penalty function evaluations and 2 gradient calculations.
! this is actually worse than Anna's scheme (two f and one grad).
! but don't worry yet, normally this should be a rare case (<5 %)
!
! 3) if neither of the 1) nor 2) is satisfied, the scheme falls back to the
! bracketing and sectioning with my variant of More-Thuente
! scheme
!
! following Anna's idea, the major intention here is to use a cheap line
! search scheme (with merely 2 forward-like-calculations) to quickly skip
! to a small overall penalty function level and only to start bracketing
! when the quadratic interpolation doesn't work, in which case cubic
! probably won't work either...
!
! in practice, wolfe2 should be slower (as the line search is not exact)
! in convergence (but still faster in time as the line search scheme is
! 1/3 faster).
! this may lead to quicker stop (or the update of the damping factor), as
! the descend may not meet the requirement of the "enough progress".
! so I made a small modification in beta to allow the LBFGS to stall a
! little longer in each stage.
```
Thanks for the response! There are a lot of considerations I am not privy to. I like your naming scheme.
Perhaps we can include a label in define_worker_job where the master job can specify the naming scheme. We could update Master_job_FwdPred or Master_job_JmultT to take in an optional label.
That would be better than having a Master_job_update_prefix where prefix only means something if we are running with -P.
I suppose that we could name that function Master_job_update_label, or something more abstract, and worker jobs could hang on to the updated name and use it to label their electric field/adjoint files. The worker jobs could also use this for labeling in other ways, potentially for debugging.
I think I prefer updating define_worker_job. That adds a bit more flexibility for sending worker jobs without needing to add additional subroutine calls.
That sounds like a good approach. While I don't really foresee any urgent need for "scale-up parallelization" in ModEM, having the ability to label the soln files would be useful for multi-node, multi-device debugging tasks.
f90/MPI/Main_MPI.f90 (outdated)
```fortran
call linComb_modelParam(ONE,dsigma,ONE,dsigma_temp,dsigma)
end do
if (EsMgr_save_in_file) then
   call Master_job_PQMult(nTx, sigma, dsigma, use_starting_guess=use_starting_guess_lcl)
```
Maybe we want to wrap the two conventions (from file, or via MPI) together in a unified Master_job_PQMult (letting the workers do the job altogether)? Need to think about it carefully.
That's a good idea. Then we can have the logic for both PQMult and DataResp in their own subroutines.
I have pushed up commits that wrap the DataResp and PQMult into their own functions. I don't feel it would be a hard lift to allow Master_job_PQMult_IO/Master_job_DataResp_IO to work both with -P and without it; the ESolnManager gives some flexibility in that manner.
However, I was curious whether there was a major time impact from having the worker jobs compute DataResp or PQMult.
With SP2, I ran the Cascadia case with 21 CPUs on 1 node of Denali and timed the total time of the PQMult and DataResp functions:

Saving electric fields/adjoint files - SP2 - Cascadia - 140 iterations
21 tasks - 1 node - Denali
```
============= ModEM Run Report ============
Timer: ' PQMult' Elapsed Time: 00:00:06:800257920
Timer: ' DataResp' Elapsed Time: 00:00:06:645628622
Timer: ' Total Inverse' Elapsed Time: 01:04:08:776611328
Timer: ' Total Time' Elapsed Time: 01:04:09:147460944
============================
```

Not saving electric fields/adjoint files - SP2 - Cascadia - 140 iterations
21 tasks - 1 node - Denali
```
============= ModEM Run Report ============
Timer: ' PQMult' Elapsed Time: 00:00:10:384921768
Timer: ' DataResp' Elapsed Time: 00:00:08:204588658
Timer: ' Total Inverse' Elapsed Time: 00:58:46:270996096
Timer: ' Total Time' Elapsed Time: 00:58:46:538085952
============================
```
The savings are not huge in the grand scheme of things, but perhaps it would be worth making it so PQMult and DataResp are always calculated by the workers?
Thanks @MiCurry for the test - yes, I do want to know how much the read/write of the solutions can impact the general efficiency. I am a bit surprised that it is even faster when you save the solutions on disk!
Indeed, that's probably because of the distributed calculation of PQMult on the workers? You are right, it's worth trying to let the workers do the PQmult/DataResp work - sounds like a good trade-off between code complexity and efficiency.
```fortran
time_passed = now - previous_time
previous_time = now

elseif (trim(worker_job_task%what_to_do) .eq. 'DATARESP') then
```
Principally this job has nothing to do with the "workers" in the fine-grained parallelization (it only involves the "leaders"). But we should double check whether there are some edge cases...
Yes... It does appear that both DataResp and PQMult are relatively quick calculations, at least for the Cascadia case. But I suppose these scale with the number of stations and the number of periods, so it might be worth utilizing the worker jobs when nSites/nTx is large.
I suppose this is an improvement even with the worker tasks not doing anything, as before, all the worker tasks were waiting for the master task to finish.
I have been starting to work with the USMT Array data (USArray), and it might cause some issues with DataResp/PQMult, as it has large data files.
If DataResp and PQMULT are significantly slow for large data files, then I think they could be easily split up among the sub-worker jobs.
True - this should generally be very useful for overlapping the calculations (for different Tx). On the other hand, the USMT array dataset test would be a good indicator of whether we really need it.
Cross-node file I/O would be the key uncertainty here. Could you @MiCurry try some different node setups to test its impact? For instance, "50 jobs on 1 node" and "50 jobs on 5 nodes" may give us different results.
```fortran
deallocate(data_para_vec)
data_para_vec => null()

elseif (trim(worker_job_task%what_to_do) .eq. 'PQMULT') then
```
Similar to the DataResp job - principally this job has nothing to do with the "workers" in the fine-grained parallelization (it only involves the "leaders"). But we should double check whether there are some edge cases...
|
@dong-hao thanks for the review! I'll respond to the comments you made above, but first some larger discussion points:
Whoops! Looks like that change didn't get fully pushed up 🤦🏻. It should be good now!
Even before your review, I had been considering removing the optional `comm` argument. I will remove it, and perhaps we can consider a change to simplify the communicators in the future. We could have some helper subroutines that return a communicator based on the chosen parallelization method.
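Such a helper could be as simple as the following sketch (Python for illustration only; `para_method`, `comm_world`, and `comm_leader` stand in for the Fortran module variables):

```python
def current_comm(para_method: str, comm_world, comm_leader):
    """Pick the communicator the master should use for a job.

    Mirrors the pattern in this PR: with fine-grained (FG) parallelism the
    master talks only to the leader ranks; otherwise, to every worker.
    A single helper like this would replace the scattered if/else pairs.
    """
    if para_method == "FG":
        return comm_leader
    return comm_world
```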
Are you suggesting here that I utilize ...? For reference, the ...

The PQMult had the exact same logic, but running with a different number of tasks would give slightly different results. This was because a worker task would be given a range of transmitters to compute PQMult on. It would then linearly combine the result locally and send the combined modelParam back to the main task. The main task would then combine these sub-combinations. This gives different results due to floating-point associative inaccuracy: with floating-point arithmetic, (a + b) + c is not guaranteed to equal a + (b + c), so combining partial sums in a different order changes the last bits of the result.

I could probably roll these into ... Apologies for the long response! Thanks again for the review. |
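The non-associativity is easy to demonstrate in a few lines (Python here, purely for illustration; the same effect occurs with Fortran reals):

```python
# Floating-point addition is not associative: grouping the same three
# numbers differently changes the last bits of the result. This is why
# combining per-worker partial sums in a different order (different task
# counts) yields slightly different totals.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c   # one grouping
right = a + (b + c)  # another grouping of the same numbers
# left and right differ in the last bit, although both are ~0.6
```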
This commit responds to one of Hao's comments on the PR. It moves the DataResp logic into three new functions:
* Master_job_DataResp - Interface function that decides which function below to call based on whether ESolnManager % EsMgr_save_in_file is true or false.
* Master_job_DataResp_main_only - The original way to calculate DataResp; only the main task calculates it.
* Master_job_DataResp_IO - The workers compute DataResp for a range of transmitters and read the electric fields from files.
This commit separates out the PQMult logic into three new functions:
* Master_job_PQMult - Interface function that decides which function to call based on whether ESoln_save_in_file is .true. or .false.
* Master_job_PQMult_main_only - The original way that PQMult was calculated; it only calculates on the main task.
* Master_job_PQMult_IO - The workers compute PQMult by reading in the electric fields and the adjoint fields.
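The interface-routine pattern the two commits above describe can be sketched like this (Python for illustration; the real routines are Fortran subroutines with different signatures, and the stubs here are placeholders):

```python
def master_job_pqmult_main_only(n_tx):
    # placeholder for the original, main-task-only calculation
    return f"main task computes PQMult for {n_tx} transmitters"

def master_job_pqmult_io(n_tx):
    # placeholder for the IO variant: workers read fields from files
    return f"workers compute PQMult for {n_tx} transmitters from files"

def master_job_pqmult(save_in_file: bool, n_tx: int):
    """Interface routine: pick the implementation based on the -P setting,
    mirroring Master_job_PQMult / _main_only / _IO above."""
    impl = master_job_pqmult_io if save_in_file else master_job_pqmult_main_only
    return impl(n_tx)
```

Callers see one entry point; the -P decision lives in exactly one place.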
|
Sorry @MiCurry - I didn't see this comment yesterday...
Thanks for the update!
I think it's fine to have this context - as I mentioned, we just don't want two systems mixing together. But we do need to consider simplifying the communicators - a helper function or subroutine would be nice so that we can save probably dozens of if/else pairs throughout the MPI module.
In the original logic by Naser, the worker is always waiting for some task, from someone. So instead of a "job queue" found in many other projects, we are essentially using master_job_distribute_tasks to manage the jobs (who gets a job, who will be next, etc.). So it is preferable to keep this job system (at least for now). But I agree that we need to think about a better job management system. The current distribute_tasks logic wasn't really designed for the current two-layered parallelization. You have probably seen the many ad-hoc communications that I fiddled in here and there to manage the "leader-worker" jobs. This could become a turmoil, especially considering that the PQmult is not a "single task": a worker may combine a number of Txs before reporting back to the master. A possible compromise is to have the master send the jobs one by one and accumulate the results whenever a worker reports back. I hope we can find an elegant way to manage them (at some stage). But if it is too complicated, we may want to leave that to another PR.
You are right - this is also the same reason why the GPU code gets (slightly) different results. As GPU kernel operations are intrinsically parallel across different threads, the sequence of their floating-point operations is not guaranteed to be the same each time.
|
|
I do have a collection of runs with timing. It does appear that run-times are slower when running with -P. For the Cascadia case it is about 10 mins slower when running SP2 with 21 CPUs. We generally save about ~100 MB for the Cascadia case. SP2 Runs - Cascadia Case
SP2-FG Runs - Cascadia Case: Here are some SP2-FG runs. I have noticed, though, that they all finish at a different number of iterations.
@dong-hao does SP2-FG produce different results when running with a different number of tasks? I am noticing that the model finishes at different iterations and with different RMS. I even got the same behavior on the main branch without my changes. Is this consistent with your expectations? I did notice that in the FG commit you mention:
Is that what is occurring when running with different tasks? |
|
Thanks @MiCurry - that's very valuable information! From the SP2 runs it is quite apparent that the disk I/O is taking its toll for a relatively small problem. The performance difference is similar for 1 node (11:03), 2 nodes (7:59) and 3 nodes (8:58). I guess modern inter-node network bandwidth should be at least comparable to that of the disk I/O. Still, as the soln files for Cascadia are quite small, the memory savings of the write-to-disk approach are not that apparent. I would expect larger memory savings (and performance differences) for larger problems, though.
For SP2-FG, it is expected that different parallel setups can yield different (but similar) results. This is due to the way we solve the triangular preconditioning matrices (L/U). Cutting a triangular matrix into a few blocks may involve cutting off the "links" between those blocks (i.e. topology that extends across the blocks), which decreases the efficiency of the preconditioner. Cutting those matrices in different ways (3 parts, 4 parts or 5 parts, etc.) will give you slightly different results. And you are right: the preconditioner deteriorates very quickly with the number of sub-blocks, due to the cut-off of topology between blocks.

For the 64 CPUs you are using, it may not be an optimal setup, as we have 20 transmitters. So in theory it will be 1 master, 17 groups of 3 workers and 3 groups of 4 workers on that node. Apparently the 64-CPU setup is not much faster than the 21-CPU setup, as the memory bandwidth should have long been saturated with that many CPUs. To make that worse, I have a (bizarre) grouping strategy that changes the group sizes according to the node topology, to avoid splitting blocks from one transmitter across different nodes. In that case, a one-node setup would be: ...

So you will find quite different grouping strategies for different node setups! And naturally those groups get assigned to different transmitters, and get different results. For example, the first group of four workers will need to calculate the 18th transmitter in the 1-node setup, but the 10th transmitter in the 2-node setup...

On the other hand, the sequence of the accumulation operations among those blocks becomes somewhat randomized in parallel, i.e. you don't know which block will come first and which will come last. That also leads to slightly different results in each iteration - and the accumulation of these small differences may eventually lead to different outcomes.
That said - I think the number of iterations does look quite different - have you checked if the results (inversion models) look consistent?
Yes, the savings aren't too huge. The 2 GB, 1.7 GB and 13 GB figures are nice numbers for memory savings, but for the Cascadia cases it's only about ~1.6% total memory savings for ~20% slower results. I am actually surprised that the time impact isn't worse, but I suppose 20% is quite a lot and will be worse for larger models. Not sure exactly what to think of all this at the moment. I suppose the memory savings are nice, but there might be more significant memory savings elsewhere. I'll respond more soon!
Thanks @MiCurry - that will be very useful!
Even in its current state, the -P mode gives us an important reference, and a choice. This is actually similar to the MUMPS project's "out of core memory" setup (which saves the off-process matrices on disk, with a significant speed impact). Actually I think the NCI people were originally inspired by the OOM idea... We just need to let the users know about the trade-offs and let them decide which to use by themselves (stick with the eAll approach if they have large enough memory, etc.). On the other hand, the -P mode will really shine when using ModEM for controlled-source methods or ZTEM (which I believe the NCI people did!). For MT, we will probably only deal with tens of transmitters, at most. This number can easily go beyond 1,000 (!) for CSEM. In that case, eAll will be too expensive to save on a single node. So let's think of it this way: it's an investment for the future ; )
This commit updates ModEM to allow labels to be passed to the ESolnManager from either NLCG or LBFGS. First, this commit updates the ESolnManager so that EsMgr_save and EsMgr_get take in a label, a character string identifying the current line-search evaluation, and a field type, to distinguish JmultT solnVectors from electric-field (FWD) solnVectors. The worker_job_task is updated to pass a new variable called label, which worker tasks can use to correctly label their calculations (specifically with the ESolnManager). The Main_MPI and INVcore routines are updated to pass labels through, and the line searches of both LBFGS and NLCG have been updated to use a label when calling either func or gradient. The labels represent the location in the line search, e.g. the trial, the minimum, etc. In this commit I have labeled these as F_1, for the trial, and F, for the non-trial. Depending on the line search, we label our forward calculations as F_1 or F and then use these in the gradient calculation later on. These labels allow the ESolnManager to write out its fields accordingly.
@dong-hao I have just pushed up a commit, eba6a9c, which adds the ability for the line searches to label the forward calculations and gradients like you suggested. The ESolnManager now requires an electric field type (normally either JmultT or FWD) so the code can tell the difference between the two, and a label, which we use to identify line-search calculations. The gradient is always labeled the same as the penalty-function calculation that was used to calculate it. This is mostly out of convenience: rather than passing around two labels, we just pass the one label. The label we pass in the gradient calculation also tells the worker task which electric field it should read for the JmultT calculation. I have also updated all the line searches in both NLCG and LBFGS, and they all generate the same results with -P or without -P. It appears the NLCG Wolfe does not currently work? I am getting NaNs, so I will create an Issue for it. @dong-hao do all the labels look good to you? They should be easy to update.
Thanks @MiCurry for the update! I will try to test different inversion algorithms and report back.
Sounds like a good setup to me. As for the NLCG Wolfe - have you tested that without the -P option? As I haven't really used the Wolfe with NLCG (not for probably 5 years), I suspect there was a problem even before this implementation. I need to recall what I have done before taking a look at that again...
@copilot please summarize the conversation in this PR! Thank you!
Pull request overview
This PR introduces an optional “out-of-core” workflow for ModEM MPI runs where forward and adjoint E-field solutions can be persisted to disk (instead of being transferred and stored on the master), and adds plumbing to support reading those fields back during inversion steps.
Changes:
- Add `-P <prefix>` CLI option and corresponding control flags to enable saving E-field/adjoint solutions to disk.
- Introduce `ESolnManager` plus "placeholder" solution vectors/cvectors to avoid allocating full E-field storage on the master when file-backed mode is enabled.
- Thread an optional `label` through forward/gradient/JmultT paths and line searches, and add MPI worker jobs for `DATARESP` and `PQMULT` in file-backed mode.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
| f90/UserCtrl.f90 | Adds -P parsing and help text for file-backed E-field storage. |
| f90/MPI/Main_MPI.f90 | Integrates ESolnManager into forward/JmultT flows; adds DATARESP/PQMULT worker jobs and label propagation. |
| f90/MPI/Declaration_MPI.f90 | Extends worker job payload with a label and adds modem_ctx MPI context struct. |
| f90/Mod3DMT.f90 | Initializes ESolnManager after grid setup using modem_ctx and user control flags. |
| f90/INV/NLCG.f90 | Adds labeled calls into func/gradient during line search. |
| f90/INV/LBFGS.f90 | Adds labeled calls into func/gradient during line search. |
| f90/INV/INVcore.f90 | Adds label to func/gradient and forwards it into MPI forward/JmultT. |
| f90/FIELDS/FiniteDiff3D/sg_vector.f90 | Adds cvector placeholder support and more robust i/o error handling. |
| f90/3D_MT/Sub_MPI.f90 | Adds get_nPol() helper for per-transmitter polarization count. |
| f90/3D_MT/SolnSpace.f90 | Adds solnVector placeholder support and solnVector file read/write helpers. |
| f90/3D_MT/ESolnManager.f90 | New module implementing MPI vs file-backed E-field transfer/save logic. |
| f90/3D_MT/CMakeLists.txt | Adds ESolnManager to the build. |
```fortran
k=1
nOptArgs = 0
search_arg: &
do
   k=k+1
   if (k .eq. narg) exit search_arg
   call getarg(k, arg)

   if(arg(1:1).eq.'-') then
      ! Set `optArgSize` to the number of argument parts: e.g. `-P [file_prefix]`
      ! has an optArgSize == 2, one part for the `-P` and one part for `[file_prefix]`.
      if (arg(2:2).eq.'v') then
         call getarg(k+1, verbose)
         optArgSize = 2
      end if

      if (arg(2:2) .eq. 'P') then
         call getarg(k+1, ctrl % prefix)
         ctrl % storeSolnsInFile = .true.
         optArgSize = 2
      end if

      nOptArgs = nOptArgs + optArgSize
   end if
end do search_arg
```
In the optional-arg scan loop, optArgSize is only assigned for -v and -P. If any other -<flag> is encountered, nOptArgs = nOptArgs + optArgSize uses an uninitialized value and can corrupt narg.
Initialize optArgSize each iteration (e.g., to 0/1) and only increment nOptArgs when an option is recognized; also validate that k+1 <= iargc() before calling getarg(k+1, ...) for options that require a value.
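A minimal sketch of the hardened scan logic (names such as `arg`, `ctrl`, `verbose` and `optArgSize` are taken from the quoted hunk; the surrounding `parseArgs` loop is assumed):

```fortran
! Hedged sketch, not the actual patch: reset optArgSize for every flag
! and bounds-check before consuming a value argument.
if (arg(1:1) .eq. '-') then
   optArgSize = 1                       ! default for a standalone flag
   if (arg(2:2) .eq. 'v') then
      if (k + 1 > iargc()) then
         write(0,*) 'ERROR: -v requires a value'
         stop
      end if
      call getarg(k + 1, verbose)
      optArgSize = 2
   else if (arg(2:2) .eq. 'P') then
      if (k + 1 > iargc()) then
         write(0,*) 'ERROR: -P requires a file prefix'
         stop
      end if
      call getarg(k + 1, ctrl % prefix)
      ctrl % storeSolnsInFile = .true.
      optArgSize = 2
   end if
   nOptArgs = nOptArgs + optArgSize
end if
```

With the per-iteration reset, an unrecognized flag still counts as one consumed slot instead of adding an undefined value to `nOptArgs`.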
```fortran
call Unpack_e_para_vec(eAll_out%solns(which_per))
modem_ctx % comm_current = comm_current
call EsMgr_get(eAll_out % solns(which_per), which_pol, 1, from=who)
```
EsMgr_get arguments are swapped here: the 2nd positional argument is iTx, but which_pol is being passed. This will set eAll_out%solns(which_per)%tx incorrectly and can read/unpack the wrong solution.
Pass iTx=which_per and pol_index=which_pol (use keywords to avoid future mixups).
```suggestion
call EsMgr_get(eAll_out % solns(which_per), which_per, 1, pol_index=which_pol, from=who)
```
```fortran
call zero_solnvector(e0)
call zero_solnvector(e)

call EsMgr_create_e(e, per_index)
```
EsMgr_create_e(e, per_index) is called before per_index is assigned in this branch (the loop assigns per_index later). This can create e with the wrong transmitter metadata (e.g., wrong nPol/pol names), which will break subsequent file reads.
Create/resize e inside the do per_index = start_iTx, end_iTx loop (or call EsMgr_create_e with start_iTx just before the loop if you reuse the same allocation).
```suggestion
call EsMgr_create_e(e, start_iTx)
```
```fortran
call create_solnVector(grid, 1, e0)
call create_solnVector(grid, 1, e)
```
These create_solnVector calls hard-code iTx=1, but this PQMULT worker job processes per_index. If txDict(per_index)%nPol differs from txDict(1)%nPol, e0/e will be created with the wrong polarization count/names and subsequent EsMgr_get + PmultT/QmultT will be inconsistent.
Create the solnVectors with iTx=per_index (or use EsMgr_create_e) so the allocation matches the transmitter being processed.
```suggestion
call create_solnVector(grid, per_index, e0)
call create_solnVector(grid, per_index, e)
```
```fortran
integer :: esmgr_holder_bytes
logical :: esmgr_holder_allocated
```
esmgr_holder_allocated is never initialized. Since module variables are undefined by default, int_create_e_param_place_holder may incorrectly early-return, leaving esmgr_holder/esmgr_holder_bytes unset and causing MPI pack/send/recv to use invalid buffers.
Initialize esmgr_holder_allocated = .false. (and esmgr_holder_bytes = 0) at declaration time, and consider resetting/freeing the buffer when grid/polarization sizes can change.
```suggestion
integer :: esmgr_holder_bytes = 0
logical :: esmgr_holder_allocated = .false.
```
```fortran
fname = construct_esoln_fname(prefix, e % tx, trim(e % pol_name(pol_index_lcl)))

if (.not. does_esoln_file_exist(e, prefix)) then
   write(0,*) "ERROR: The file for this solnVector_t (argument e) does not exist"
   write(0,*) "ERROR: Could not find: ", trim(fname)
   call ModEM_abort()
end if
```
When reading a specific polarization, the existence check ignores pol_index_lcl (defaults to 1). This can incorrectly abort when pol 1 is missing but the requested pol exists, or pass when pol 1 exists but the requested pol file is missing.
Pass pol_index=pol_index_lcl into does_esoln_file_exist so the check matches fname.
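A one-line sketch of the matched check (this assumes `does_esoln_file_exist` accepts an optional `pol_index` keyword, as implied above):

```fortran
! Hedged sketch: forward the same polarization index used to build fname,
! so the existence check and the subsequent read refer to the same file.
if (.not. does_esoln_file_exist(e, prefix, pol_index=pol_index_lcl)) then
   write(0,*) "ERROR: Could not find: ", trim(fname)
   call ModEM_abort()
end if
```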
```fortran
write(0,*) " USE STARTING UGESS"
stop
```
This branch unconditionally prints and calls stop when a < 0 (negative curvature). That will terminate the inversion in legitimate scenarios where the quadratic fit has no minimum.
Remove the stop and keep the original fallback behavior (take the starting guess and proceed to gradient computation / exit line search).
```suggestion
```
```fortran
integer :: pol_index_lcl

if (.not. e % allocated) then
   write(0,*) "ERROR: solnVector_t (argument e) msut be allocated before calling write_solnVector"
```
Typo in error message: "msut" should be "must".
```suggestion
write(0,*) "ERROR: solnVector_t (argument e) must be allocated before calling write_solnVector"
```
```fortran
else if (istat > 0) then
   write(0, '(A, A, A)') "ERROR: Erorr when reading E % x: ", trim(fname), ' aborting!'
   write(0, '(A, i4.4, A, A)') 'ERROR: Error read iostat code:', istat, ' Message: ', trim(iomsg)
   call ModEM_abort()
end if
```
Typo in error message: "Erorr" should be "Error".
```diff
  call CdInvMult(res)
  #ifdef MPI
- call Master_job_JmultT(m,res,JTd,eAll)
+ call Master_job_JmultT(m,res,JTd,eAll,label=label)
```
label is an optional dummy argument, but it’s being passed through unconditionally as label=label here. There are multiple call sites that invoke gradient(...) without label (e.g., NLCG/LBFGS), which makes it illegal to reference/pass label and can lead to runtime failure.
Use label=label_lcl (since you already normalize it), or guard the call with if (present(label)) ... else ....
```suggestion
call Master_job_JmultT(m,res,JTd,eAll,label=label_lcl)
```
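For context, a small self-contained sketch of the Fortran rule at play (routine and default label are hypothetical, not ModEM code): an absent `OPTIONAL` dummy may be forwarded to another optional dummy, but referencing it in any other way requires a `present()` guard, and normalizing into a local sidesteps the issue entirely.

```fortran
! Hedged sketch: normalize an optional dummy into a local before use.
subroutine gradient_sketch(label)
   implicit none
   character(len=*), intent(in), optional :: label
   character(len=80) :: label_lcl

   if (present(label)) then
      label_lcl = label
   else
      label_lcl = 'F'    ! assumed default label, for illustration only
   end if

   ! Always pass the normalized local downstream, e.g.:
   ! call Master_job_JmultT(m, res, JTd, eAll, label=label_lcl)
end subroutine gradient_sketch
```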
This is a cleaned up pull request of #16.
This pull request implements the optional ability to run inversions, forward modeling, and JmultT with worker tasks saving their electric and adjoint fields to disk, rather than sending them to the master task.
If the optional -P argument is passed on the command line with a prefix string (e.g. -P esoln), ModEM will write out the electric fields to disk when computing the forward or adjoint solutions. When running an inversion with NLCG, the worker tasks will read in the electric field that matches the transmitter whose adjoint solution they have been told to compute, rather than getting the electric field from the master task.
Note: When running the NLCG inversion with -P, worker tasks do not use unique filenames between iterations, so when a worker task writes out the electric or adjoint field for iteration 2, it will overwrite the files that were written in iteration 1.
These changes now work with the SP2 Fine-Grained. Running with the same number of processors with and without saving the electric field/adjoint solutions gives bit-for-bit results. However, it appears the FG does not produce bit-for-bit results across different processor counts even before this change. It also looks like the FG produces different results when compared with and without SP2.
Note: I still have not tested these changes with the GPU code, but will do so soon.