
Add the ability to save electric and adjoint fields to disk (ESolnManager)#42

Open
MiCurry wants to merge 21 commits into magnetotellurics:main from MiCurry:esoln/ESolnManager

Conversation

@MiCurry
Member

@MiCurry MiCurry commented Mar 4, 2026

This is a cleaned-up pull request of #16.

This pull request implements the optional ability to run inversions, forwards, and JMultT with worker tasks saving their electric and adjoint fields to disk, rather than sending them to and from the master task.

If the optional -P argument is passed on the command line with a prefix string (e.g. -P esoln), ModEM will write the electric fields to disk when computing the forward or adjoint solutions.

When running an inversion with NLCG, the worker tasks will read in the electric field that matches the transmitter whose adjoint solution they have been told to compute, rather than receiving the electric field from the master task.

Note: When running the NLCG inversion with -P, worker tasks do not use unique filenames between iterations; when a worker task writes out the electric or adjoint field for iteration 2, it will overwrite the files that were written in iteration 1.

These changes now work with the SP2 Fine-Grained (FG) code. Running with the same number of processors, with and without saving the electric field/adjoint solutions, gives bit-for-bit results. However, it appears the FG does not produce bit-for-bit results across different processor counts even before this change. It also looks like the FG produces different results when compared to SP2 without FG.

Note: I still have not tested these changes with the GPU code, but will do so soon.

@MiCurry MiCurry requested a review from akelbert March 4, 2026 23:25
MiCurry added 11 commits March 5, 2026 12:42
This commit updates UserCtrl to allow a new final, optional argument, -P. This
argument takes a character string, [file_prefix]. When passed to ModEM, ModEM will
set `ctrl % storeSolnsInFile` to `.true.` and will set `ctrl % prefix` to
[file_prefix].

This commit also updates the `search_args` do loop in `parseArgs` to better
handle additional final optional arguments. It should now be easy to expand it
to allow more optional arguments; just ensure you set `optArgSize` correctly.
For single, standalone options without an argument, `optArgSize` should be 1;
options with an argument should set `optArgSize` to 2.
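To illustrate the `optArgSize` convention, here is a minimal, hypothetical sketch of such a scan loop (this is not the actual `parseArgs` code; everything other than `optArgSize` and `-P` is made up for illustration):

```fortran
program parse_opts_sketch
   implicit none
   character(len=256) :: arg, prefix
   logical :: storeSolnsInFile = .false.
   integer :: iarg, optArgSize

   iarg = 1
   do while (iarg <= command_argument_count())
      call get_command_argument(iarg, arg)
      select case (trim(arg))
      case ('-P')
         optArgSize = 2                  ! option plus its argument
         call get_command_argument(iarg + 1, prefix)
         storeSolnsInFile = .true.
      case default
         optArgSize = 1                  ! standalone option, no argument
      end select
      iarg = iarg + optArgSize           ! skip past the option and its argument
   end do
end program parse_opts_sketch
```

Advancing the loop index by `optArgSize` is what lets a new option with or without a value be added without restructuring the loop.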

This commit also updates the user guide with the details of this new optional
argument for inversion.
This commit creates three new routines in SolnSpace that allow reading and
writing of solnVector_t types. The routines read and write the active
transmitter and the active polarization of the passed-in solnVector_t.

These routines will be used to save partial solutions, so perhaps they are a
bit limited in their interface. It may be appropriate to have a routine that
takes in the desired iTx and polarization and allocates a solnVector_t, rather
than requiring it to be already allocated and created. I'll think about that,
but for now this should be sufficient.
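The alternative interface floated above might look something like this (a purely hypothetical signature, not what the commit implements; all names are illustrative):

```fortran
! Hypothetical wrapper: allocate and create the solnVector_t for the
! caller, given the desired transmitter and polarization, instead of
! requiring a pre-created vector. Sketch only.
subroutine read_solnVector_for_tx(fname, iTx, iPol, e, ios)
   character(len=*),   intent(in)  :: fname
   integer,            intent(in)  :: iTx, iPol
   type(solnVector_t), intent(out) :: e    ! allocated and created inside
   integer,            intent(out) :: ios  ! nonzero on I/O error
   ! ... open fname, create e for (iTx, iPol), read the field data ...
end subroutine read_solnVector_for_tx
```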
This commit creates a ModEM_mpi_context_t type which will be used to store and
pass around MPI communicators to the ESolnManager.

If it works well, perhaps in future merges it can be used in other areas of
ModEM.

As of this commit, the ModEM_mpi_context_t is not used, but will be in a future
commit.
This commit allows `solnVectors_t` and `cvectors_t` to be created as 'placeholders'.
These placeholder `solnVectors_t` and `cvectors_t` will only contain the
metadata of the electric solution and won't contain any data.

In conjunction with the `ESolnManager`, this will allow the main task to *not*
allocate eAll, saving a bit of unnecessary memory.
This commit simply adds the ESolnManager to the ModEM code base and adds it to
the CMakeLists.txt.

The ESolnManager will serve as a manager of electric and adjoint fields for the
main task and the worker tasks. Depending on the arguments passed in, it will
either have the worker tasks communicate the electric and adjoint fields by
writing and reading them to disk, or it will send the fields to and from the
main task using MPI (the current ModEM implementation).

Perhaps in a future commit the ESolnManager will be responsible for allocating
and deallocating the solnVectors_t.
This commit adds error checking when reading and writing cvectors. Now that
these routines will be used more often, it will be beneficial to ensure that
these routines complete successfully.
This commit adds a 'trial' logical to the define_worker_job type. This
logical will be used in future commits to signify to worker tasks that a trial
is being calculated.

This will allow the ESolnManager to save files with an appropriate
filename (e.g. labeling them as 'trial').
This commit adds the PQMult and DataResp master and worker jobs. These
routines will be used with the ESolnManager implementation.

It also updates create_data_vec_place_holder and associated routines so that
they can take a range of data instead of just one.
This commit initializes the ModEM context in Constructor_MPI.
This commit adds a new flag to func and to gradient. The flag in func can be
used to signify to the worker tasks that a trial forward is being calculated.
The worker_job will use this in the future to write out its electric field
to a file with the appropriate prefix.

When we use the initial guess in the line search, we will also need to tell the
JMultT calculations to read the trial electric field rather than the
computations of the line search. Thus, we need to tell gradient and JmultT
to use the starting guess. Worker jobs will then be able to know whether they
should read from the trial file or not.
This commit finally integrates the ESolnManager into ModEM. As of this commit it
generates bit-for-bit results when running with or without saving the electric
field solutions.
@MiCurry
Member Author

MiCurry commented Mar 6, 2026

This branch works with the SP and SP2 forward formulations; however, it does not work if you use the Fine-Grained (FG) version of SP2. I'll look into resolving that.

This commit updates the Main_MPI code that was added in previous commits to
allow the new MPI code associated with the ESolnManager to work with SP2 FG.

To do this, it updates DataResp and PQMult to post their MPI sends and
receives not on MPI_COMM_WORLD but on either comm_world or comm_leader,
depending on the para_method.

Both DATARESP and PQMULT are performed by sending the work to
the leader tasks. Because these tasks are rather simple and fast, the leader
tasks do not send the worker tasks anything.

Running with FG on equal processor counts, with or without -P, gives
bit-for-bit results.

It appears that running with FG at different processor counts does not
give bit-for-bit results; this may be an issue that predates this change. Running
without FG at different processor counts gives the same results, but there are
differences when running with and without FG.

I'll open an issue and investigate that further.
@MiCurry MiCurry force-pushed the esoln/ESolnManager branch from 2ec752e to 593076d on March 9, 2026 16:46
@MiCurry MiCurry marked this pull request as ready for review March 9, 2026 16:55
@MiCurry
Member Author

MiCurry commented Mar 9, 2026

I think this PR is ready for a review! It now works with MF, SP, SP2, and SP2 Fine-Grained! See the note in the original comment about the FG method not producing bit-for-bit results across processor counts; I'll open an issue about that.

I still have not tested this when running with GPU, but I will do that and add any updates. Since this works with all the forward formulations and the FG branch, I think it's ready for a review from others! @akelbert and @dong-hao feel free to give it a review if you have the time!

@MiCurry MiCurry requested a review from dong-hao March 9, 2026 17:05
@MiCurry
Member Author

MiCurry commented Mar 9, 2026

Also, it's worth noting that I still haven't done much evaluation of the IO impact when all the tasks are writing out. I have done some preliminary memory-usage comparisons, but I want to do some more formal memory consumption evaluations. I'll evaluate the runtime impact of the IO when I do that evaluation.

remainder = modulo(nTx, size_leader)
iTx_max = 0

do dest = 1, size_leader
Member Author


I think this should be size_leader - 1?


Member


indeed, but we should think about how we can simplify the comm-related parameters.

cycle
end if

do dest = 1, size_leader
Member Author


I think this should be size_leader - 1?

Member


indeed, but we should think about how we can simplify the comm-related parameters.

comm_current = comm_leader
end if

modem_ctx % comm_current = comm
Member Author


I think modem_ctx % comm_current should be set to comm_current

!----------------------------------------------------------------------------
!########################## Master_job_DataResp ############################
subroutine Master_job_DataResp(nTx, sigma, d, trial)
subroutine Master_job_DataResp(nTx, sigma, d, trial, comm)
Member Author

@MiCurry MiCurry Mar 11, 2026


This looks strange here: we are passing in comm, but perhaps not using it correctly. See the comment below. Do we even need to pass in comm here?

Member


Thanks @MiCurry for the update!

Hmmm, normally we don't need an MPI_COMM input, as it is a module parameter that can be accessed by anyone in the module. I think it would generally be safer to just use comm_current from the module.

There are rare exceptions of course, such as fine-grained parallel calculations where the worker processes need to coordinate with each other outside of the Main MPI module.

I can take a look at this - but it may take some time as I will need to finish a report tomorrow.

Member Author


Thanks for the comment and the info, @dong-hao!

It will be great to have your review, but don't feel rushed. We're not working under any deadlines, so there is no rush.

MiCurry added 2 commits March 12, 2026 14:57
The last commit, which allowed SP2 to work, broke DataResp when using 2, 3, and 4
procs, and potentially other counts. The culprit was using size_leader instead of
size_leader - 1 in places.
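The corrected distribution pattern is along these lines (a sketch, not the actual routine; the point is that rank 0 is the master, so only size_leader - 1 leader ranks receive work):

```fortran
! Rank 0 of comm_leader is the master, so transmitters are distributed
! over the remaining size_leader - 1 leader ranks.
remainder = modulo(nTx, size_leader - 1)
do dest = 1, size_leader - 1
   ! ... send this leader its share of the nTx transmitters ...
end do
```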
@MiCurry MiCurry force-pushed the esoln/ESolnManager branch from 06dd68a to 5dfdd81 on March 12, 2026 21:42
@MiCurry
Member Author

MiCurry commented Mar 16, 2026

I have confirmed that this branch works when building and running with CUDA.

Member

@dong-hao dong-hao left a comment


Thanks @MiCurry for the update - I finally got some time to test the code.

I can confirm that the -P mode runs without problems for both SP and MF, at least for the "optimal parallel setup", i.e. using N+1 processes for N jobs. For the Cascadia case that is 21/20.

However, on some (not all that unusual) edge cases, it threw some errors when doing the adjoint calculation:

node[001]:    total time elapsed (sec):    17.302900
    FORWARD: Receive Per #     9 and Pol    2      from     1
    FORWARD: TX finished #    20 out of    20
 node[001]:    MPI TASK [                      DATARESP] received from     0
 node[003]:    MPI TASK [                      DATARESP] received from     0
 node[002]:    MPI TASK [                      DATARESP] received from     0
  Not reading the trial
  Not reading the trial
  Not reading the trial
  Not reading the trial
  Not reading the trial
  Not reading the trial
  Not reading the trial
  Not reading the trial
  Not reading the trial
  Not reading the trial
  Not reading the trial
  Not reading the trial
  Not reading the trial
  Not reading the trial
  Not reading the trial
  Not reading the trial
  Not reading the trial
  Not reading the trial
  Not reading the trial
  Not reading the trial
 node[002]:    Waiting for a message from Master
 node[002]:    total time elapsed (sec):    15.374959
 node[003]:    Waiting for a message from Master
 node[003]:    total time elapsed (sec):    17.953275
 node[001]:    Waiting for a message from Master
 node[001]:    total time elapsed (sec):    17.302900
 [Haos-Mac-mini:00000] *** An error occurred in MPI_Recv
 [Haos-Mac-mini:00000] *** reported by process [2488401921,0]
 [Haos-Mac-mini:00000] *** on communicator MPI_COMM_WORLD
 [Haos-Mac-mini:00000] *** MPI_ERR_RANK: invalid rank
 [Haos-Mac-mini:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
 [Haos-Mac-mini:00000] ***    and MPI will try to terminate your MPI job as well)
 --------------------------------------------------------------------------
 prterun has exited due to process rank 0 with PID 0 on node Haos-Mac-mini calling
 "abort". This may have caused other processes in the application to be
 terminated by signals sent by prterun (as reported here).
 --------------------------------------------------------------------------

I was using 4 processes on my Mac mini for 20 jobs in this case. Apparently, as @MiCurry pointed out, size_leader - 1 should be used (instead of size_leader) when the master sends out the DataResp and PQMult jobs for the leaders/workers. The adjoint calculation runs without problem after this correction.

I guess the 21/20 setup worked by sheer luck as the default size_leader = size_world ...

On the other hand, we are now mixing the usage of the different communicators. We need to think about the logic and clarify the usage of these (comm_world, comm_leader, comm_local, etc.) across the related modules.

Also - we need to follow the current logic of the job system for the new jobs (DataResp and PQMult). That said, we may want to reconsider this framework - a more consistent and flexible job system is preferable. The current logic has quite a few compromises here and there.

Hao

type(define_worker_job), save :: worker_job_task
!********************************************************************

type :: ModEM_mpi_context_t
Member


If we want to unify the MPI-related parameters with the datatype MODEM_MPI_CONTEXT_T here, we probably also want to remove the scattered comm_world, comm_local, etc. in the declarations.

  • We also need to include the comm_leader-related handles in the data type.

We can leave that to another PR, though.


nTx = d1%nTx
starttime = MPI_Wtime()
! over-ride the default communicator, if needed
if (present(comm)) then ! given communicator
Member


The optional "comm" here (and in other forward/adjoint-related subroutines) is relevant to the fine-grained methods (FG and PETSc). Essentially this just tells the subroutine "who calls you" and "whom you should communicate with" - it is not easy to determine that without the input.
I hope we can remove the comm-related trouble (which also causes some extra logic in the EMsolve3D module). But we need to think carefully about that (and leave it to another PR)!

remainder = modulo(nTx, size_leader - 1)
iTx_max = 0

do dest = 1, size_leader - 1
Member


I think it is probably better to use comm_current/size_current to avoid any mismatches - the master can talk with all workers (comm_world) or only the leaders (comm_leader).
The MPI_SEND below is using comm_current.

Member Author


Fixed!


logical :: keep_E_soln=.false.
logical :: several_Tx=.false.
logical :: create_your_own_e0=.false.
logical :: trial=.false.
Member


"trial" doesn't sound very informative...

Member Author


Agreed, and it's used twice to signal the task to perform two different, but related, events:

  1. Write out your electric field to a file with 'trial' appended to its filename.
    • This indicates that this forward computation is the trial line search calculation.
  2. Read the 'trial'-appended filename for your JmultT calculation.

The Stable-NCI branch does things a little differently to handle this:

  1. It has a Master_job_update_prefix to tell worker jobs to update the prefix written out for the electric field solution.
    • It does this when calculating the line search's trial, so we can backtrack if the trial is better than the line search.
  2. If the trial is the better choice in the line search, then the master job renames the 'trial.prefix.*.esoln' files to prefix.*.esoln.
  3. The worker tasks always read in prefix.*.esoln.

Perhaps that is a better way? Any thoughts @dong-hao?

Member


Thanks @MiCurry for the thoughts - either is fine, as those solutions are no longer used after the current iteration. The good part of the NCI approach is the (slightly) lower disk usage - and saving the final model response for the last iteration as a bonus.

However, this ultimately also depends on which inversion algorithm/search method you are using, i.e. we may need to deal with solns from more than two forward (FWD) and one adjoint (JmultT) evaluations. I know that you have just implemented the NLCG part (that was the only inversion algorithm by the time the NCI people got involved), but I guess it is better to fit more "general" cases.

For Anna's quadratic line search method (default for NLCG), we need two FWD and one JmultT solution - the "trial" convention would be good enough ("trial" is called "startup" in Anna's convention). This is also OK for my BFGS Wolfe line search (the same two FWD and one JmultT) and the Wolfe2 line search (normally only one FWD and one JmultT) most of the time.

However, it is possible that the quadratic line search fails to satisfy the Armijo condition in NLCG, and falls back to cubic line search, which requires three FWD calculations. And for the Wolfe2 line search, it is also possible (~5 % chance) that the f_1 (FWD) and g_1 (JmultT) do not satisfy the Wolfe condition, leading to two FWD and two JmultT solutions.

In other words, our naming convention and the write/load mechanism will need to deal with three (or more) forward solutions and two JmultT solutions! NOTE that we have not considered the DCG inversion (which needs even more JmultTs). We can leave that for later...

So I think a straightforward idea is to just use numbers...

say, something like:

esoln.FWD01.iTx.0001.Ex.cvec
esoln.FWD02.iTx.0001.Ex.cvec
esoln.JmultT01.iTx.0001.Ex.cvec
esoln.JmultT02.iTx.0001.Ex.cvec

My two cents, of course. But we do need to think about that carefully - not sure if @akelbert has a better idea.

For detailed info on the Wolfe2 line search, see my notes in the BFGS algorithm:

    ! Note: inexact line searches ultimately fit into two categories -
    ! i.e. Armijo–Goldstein (back-tracking) and Wolfe conditions
    ! the latter also contains a few different criteria: (1) "Armijo
    ! rule", and (2) "curvature condition", the latter requires an extra
    ! gradient calculation at x_k + alpha_k*d_k
    ! (1) and modified (2) will form the "strong Wolfe" condition which is
    ! the line search "convergence criterion" here

    ! see lineSearchWolfe subroutine for details of Anna's line search method
    ! (Armijo)
    !
    ! the really good part for LBFGS is, its search directions are sort of
    ! "normalized". so there is a pretty good chance that the first guess (one)
    ! is already a good search step that satisfies Wolfe's rule.
    ! For most of the time, it is not even necessary to do a line "search".
    ! Also, if one examines the LBFGS line search (as in lineSearchWolfe), it
    ! is quite often that the f at quadratic interpolation does not improve
    ! too much when compared with the first guess f_1.
    !
    ! So here is my idea:
    !
    ! when we have evaluated the penalty function for the initial guess (f_1),
    ! we immediately calculate the gradient (g_1) at the initial guess.
    ! 1) if f_1 and g_1 satisfy the Wolfe's rule, then we just proceed to the
    !    next iteration. The calculation cost is 1 penalty function evaluation
    !    and 1 gradient calculation. Everyone is happy(?)
    !
    ! 2) if f_1 and g_1 do not satisfy Wolfe's rule, we use a quadratic interp
    !    to find a minimum (f), and calculate the gradient (g) at the point.
    !    if f and g satisfy the rule, then we proceed to the next iteration.
    !    The cost is 2 penalty function evaluations and 2 gradient calculations.
    !    this is actually worse than Anna's scheme (two f and one grad).
    !    but don't worry yet, normally this should be a rare case (<5 %)
    !
    ! 3) if neither 1) nor 2) is satisfied, the scheme falls back to
    !    bracketing and sectioning with my variant of the More-Thuente
    !    scheme
    !
    ! following Anna's idea, the major intention here is to use a cheap line
    ! search scheme (with merely 2 forward-like calculations) to quickly skip
    ! to a small overall penalty function level and only to start bracketing
    ! when the quadratic interpolation doesn't work, in which case cubic
    ! probably won't work either...
    !
    ! in practice, wolfe2 should be slower (as the line search is not exact)
    ! in convergence (but still faster in time, as the line search scheme is
    ! 1/3 faster).
    ! this may lead to a quicker stop (or the update of the damping factor), as
    ! the descent may not meet the requirement of "enough progress".
    ! so I made a small modification in beta to allow the LBFGS to stall a
    ! little longer in each stage.

Member Author


Thanks for the response! There are a lot of considerations I am not privy to. I like your naming scheme.

Perhaps we can include a label in define_worker_job where the master job can specify the naming scheme. We could update Master_job_FwdPred or Master_job_JmultT to take in an optional label.

That would be better than having a Master_job_update_prefix, where prefix only means something if we are running with -P.

I suppose we could name that function Master_job_update_label, or something more abstract, and worker jobs could hang on to the updated name and use it to label their electric field/adjoint files. The worker jobs could also use this for other labeling, potentially for debugging.

I think I prefer updating define_worker_job. That adds a bit more flexibility to sending worker jobs without needing to add additional subroutine calls.

Member


That sounds like a good approach. While I don't really foresee any urgent need for "scale-up parallelization" in ModEM, having the ability to label the soln files would be useful for multi-node, multi-device debugging tasks.

call linComb_modelParam(ONE,dsigma,ONE,dsigma_temp,dsigma)
end do
if (EsMgr_save_in_file) then
call Master_job_PQMult(nTx, sigma, dsigma, use_starting_guess=use_starting_guess_lcl)
Member


Maybe we want to wrap the two conventions (from file, or from MPI) together with a unified Master_job_PQMult (let the workers do the job all together)? Need to think about it carefully.

Member Author


That's a good idea. Then we can have the logic for both PQMult and DataResp in their own subroutines.

Member Author


I have pushed up commits to wrap the DataResp and PQMult into their own functions. I don't feel like it would be a hard lift to allow Master_job_PQMult_IO/Master_job_DataResp_IO to work with or without -P. The ESolnManager does give some flexibility in that manner.

However, I was curious if there was a major time impact of having the worker jobs compute DataResp or PQMult.

With SP2, I ran the Cascadia case with 21 CPUs on 1 node of Denali and timed the total time of the PQMult and DataResp functions:

Saving electric fields/adjoint files - SP2 - Cascadia - 140 iterations
21 tasks - 1 node - Denali
 ============= ModEM Run Report ============
Timer: '                        PQMult'          Elapsed Time: 00:00:06:800257920
Timer: '                      DataResp'          Elapsed Time: 00:00:06:645628622
Timer: '                 Total Inverse'          Elapsed Time: 01:04:08:776611328
Timer: '                    Total Time'          Elapsed Time: 01:04:09:147460944
 ============================
Not saving electric fields/adjoint files - SP2 - Cascadia - 140 iterations
21 tasks - 1 node - Denali
 ============= ModEM Run Report ============
Timer: '                        PQMult'          Elapsed Time: 00:00:10:384921768
Timer: '                      DataResp'          Elapsed Time: 00:00:08:204588658
Timer: '                 Total Inverse'          Elapsed Time: 00:58:46:270996096
Timer: '                    Total Time'          Elapsed Time: 00:58:46:538085952
 ============================

The savings are not huge in the grand scheme of things, but perhaps it would be worth making it so PQMult and DataResp are always calculated by the workers?

Member


Thanks @MiCurry for the test - yes, I do want to know how much the read/write of the solutions can impact the general efficiency. I am a bit surprised that it is even faster when you save the solutions to disk!

Indeed, that's probably because of the distributed calculation of PQMult on the workers? You are right, it's worth trying to let the workers do the PQMult/DataResp work - it sounds like a good trade-off between code complexity and efficiency.

time_passed = now - previous_time
previous_time = now

elseif (trim(worker_job_task%what_to_do) .eq. 'DATARESP') then
Member


Principally this job has nothing to do with the "workers" in the fine-grained parallel mode (it only involves the "leaders"). But we should double-check if there are some edge cases...

Member Author


Yes... It does appear that both DataResp and PQMult are relatively quick calculations, at least for the Cascadia case. But I suppose these scale with the number of stations and the number of periods, so it might be worth utilizing the worker jobs when nSites/nTx is large.

Though I suppose this is an improvement even with the worker tasks not doing anything, as before, all the worker tasks were waiting for the master task to finish.

I have been starting to work with the USMT Array data (USArray), and it might cause some issues with DataResp/PQMult as it has large data files.

If DataResp and PQMult are significantly slow for large data files, then I think they could easily be split up among the sub-worker jobs.

Member


True - this should generally be very useful for overlapping the calculations (for different Tx). On the other hand, the USMT Array dataset test would be a good indicator of whether we really need it.

Cross-node file I/O would be the key uncertainty here. Could you @MiCurry try some different node set-ups, to test its impact? For instance, "50 jobs on 1 node" and "50 jobs on 5 nodes" may give us different results.

deallocate(data_para_vec)
data_para_vec => null()

elseif (trim(worker_job_task%what_to_do) .eq. 'PQMULT') then
Member


Similar to the DataResp job - principally this job has nothing to do with the "workers" in the fine-grained parallel mode (it only involves the "leaders"). But we should double-check if there are some edge cases...

@MiCurry
Member Author

MiCurry commented Mar 20, 2026

@dong-hao thanks for the review! I'll respond to the comments you made above, but first some larger discussions:

I was using 4 processes on my mac-mini for 20 jobs in this case. Apparently, as @MiCurry pointed out, size_leader -1 should be used (instead of size_leader) when the master sends out the DataResp and PQmult jobs for the leaders/workers. The adjoint calculation runs without problem after this correction.

Woops! Looks like that change didn't get pushed up fully 🤦🏻 . It should be good now!

If we want to unify the MPI related parameters with the datatype MODEM_MPI_CONTEXT_T here, we probably also want to remove the scattered comm_world, comm_local, etc. in the declaration.

We also need to include the comm_leader related handles in the data type.
We can leave that to another PR, though.

Even before your review, I had been considering removing the ModEM_mpi_context_t. It was one of the first things I came up with when implementing this feature, and after working with the code for a while I don't really think it does what I intended it to do; it just makes things a bit more confusing.

I will remove it and perhaps we can consider a change to simplify the communicators in the future. We could have some helper subroutines that return a communicator based on the para_method and the rank. All for a future change.

Also - we need to follow the current logic of the job system for the new jobs (DataResp and PQmult). That said, we may want to re-consider this framework - a more consistent and flexible job system is preferable. In the current logic there are quite a few compromises here and there.

Are you suggesting here that I utilize Master_job_Distribute_Tasks instead of having DataResp and PQMult create their own job systems? I would be willing to consider that.

For reference, the DataResp routine for passing jobs was copied from the work done in the Stable_NCI branch. I really don't like the logic of that routine, but I have kept it instead of rewriting it.

The PQMult had the exact same logic, but running with different number of tasks would give slightly different results. This was because a worker task would be given a range of transmitters to compute PQMult on. It would then linearly combine the result locally and then send back the combined modelParam to the main task. The main task would then combine these sub-combinations.

This would give different results due to floating-point associative inaccuracy ( with floating point arithmetic: (A + B) + C != A + (B + C)). Thus, in PQMult I created the new job system so the main task collects modelParams and then combines the model param in the same order.
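
The non-associativity described here is easy to demonstrate. A minimal Python sketch (the values are illustrative, not taken from ModEM):

```python
# Floating-point addition is not associative: (A + B) + C != A + (B + C).
# This is why combining per-worker partial sums in arrival order can give
# run-to-run differences, while summing in a fixed order is reproducible.
a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c
right = a + (b + c)

print(left == right)  # False: the two groupings differ in the last bits
```

Summing the contributions in a fixed (transmitter) order on the main task, as described above, removes this source of nondeterminism.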

I could probably roll these into Master_job_Distribute_Tasks. I also created the new job system in PQMult because Master_job_Distribute_Tasks makes my head hurt a bit 😅 (I really dislike how Distribute_Tasks has two sends in it). I would hate to make Distribute_Tasks longer and more complicated, but I could try to find an elegant way to fit DataResp and PQMult into Distribute_Tasks if you think that is a better idea.

Apologies about the long response! Thanks again for the review.

MiCurry added 4 commits March 20, 2026 15:38
This commit responds to one of Hao's comments on the PR. It moves the DataResp
logic into three new functions. These are:

* Master_job_DataResp - Interface function that decides what function to call
below if ESolnManager % EsMgr_save_in_file is true or false.
* Master_job_DataResp_main_only - The original way to calculate dataresp.
Only the main task calculates it.
* Master_job_DataResp_IO - The workers compute the dataresp for a range of
transmitters and read the electric fields from files.
This commit separates out the PQMult logic into three new functions:

* Master_job_PQMult - Interface function that decides what function to call if
  ESoln_save_in_file is either .true. or .false.
* Master_job_PQMult_main_only - The original way that PQMult was calculated. It
  only calculates on the main task.
* Master_job_PQMult_IO - The workers compute PQMult by reading in the electric
  fields and the adjoint fields.
@dong-hao
Member

Sorry @MiCurry - I didn't see this comment yesterday...

@dong-hao thanks for the review! I'll respond to your comments you made above, but some larger discussions:

I was using 4 processes on my mac-mini for 20 jobs in this case. Apparently, as @MiCurry pointed out, size_leader -1 should be used (instead of size_leader) when the master sends out the DataResp and PQmult jobs for the leaders/workers. The adjoint calculation runs without problem after this correction.

Woops! Looks like that change didn't get pushed up fully 🤦🏻 . It should be good now!

Thanks for the update!

If we want to unify the MPI related parameters with the datatype MODEM_MPI_CONTEXT_T here, we probably also want to remove the scattered comm_world, comm_local, etc. in the declaration.
We also need to include the comm_leader related handles in the data type.
We can leave that to another PR, though.

Even before your review, I had been considering removing the ModEM_mpi_context_t. It was one of the first things I came up with when implementing this feature, and after working with the code for a while I don't really think it does what I intended it to do; it just makes things a bit more confusing.

I will remove it and perhaps we can consider a change to simplify the communicators in the future. We could have some helper subroutines that returns a communicator based on the para_method and the rank. All for a future change.

I think it's fine to have this context - as I mentioned, we just don't want two systems mixing together. But we do need to consider simplifying the communicators - a helper function or subroutine would be nice so that we can save probably dozens of if/else pairs throughout the MPI module.

Also - we need to follow the current logic of the job system for the new jobs (DataResp and PQmult). That said, we may want to re-consider this framework - a more consistent and flexible job system is preferable. In the current logic there are quite a few compromises here and there.

Are you suggesting here that I utilize Master_job_Distribute_Tasks instead of having DataResp and PQMult create their own job systems? I would be willing to consider that.

In the original logic by Naser, the worker is always waiting for some task, from someone. So instead of a "job queue" found in many other projects, we are essentially using master_job_distribute_tasks to manage the jobs (who gets a job, who will be the next, etc.). So it is preferable to keep this job system (at least for now).

But I agree that we need to think about a better job management system. The current distribute_tasks logic wasn't really designed for the current two-layered parallelization. You probably have seen many ad-hoc communications that I fiddled here and there to manage the "leader-worker" jobs.

This would be quite an upheaval, especially considering that PQmult is not a "single task": a worker may combine a number of Txs before reporting back to the Master. A possible compromise is to have the master send the jobs one-by-one and accumulate the results whenever a worker reports back? I hope we can find an elegant way to manage them (at some stage). But if it is too complicated, we may want to leave that to another PR.

For reference, the DataResp routine for passing jobs was copied from the work done in the Stable_NCI branch. I really don't like the logic of that routine, but I have kept it instead of rewriting it.

The PQMult had the exact same logic, but running with different number of tasks would give slightly different results. This was because a worker task would be given a range of transmitters to compute PQMult on. It would then linearly combine the result locally and then send back the combined modelParam to the main task. The main task would then combine these sub-combinations.

This would give different results due to floating-point associative inaccuracy ( with floating point arithmetic: (A + B) + C != A + (B + C)). Thus, in PQMult I created the new job system so the main task collects modelParams and then combines the model param in the same order.

You are right - this is also the same reason why the GPU code gets (slightly) different results. As GPU kernel operations are intrinsically parallel across threads, the order of their floating-point operations is not guaranteed to be the same each run.

I could probably roll these into Master_job_Distribute_Tasks. I also created the new job system in PQMult because Master_job_Distribute_Tasks makes my head hurt a bit 😅 (I really dislike how Distribute_Tasks has two sends in it). I would hate to make Distribute_Tasks longer and more complicated, but I could try to find an elegant way to fit DataResp and PQMult into Distribute_Tasks if you think that is a better idea.

Apologies about the long response! Thanks again for the review.

@MiCurry
Member Author

MiCurry commented Mar 25, 2026

I do have a collection of runs with timing. It does appear that run-times are slower when running with -P. With the Cascadia case it is about 10 mins slower when running with SP2 with 21 cpus.

We generally save about 100 MB for the Cascadia case.

SP2 Runs - Cascadia Case

| nodes | ncpu | cpus/node | -P (On/Off) | Time | Total Memory (MB), iter = 1 | Iterations |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 21 | 21 | ON | 01:06:31 | 14,571 | 140 |
| 1 | 21 | 21 | OFF | 00:55:28 | 14,809 | 140 |
| 2 | 21 | 10,11 | ON | 00:56:44 | 14,842 | 140 |
| 2 | 21 | 10,11 | OFF | 00:48:45 | 15,080 | 140 |
| 3 | 21 | 7 | ON | 00:56:14 | 14,840 | 140 |
| 3 | 21 | 7 | OFF | 00:47:16 | 15,080 | 140 |

SP2-FG Runs - Cascadia Case

Here are some SP2-FG runs. I have noticed though that they all finish at a different number of iterations.

| nodes | ncpu | cpus/node | -P (On/Off) | Time | Total Memory (MB), iter = 1 | Iterations |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 64 | 64 | ON | 00:57:05 | 15,584 | 131 |
| 1 | 64 | 64 | OFF | 00:50:07 | 15,700 | 128 |
| 2 | 64 | 32 | ON | 00:24:22 | 16,534 | 96 |
| 2 | 64 | 32 | OFF | 00:32:37 | 16,621 | 150 |
| 4 | 64 | 16 | ON | 00:26:06 | 16,524 | 135 |
| 4 | 64 | 16 | OFF | 00:17:24 | 16,545 | 100 |
| 6 | 64 | 10 | ON | 00:30:05 | 16,640 | 134 |
| 6 | 64 | 10 | OFF | 00:28:07 | 16,662 | 139 |

@dong-hao does the SP2-FG produce different results when running with different numbers of tasks? I am noticing that the model finishes at different iterations and at different RMS. I see the same behavior on the main branch without my changes. Is this consistent with your expectations?

I did notice in the FG commit you mention:

the block diagonal preconditioner detoriorates very quickly with the number of sub-blocks.

Is that what is occurring when running with different tasks?

@dong-hao
Member

Thanks @MiCurry - that's very valuable information!

From the SP2 runs it is quite apparent that the disk I/O is taking its toll, for a relatively small problem. The performance difference is similar for 1 node (11:03), 2 nodes (7:59) and 3 nodes (8:58). I guess modern inter-node network bandwidth should be at least comparable to that of the disk I/O. Still, as the soln files for Cascadia are quite small, the memory savings of the write-to-disk approach are not that apparent. I would expect larger memory savings (and performance differences) for larger problems, though.

@dong-hao does the SP2-FG produce different results when running with different number of tasks? I am noticing that the model finishes at different iterations and at different RMS. I even got the same results on the main branch without my changes. Is this consistent with your expectations?

I did notice in the FG commit you mention:

the block diagonal preconditioner detoriorates very quickly with the number of sub-blocks.

Is that what is occurring when running with different tasks?

For the SP2FG, it is expected that different parallel setups can yield different (but similar) results. This is due to the way we solve the triangular preconditioner matrices (L/U). Cutting a triangular matrix into a few blocks may involve cutting off the "links" between those blocks (i.e. topology that extends across the blocks), which decreases the efficiency of the preconditioner. Cutting those matrices in different ways (3 parts, 4 parts or 5 parts, etc.) will give you slightly different results. And you are right, the preconditioner deteriorates very quickly with the number of sub-blocks - which is due to the cut-off of topology between blocks.

For the 64 cpus you are using, it may not be an optimal setup, as we have 20 transmitters. So in theory it will be 1 master, 17 groups of 3 workers and 3 groups of 4 workers on that node. Apparently the 64-cpu-setup is not much faster than 21-cpu-setup as the memory bandwidth should have long been saturated with that many cpus. To make that worse, I have a (bizarre) grouping strategy to change the group sizes according to the node topology to avoid splitting blocks from one transmitter across different nodes.

In that case, a one-node setup would be:
1 master + 17 groups of three + 3 groups of four = 1 + 51 + 12 = 64
a two-node setup would look like:
node1: 1 master + 9 groups of three + 1 groups of four = 1 + 27 + 4 = 32
node2: 8 groups of three + 2 groups of four = 30 + 2 = 32
and a four-node setup would be:
node1: 1 master + 5 groups of three = 1 + 15 = 16
node2: 4 groups of three + 1 group of four = 12 + 4 = 16
node3: 4 groups of three + 1 group of four = 12 + 4 = 16
node4: 4 groups of three + 1 group of four = 12 + 4 = 16
...

So you will find quite different grouping strategies for different node setup! And naturally those groups get assigned to different transmitters, and get different results. For example, the first group of four workers will need to calculate the 18th transmitter on the 1-node-setup. They will get to calculate the 10th transmitter in the 2-node setup...
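
The worker-grouping arithmetic above can be sketched in a few lines. This is an illustrative reconstruction (the function name and interface are hypothetical), and it ignores the node-topology adjustments described above:

```python
def group_workers(ncpu, n_tx):
    """Split the non-master ranks into one worker group per transmitter,
    making group sizes as even as possible."""
    workers = ncpu - 1            # rank 0 is the master
    base = workers // n_tx        # minimum group size
    larger = workers % n_tx       # this many groups get one extra worker
    return [base + 1] * larger + [base] * (n_tx - larger)

groups = group_workers(64, 20)
# 1 master + 17 groups of three + 3 groups of four = 1 + 51 + 12 = 64
print(groups.count(3), groups.count(4))
```

For 64 cpus and 20 transmitters this reproduces the one-node layout described above; the multi-node layouts additionally constrain each group to a single node.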

On the other hand, the sequence of the accumulation operations among those blocks becomes somewhat randomized in parallel, i.e. you don't know which block will come first and which will come last. That also leads to slightly different results in each iteration - and the accumulation of these small differences may eventually lead to different outcomes.

That said - I think the number of iterations does look quite different - have you checked if the results (inversion models) look consistent?

@MiCurry
Member Author

MiCurry commented Mar 26, 2026

Yes, the savings aren't too huge. It's the size of the eAll SolnVectorMTX on the master task. I have actually created a little Jupyter Notebook that calculates the total size: https://gist.github.com/MiCurry/caa73aca50a681bbf28eab30a534780f

# Cascadia Case
Cascadia Case - Nx: 48 - Ny: 46 - Nz: 34 - nTx: 20 - nPol: 2
- solnVector Size (MB):              7.21 MB
- cVector Size (MB):                 3.60 MB
Total Savings (MB):                144.14 MB
# USMT Array - 0.25x0.25  - 136x284x46 
# nTransmitters: 2
- solnVector Size (MB):            166.21 MB
- cVector Size (MB):                83.11 MB
Total Savings (MB):                332.43 MB

# USMT Array - 0.25x0.25 - 136x284x46 
# nTransmitters: 15
- solnVector Size (MB):            166.21 MB
- cVector Size (MB):                83.11 MB
Total Savings (MB):              2,493.19 MB
# USMT Array Case 0.1x0.1 - 304x674x47 
# nTransmitters: 2
- solnVector Size (MB):            897.01 MB
- cVector Size (MB):               448.50 MB
Total Savings (MB):              1,794.01 MB

# USMT Array Case 0.1x0.1 - 304x674x47 
# nTransmitters: 15
- solnVector Size (MB):            897.01 MB
- cVector Size (MB):               448.50 MB
Total Savings (MB):             13,455.09 MB
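
As a rough cross-check of these numbers, the solnVector footprint can be approximated by counting ~3·Nx·Ny·Nz complex double-precision edge values per polarization. This is a simplified sketch of the linked gist, not the exact formula (actual edge counts differ slightly on some grids):

```python
def soln_vector_mb(nx, ny, nz, n_pol=2, bytes_per_value=16):
    """Rough solnVector size: ~3*Nx*Ny*Nz complex(8) edge values per polarization."""
    edges = 3 * nx * ny * nz
    return edges * n_pol * bytes_per_value / 1e6  # decimal MB

cascadia = soln_vector_mb(48, 46, 34)           # Nx, Ny, Nz for the Cascadia grid
print(f"solnVector:   {cascadia:6.2f} MB")      # ~7.21 MB
print(f"eAll (20 Tx): {20 * cascadia:6.2f} MB") # ~144.14 MB
```

For the Cascadia grid this matches the notebook output above; the total savings scale with the number of transmitters held in eAll.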

2 GB, 1.7 GB and 13 GB are some nice numbers for memory savings, but for the Cascadia case it's only about ~1.6% total memory savings for ~20% slower results.

I am actually surprised that the time impact isn't worse, but I suppose 20% is quite a lot and will be worse for larger models.

Not sure exactly what to think of all this at the moment. I suppose the memory savings are nice, but there might be more significant memory savings elsewhere.

I'll respond more soon!

@dong-hao
Member

Yes, the savings aren't too huge. It's the size of the eAll SolnVectorMTX on the master task. I have actually created a little Jupyter Notebook that calculates the total size: https://gist.github.com/MiCurry/caa73aca50a681bbf28eab30a534780f

Thanks @MiCurry - that will be very useful!

I am actually surprised that the time impact isn't worse, but I suppose 20% is quite a lot and will be worse for larger models.

Not sure exactly what to think of all this at the moment. I suppose the memory savings are nice, but there might be more significant memory savings elsewhere.

Even in its current state, the -P mode gives us an important reference, and a choice. This is actually similar to the MUMPS project's "out of core memory" setup (which saves the off-process matrices on disk, with a significant speed impact). Actually I think the NCI people were originally inspired by the OOM idea...

We just need to let the users know about the trade-offs and let them decide which to use by themselves (stick with the eAll approach if they have large enough memory, etc).

On the other hand, the -P mode will really shine when using ModEM for controlled-source methods or ZTEM (which I believe the NCI people did!). For MT, we will probably only deal with tens of transmitters, at most. This number can easily go beyond 1,000 (!) for CSEM. In that case, eAll will be too expensive to store on a single node.

So let's think of it this way: it's an investment for the future ; )

This commit updates ModEM to allow labels to be passed to the esoln manager
from either NLCG or LBFGS.

First, this commit updates the ESolnManager so that EsMgr_save and EsMgr_get
take in a label, a character string identifying the current line-search
evaluation, and a field type, to label either JmultT or electric field (FWD)
solnVectors.

The worker_job_task is updated to pass a new variable called label, which can
be used by worker tasks to correctly label their calculations (specifically
with the EsolnManager).

The Main_MPI routines and the INVcore routines are updated to pass through
labels, and the line searches of both LBFGS and NLCG have been updated to use
a label when calling either func or gradient.

The labels represent the location in the line search, e.g. the trial, the
minimum, etc. In this commit I have labeled these as F_1, for the trial, and F,
for the non-trial. Depending on the line search, we label our forward
calculations as F_1 or F and then use these in the gradient calculation later
on.

These labels allow the ESolnManager to write out its fields accordingly.
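
For illustration, a naming scheme along these lines might combine the -P prefix, the field type, the line-search label, and the transmitter/polarization. This Python sketch is purely hypothetical; the actual construct_esoln_fname in the PR may compose names differently:

```python
def esoln_fname(prefix, tx, pol_name, field_type="FWD", label="F"):
    """Hypothetical sketch: one file per (transmitter, polarization) pair,
    distinguished by field type (FWD vs JmultT) and line-search label (F, F_1)."""
    return f"{prefix}_{field_type}_{label}_tx{tx:03d}_{pol_name}.soln"

print(esoln_fname("esoln", 5, "Ey", field_type="JmultT", label="F_1"))
# -> esoln_JmultT_F_1_tx005_Ey.soln
```

The key point is that the label disambiguates the several forward/adjoint evaluations within one line-search iteration, so a worker can read back exactly the field that matches its gradient call.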
@MiCurry
Member Author

MiCurry commented Mar 31, 2026

@dong-hao I have just pushed up a commit, eba6a9c, which adds the ability for the line searches to label the forward calculations and gradients like you suggested.

The ESolnManager now requires an electric field type (normally either JmultT or FWD) so the code can tell the difference between the two, and a label, which we use to identify line search calculations.

The gradient is always labeled the same as the penalty function calculation that was used to calculate it. This is mostly out of convenience: rather than passing around two labels, we just pass the one. The label we pass in the gradient calculation also tells the worker task which electric field it should read for the JmultT calculation.

I have also updated all the line searches in both NLCG and LBFGS and they all generate the same results with -P or without -P. It appears the NLCG Wolfe does not currently work? I am getting NaNs; I will create an Issue for it.

@dong-hao do all the labels look good to you? They should be easy to update.

@dong-hao
Member

dong-hao commented Apr 1, 2026

Thanks @MiCurry for the update! I will try to test different inversion algorithms and report back.


Sounds like a good setup to me - As for the NLCG Wolfe - have you tested that without the -P option? As I haven't really used the Wolfe with NLCG (probably for 5 years), I suspect there was a problem even before this implementation. Need to recall what I have done, before taking a look at that again...

@MiCurry
Member Author

MiCurry commented Apr 1, 2026

@copilot please summarize the conversation in this PR! Thank you!

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces an optional “out-of-core” workflow for ModEM MPI runs where forward and adjoint E-field solutions can be persisted to disk (instead of being transferred and stored on the master), and adds plumbing to support reading those fields back during inversion steps.

Changes:

  • Add -P <prefix> CLI option and corresponding control flags to enable saving E-field/adjoint solutions to disk.
  • Introduce ESolnManager plus “placeholder” solution vectors/cvectors to avoid allocating full E-field storage on the master when file-backed mode is enabled.
  • Thread an optional label through forward/gradient/JmultT paths and line searches, and add MPI worker jobs for DATARESP and PQMULT in file-backed mode.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
f90/UserCtrl.f90 Adds -P parsing and help text for file-backed E-field storage.
f90/MPI/Main_MPI.f90 Integrates ESolnManager into forward/JmultT flows; adds DATARESP/PQMULT worker jobs and label propagation.
f90/MPI/Declaration_MPI.f90 Extends worker job payload with a label and adds modem_ctx MPI context struct.
f90/Mod3DMT.f90 Initializes ESolnManager after grid setup using modem_ctx and user control flags.
f90/INV/NLCG.f90 Adds labeled calls into func/gradient during line search.
f90/INV/LBFGS.f90 Adds labeled calls into func/gradient during line search.
f90/INV/INVcore.f90 Adds label to func/gradient and forwards it into MPI forward/JmultT.
f90/FIELDS/FiniteDiff3D/sg_vector.f90 Adds cvector placeholder support and more robust i/o error handling.
f90/3D_MT/Sub_MPI.f90 Adds get_nPol() helper for per-transmitter polarization count.
f90/3D_MT/SolnSpace.f90 Adds solnVector placeholder support and solnVector file read/write helpers.
f90/3D_MT/ESolnManager.f90 New module implementing MPI vs file-backed E-field transfer/save logic.
f90/3D_MT/CMakeLists.txt Adds ESolnManager to the build.


Comment on lines +173 to +196
k=1
nOptArgs = 0
search_arg: &
do
k=k+1
if (k .eq. narg) exit search_arg
call getarg(k, arg)

if(arg(1:1).eq.'-') then
! Set `optArgSize` to the number of argument parts: e.g. `-P [file_prefix]`
! has an optArgSize == 2, one part for the `-P` and one part for `[file_prefix]`.
if (arg(2:2).eq.'v') then
call getarg(k+1, verbose)
optArgSize = 2
end if

if (arg(2:2) .eq. 'P') then
call getarg(k+1, ctrl % prefix)
ctrl % storeSolnsInFile = .true.
optArgSize = 2
end if

nOptArgs = nOptArgs + optArgSize
end if
end do search_arg
Copilot AI Apr 1, 2026

In the optional-arg scan loop, optArgSize is only assigned for -v and -P. If any other -<flag> is encountered, nOptArgs = nOptArgs + optArgSize uses an uninitialized value and can corrupt narg.

Initialize optArgSize each iteration (e.g., to 0/1) and only increment nOptArgs when an option is recognized; also validate that k+1 <= iargc() before calling getarg(k+1, ...) for options that require a value.


call Unpack_e_para_vec(eAll_out%solns(which_per))
modem_ctx % comm_current = comm_current
call EsMgr_get(eAll_out % solns(which_per), which_pol, 1, from=who)
Copilot AI Apr 1, 2026

EsMgr_get arguments are swapped here: the 2nd positional argument is iTx, but which_pol is being passed. This will set eAll_out%solns(which_per)%tx incorrectly and can read/unpack the wrong solution.

Pass iTx=which_per and pol_index=which_pol (use keywords to avoid future mixups).

Suggested change
call EsMgr_get(eAll_out % solns(which_per), which_pol, 1, from=who)
call EsMgr_get(eAll_out % solns(which_per), which_per, 1, pol_index=which_pol, from=who)

call zero_solnvector(e0)
call zero_solnvector(e)

call EsMgr_create_e(e, per_index)
Copilot AI Apr 1, 2026

EsMgr_create_e(e, per_index) is called before per_index is assigned in this branch (the loop assigns per_index later). This can create e with the wrong transmitter metadata (e.g., wrong nPol/pol names), which will break subsequent file reads.

Create/resize e inside the do per_index = start_iTx, end_iTx loop (or call EsMgr_create_e with start_iTx just before the loop if you reuse the same allocation).

Suggested change
call EsMgr_create_e(e, per_index)
call EsMgr_create_e(e, start_iTx)

Comment on lines +2418 to +2419
call create_solnVector(grid, 1, e0)
call create_solnVector(grid, 1, e)
Copilot AI Apr 1, 2026

These create_solnVector calls hard-code iTx=1, but this PQMULT worker job processes per_index. If txDict(per_index)%nPol differs from txDict(1)%nPol, e0/e will be created with the wrong polarization count/names and subsequent EsMgr_get + PmultT/QmultT will be inconsistent.

Create the solnVectors with iTx=per_index (or use EsMgr_create_e) so the allocation matches the transmitter being processed.

Suggested change
call create_solnVector(grid, 1, e0)
call create_solnVector(grid, 1, e)
call create_solnVector(grid, per_index, e0)
call create_solnVector(grid, per_index, e)

Comment on lines +66 to +67
integer :: esmgr_holder_bytes
logical :: esmgr_holder_allocated
Copilot AI Apr 1, 2026

esmgr_holder_allocated is never initialized. Since module variables are undefined by default, int_create_e_param_place_holder may incorrectly early-return, leaving esmgr_holder/esmgr_holder_bytes unset and causing MPI pack/send/recv to use invalid buffers.

Initialize esmgr_holder_allocated = .false. (and esmgr_holder_bytes = 0) at declaration time, and consider resetting/freeing the buffer when grid/polarization sizes can change.

Suggested change
integer :: esmgr_holder_bytes
logical :: esmgr_holder_allocated
integer :: esmgr_holder_bytes = 0
logical :: esmgr_holder_allocated = .false.

Comment on lines +462 to +468
fname = construct_esoln_fname(prefix, e % tx, trim(e % pol_name(pol_index_lcl)))

if (.not. does_esoln_file_exist(e, prefix)) then
write(0,*) "ERROR: The file for this solnVector_t (argument e) does not exist"
write(0,*) "ERROR: Could not find: ", trim(fname)
call ModEM_abort()
end if
Copilot AI Apr 1, 2026

When reading a specific polarization, the existence check ignores pol_index_lcl (defaults to 1). This can incorrectly abort when pol 1 is missing but the requested pol exists, or pass when pol 1 exists but the requested pol file is missing.

Pass pol_index=pol_index_lcl into does_esoln_file_exist so the check matches fname.

Comment on lines +820 to +821
write(0,*) " USE STARTING UGESS"
stop
Copilot AI Apr 1, 2026

This branch unconditionally prints and calls stop when a < 0 (negative curvature). That will terminate the inversion in legitimate scenarios where the quadratic fit has no minimum.

Remove the stop and keep the original fallback behavior (take the starting guess and proceed to gradient computation / exit line search).

Suggested change
write(0,*) " USE STARTING UGESS"
stop

integer :: pol_index_lcl

if (.not. e % allocated) then
write(0,*) "ERROR: solnVector_t (argument e) msut be allocated before calling write_solnVector"
Copilot AI Apr 1, 2026

Typo in error message: "msut" should be "must".

Suggested change
write(0,*) "ERROR: solnVector_t (argument e) msut be allocated before calling write_solnVector"
write(0,*) "ERROR: solnVector_t (argument e) must be allocated before calling write_solnVector"

Comment on lines +938 to +942
else if (istat > 0) then
write(0, '(A, A, A)') "ERROR: Erorr when reading E % x: ", trim(fname), ' aborting!'
write(0, '(A, i4.4, A, A)') 'ERROR: Error read iostat code:', istat, ' Message: ', trim(iomsg)
call ModEM_abort()
end if
Copilot AI Apr 1, 2026

Typo in error message: "Erorr" should be "Error".

call CdInvMult(res)
#ifdef MPI
call Master_job_JmultT(m,res,JTd,eAll)
call Master_job_JmultT(m,res,JTd,eAll,label=label)
Copilot AI Apr 1, 2026

label is an optional dummy argument, but it’s being passed through unconditionally as label=label here. There are multiple call sites that invoke gradient(...) without label (e.g., NLCG/LBFGS), which makes it illegal to reference/pass label and can lead to runtime failure.

Use label=label_lcl (since you already normalize it), or guard the call with if (present(label)) ... else ....

Suggested change
call Master_job_JmultT(m,res,JTd,eAll,label=label)
call Master_job_JmultT(m,res,JTd,eAll,label=label_lcl)
