Working with the USMT Array data files, I have noticed that the read_Z_list function in DataIO is extremely slow. The USMT Array datafiles I have are:
- 2 periods, 114,130 stations - Full Impedance
- 15 periods, 1616 stations - Full Impedance
Upon inspecting the code, it appears that in read_Z_list the full data read is O(N^2)!
Issue Breakdown
Code Breakdown
Some of the loops are a bit hidden, but here is the breakdown:
https://github.com/MiCurry/ModEM/blob/cfc234e2c22290d95a222090b22fc281cddbc8e3/f90/3D_MT/DataIO.f90#L610
The READ_DATA_TYPE loop only runs once per data type, so thankfully its iterations can be ignored.
https://github.com/MiCurry/ModEM/blob/cfc234e2c22290d95a222090b22fc281cddbc8e3/f90/3D_MT/DataIO.f90#L698
The inner loop, the READ_DATA_LINE loop, runs nPeriods * nStations * nComponents times, which is to be expected. This equals the total number of lines in a single data type file; we can consider this to be N. In the above 2-period case this is 913,040 data entries.
This loop is sneaky though, and actually performs 2 * N reads. Why? The usage of backspace in the case(Full_Impedance, Off_Diagonal_Impedance, Full_Vertical_Components) branch: we read the line once to count the line length for the azimuth determination, call backspace, and then read the same line again into the data components!
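Schematically, the pattern is the one below (placeholder variable names, not the exact ModEM code):

```fortran
! 1st read: grab the raw line so its length / column count can be checked
! for the azimuth determination.
read(ioDat, '(a)') tempLine
! Step the file position back one record.
backspace(ioDat)
! 2nd read of the very same line, this time into the data fields.
read(ioDat, *) period, code, lat, lon, xyz(1), xyz(2), xyz(3), &
               comp, zRe, zIm, zErr
```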
https://github.com/MiCurry/ModEM/blob/cfc234e2c22290d95a222090b22fc281cddbc8e3/f90/3D_MT/DataIO.f90#L885
Inside update_TxDict we loop through the total number of periods, nPeriods, to determine where the period should go.
https://github.com/MiCurry/ModEM/blob/cfc234e2c22290d95a222090b22fc281cddbc8e3/f90/3D_MT/DataIO.f90#L898
Inside update_rXDict each data entry loops through the list of stations twice: once to count the total number of stations, count_rx(), and then again from 1 to count_rx().
Thus, with update_TxDict and update_rXDict we get O(N^2).
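As a rough count (assuming every data line triggers both dictionary updates), the total work is on the order of N * (nPeriods + 2 * nStations), with N = nPeriods * nStations * nComponents. For the 2-period, 114,130-station file above that is roughly 913,040 * (2 + 2 * 114,130) ≈ 2 * 10^11 inner iterations, which is why the read behaves quadratically rather than linearly.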
Impact of Backspace
While backspace only adds 2 * N to the runtime, which we drop when calculating big O, it probably has a larger practical impact, since it doubles the number of reads per line. With such a large file, this may hammer the I/O and cause a slowdown.
Memory
The memory usage after calling this routine also appears to be quite large, perhaps larger than it needs to be. I have not yet done a breakdown, but we might be allocating more memory here than necessary. However, I could be wrong here.
The Fix
Hash Map for Station IDs
We can most likely use a hash map keyed on the station IDs. This gives us constant O(1) lookup, but requires an extra array of size nStations that points to the rx object.
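A minimal sketch of what such a table could look like in Fortran. None of this exists in ModEM yet; the module name, table size, key length, and lookup_or_insert interface are all assumptions for illustration:

```fortran
module station_hash
   implicit none
   private
   public :: lookup_or_insert

   ! Table size should comfortably exceed nStations to keep probing short.
   integer, parameter :: TABLE_SIZE = 262144

   type :: hash_entry
      character(len=32) :: key = ''      ! station ID / code
      integer           :: rx_index = 0  ! 0 means "slot empty"
   end type hash_entry

   type(hash_entry) :: table(TABLE_SIZE)

contains

   ! Simple string hash, reduced modulo the table size (1-based result).
   integer function hash_key(key) result(h)
      character(len=*), intent(in) :: key
      integer :: i
      h = 5381
      do i = 1, len_trim(key)
         h = mod(h * 33 + ichar(key(i:i)), TABLE_SIZE)
      end do
      h = h + 1
   end function hash_key

   ! Return the rx index for this station ID, inserting new_index if the
   ! station has not been seen before.  Collisions use linear probing.
   integer function lookup_or_insert(key, new_index) result(idx)
      character(len=*), intent(in) :: key
      integer,          intent(in) :: new_index
      integer :: slot
      slot = hash_key(key)
      do
         if (table(slot)%rx_index == 0) then
            table(slot)%key      = key         ! empty slot: insert new station
            table(slot)%rx_index = new_index
            idx = new_index
            return
         else if (trim(table(slot)%key) == trim(key)) then
            idx = table(slot)%rx_index         ! station already in the dictionary
            return
         end if
         slot = mod(slot, TABLE_SIZE) + 1      ! probe the next slot
      end do
   end function lookup_or_insert

end module station_hash
```

With something like this, update_rXDict would call lookup_or_insert once per data line instead of scanning the station list, and keep the extra nStations-sized array mentioned above to map the returned index to the rx object.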
Hash Map For periods
We could also use a hash map for the periods.
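The only wrinkle is that periods are reals, so they need a canonical key (or a tolerance-based comparison) before they can be hashed. A hedged sketch, reusing a second instance of the same kind of table as above (periodKey and nTxCurrent are placeholders):

```fortran
character(len=32) :: periodKey
! Format the period to a fixed number of significant digits so that values
! which should be treated as the same period produce identical keys.
write(periodKey, '(es20.12)') period
iTx = lookup_or_insert(periodKey, nTxCurrent + 1)
```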
Don't use Backspace
Rather than using backspace and re-reading, we can perform the first read into a character variable, and then change the second read, where we read each field (Period, Moment, x, y, z, etc.), into an internal read from that variable. Should be pretty easy.
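A minimal sketch of that change (the subroutine and variable names are placeholders, not the existing ModEM names, and the field list assumes a full-impedance line):

```fortran
! Sketch: read one data line with a single file read and no backspace.
subroutine read_data_line(ioDat, period, code, lat, lon, xyz, comp, zRe, zIm, zErr)
   integer,          intent(in)  :: ioDat
   real(8),          intent(out) :: period, lat, lon, xyz(3), zRe, zIm, zErr
   character(len=*), intent(out) :: code, comp
   character(len=512) :: lineBuf
   integer :: ios

   ! Single read from the file into a character buffer.
   read(ioDat, '(a)', iostat=ios) lineBuf
   ! ... inspect lineBuf here (e.g. count columns) for the azimuth determination ...
   ! Internal read from the buffer replaces the backspace + second file read.
   read(lineBuf, *) period, code, lat, lon, xyz(1), xyz(2), xyz(3), &
                    comp, zRe, zIm, zErr
end subroutine read_data_line
```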
Memory
Calculate how much memory we should actually be storing for nStations, nPeriods, and each specific data type, and check whether that is the case. ModEM should use more memory than the size of the actual file, since we use 8-byte reals and the data files don't use such precision.
Use NetCDF / HDF5
Another option, but a larger lift, would be switching to NetCDF/HDF5 and letting those libraries handle reading efficiently. However, we would need to write tooling to convert datafiles to NetCDF/HDF5 and, of course, implement a NetCDF reader. Perhaps looking more into using the HDF5 code created by Spencer would be worthwhile for this.