
read_Z_list is extremely inefficient O(n^2), and extremely slow for large data files #48

@MiCurry

Description


Working with the USMT array data files, I have noticed that the read_Z_list function in DataIO is extremely slow. The USMT array data files I have are:

  • 2 periods, 114,130 stations - Full Impedance
  • 15 periods, 1616 stations - Full Impedance

Upon inspecting the code, it appears that in read_Z_list the full data read is O(N^2)!

Issue Breakdown

Code Breakdown

Some of the loops are a bit hidden, but here is the breakdown:

READ_DATA_TYPE: do

https://github.com/MiCurry/ModEM/blob/cfc234e2c22290d95a222090b22fc281cddbc8e3/f90/3D_MT/DataIO.f90#L610

The READ_DATA_TYPE loop only loops once per data type. So thankfully, its iterations can be ignored.

https://github.com/MiCurry/ModEM/blob/cfc234e2c22290d95a222090b22fc281cddbc8e3/f90/3D_MT/DataIO.f90#L698

The inner loop, the READ_DATA_LINE loop, runs nPeriods * nStations * nComponents times, which is to be expected. This equates to the total number of lines in a single data-type file. We can consider this to be N. In the 2-period case above, this is 913,040 data entries.

This loop is sneaky though, and actually performs 2 * N reads. Why? The usages of backspace in `case(Full_Impedance, Off_Diagonal_Impedance, Full_Vertical_Components)`. We read the line once to count its length for azimuth determination, call backspace, and then read the line again into the data components!
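As a quick sanity check on those counts (a Python sketch, not ModEM code; the component count of 4 assumes full impedance with Zxx, Zxy, Zyx, Zyy):

```python
# Line counts per data-type block, and the effective read count once the
# backspace-and-re-read pattern doubles it.
n_components = 4  # full impedance: Zxx, Zxy, Zyx, Zyy

for n_periods, n_stations in [(2, 114_130), (15, 1_616)]:
    n = n_periods * n_stations * n_components
    print(f"{n_periods} periods, {n_stations} stations: "
          f"N = {n:,} lines, {2 * n:,} reads with backspace")
```

The 2-period case gives N = 913,040, matching the entry count quoted above.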

https://github.com/MiCurry/ModEM/blob/cfc234e2c22290d95a222090b22fc281cddbc8e3/f90/3D_MT/DataIO.f90#L885

Inside update_TxDict, we loop through the total number of periods, nPeriods, to determine where the period should go.

https://github.com/MiCurry/ModEM/blob/cfc234e2c22290d95a222090b22fc281cddbc8e3/f90/3D_MT/DataIO.f90#L898

Inside update_rXDict, each data entry loops through the list of stations twice: once to count the total number of stations, count_rx(), and then again from 1 to count_rx().

Thus, with update_TxDict and update_rXDict called once per data line, each of the N lines performs a linear scan over the dictionaries, giving O(n^2) overall.
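The quadratic blow-up can be demonstrated with a small Python sketch that mimics the update_rXDict-style lookup (an illustration of the access pattern, not the actual ModEM code):

```python
# For each incoming entry, scan the list of known stations; the scan
# grows as the list grows, so total work is quadratic in N.
def count_scan_steps(entries):
    known = []
    steps = 0
    for e in entries:
        for k in known:        # linear scan, like looping 1..count_rx()
            steps += 1
            if k == e:
                break
        else:
            known.append(e)    # station not seen before: append it
    return steps

# All-unique entries: every lookup scans the whole list -> ~N^2/2 steps.
print(count_scan_steps(range(1000)))   # 499500 scan steps for N = 1000
```

At N = 913,040 unique-station lines, that pattern implies on the order of 10^11 comparisons, which is consistent with the slowdown observed.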

Impact of Backspace

While backspace only contributes a factor of 2 to N, which we drop when calculating big O, it probably has a larger practical impact because it doubles the number of reads. With such a large file, this may hammer IO and cause significant slowdown.

Memory

The memory footprint after calling this routine appears quite large as well, perhaps larger than it needs to be. I have not yet done a breakdown, but we might be allocating more memory here than necessary. However, I could be wrong about this.

The Fix

Hash Map for Station IDs

We can most likely use a hash map keyed on the station IDs. This gives us O(1) lookup, but requires an extra array of size nStations that points to the rx objects.

Hash Map For periods

We could also use a hash map for the periods, since update_TxDict performs the same kind of linear scan.
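Both fixes follow the same pattern, sketched below in Python (the `rx_index`/`tx_index` names are hypothetical, not ModEM identifiers; the Fortran version would need a hash-table implementation or a sorted-index equivalent):

```python
# Dicts keyed on station ID and period make each lookup O(1) on average,
# so the whole read becomes O(N) instead of O(N^2).
rx_index: dict[str, int] = {}    # station ID -> position in the rx list
tx_index: dict[float, int] = {}  # period     -> position in the tx list

def lookup_or_add(index, key):
    """Return the slot for key, appending a new slot if unseen."""
    if key not in index:         # O(1) average-case hash lookup
        index[key] = len(index)
    return index[key]

assert lookup_or_add(rx_index, "STA001") == 0
assert lookup_or_add(rx_index, "STA002") == 1
assert lookup_or_add(rx_index, "STA001") == 0  # repeat hits the same slot
```

The dict replaces the 1-to-count_rx() scan; the extra memory is one map entry per station (or period), which is small next to the data itself.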

Don't use Backspace

Rather than using backspace and re-reading, we can read each full line into a character buffer once, and then parse the individual fields (Period, Moment, x, y, z, etc.) from that buffer. Should be pretty easy.
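The idea, sketched in Python (in Fortran this would be an internal read from a character variable; the column layout below is illustrative, not the exact ModEM data-line format):

```python
# Read each line once into a string, then parse every field from that
# string -- no second pass over the file, no backspace.
line = "  4.65455E+01 STA001 45.2 -110.7 0.0 1200.0 ZXX 1.2e-3 -3.4e-4 1e-5"
fields = line.split()

period = float(fields[0])
station = fields[1]
# ... remaining fields (coordinates, component, real, imag, error) are
# parsed the same way, from the already-read buffer.
print(period, station)
```

The field count needed for azimuth determination can be taken from `len(fields)` (or the buffer length in the Fortran case) without touching the file again.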

Memory

Calculate how much memory we should actually be using for nStations, nPeriods, and each specific data type, and check whether that matches what we observe. ModEM's in-memory size should be larger than the file itself, since we store 8-byte reals while the data files don't carry that much precision.
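A back-of-envelope version of that calculation for the 2-period USMT file (a sketch with assumed storage layout: 4 complex components per line, each stored as real, imaginary, and error values of 8 bytes):

```python
# Rough expected in-memory size of the data values alone.
n_periods, n_stations, n_components = 2, 114_130, 4
bytes_per_value = 8
values_per_component = 3             # real, imag, error

n_bytes = (n_periods * n_stations * n_components
           * values_per_component * bytes_per_value)
print(f"{n_bytes / 1e6:.1f} MB")     # ~21.9 MB
```

Anything much beyond that scale (plus per-station/per-transmitter metadata) would suggest over-allocation.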

Use NetCDF / HDF5

Another option, though a larger lift, would be switching to NetCDF/HDF5 and letting those libraries handle reading efficiently. However, we would need to write tooling to convert data files to NetCDF/HDF5 and, of course, integrate a NetCDF library. Perhaps looking into the HDF5 code created by Spencer would be worthwhile for this.
