Working with the USMT Array data files, I have noticed that the read_Z_list function in DataIO is extremely slow. The USMT Array datafiles I have are:
- 2 periods, 114,130 stations - Full Impedance
- 15 periods, 1616 stations - Full Impedance
Upon inspecting the code, it appears that in read_Z_list the full data read is O(N^2)!
Issue Breakdown
Code Breakdown
Some of the loops are a bit hidden, but here is the breakdown:
https://github.com/MiCurry/ModEM/blob/cfc234e2c22290d95a222090b22fc281cddbc8e3/f90/3D_MT/DataIO.f90#L610
The READ_DATA_TYPE loop only runs once per data type, so thankfully its iterations can be ignored.
https://github.com/MiCurry/ModEM/blob/cfc234e2c22290d95a222090b22fc281cddbc8e3/f90/3D_MT/DataIO.f90#L698
The inner loop, the READ_DATA_LINE loop, runs nPeriods * nStations * nComponents times, which is to be expected. This equals the total number of lines in a single data type file; we can consider this to be N. In the above 2-period case this is 913,040 data entries.
This loop is sneaky though, and actually performs 2 * N reads. Why? The usage of backspace in the case(Full_Impedance, Off_Diagonal_Impedance, Full_Vertical_Components) branch: we read the line once to count the line length for the azimuth determination, call backspace, and then read the same line again into the data components!
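Schematically, the pattern is the one below (placeholder variable names, not the exact ModEM code):

```fortran
! 1st read: grab the raw line so its length / column count can be checked
! for the azimuth determination.
read(ioDat, '(a)') tempLine
! Step the file position back one record.
backspace(ioDat)
! 2nd read of the very same line, this time into the data fields.
read(ioDat, *) period, code, lat, lon, xyz(1), xyz(2), xyz(3), &
               comp, zRe, zIm, zErr
```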
https://github.com/MiCurry/ModEM/blob/cfc234e2c22290d95a222090b22fc281cddbc8e3/f90/3D_MT/DataIO.f90#L885
Inside update_TxDict we loop through the total number of periods, nPeriods, to determine where the period should go.
https://github.com/MiCurry/ModEM/blob/cfc234e2c22290d95a222090b22fc281cddbc8e3/f90/3D_MT/DataIO.f90#L898
Inside update_rXDict each data entry loops through the list of stations twice: once to count the total number of stations, count_rx(), and then again from 1 to count_rx().
Thus, with update_TxDict and update_rXDict we get O(N^2).
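As a rough count (assuming every data line triggers both dictionary updates), the total work is on the order of N * (nPeriods + 2 * nStations), with N = nPeriods * nStations * nComponents. For the 2-period, 114,130-station file above that is roughly 913,040 * (2 + 2 * 114,130) ≈ 2 * 10^11 inner iterations, which is why the read behaves quadratically rather than linearly.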
Impact of Backspace
While backspace only adds 2 * N to the runtime, which we drop when calculating big O, it probably has a larger practical impact, since it doubles the number of reads per line. With such a large file, this may hammer the I/O and cause a slowdown.
Memory
The memory usage after calling this routine also appears to be quite large, perhaps larger than it needs to be. I have not yet done a breakdown, but we might be allocating more memory here than necessary. However, I could be wrong here.
The Fix
Hash Map for Station IDs
We can most likely use a hash map keyed on the station IDs. This gives us constant O(1) lookup, but requires an extra array of size nStations that points to the rx object.
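A minimal sketch of what such a table could look like in Fortran. None of this exists in ModEM yet; the module name, table size, key length, and lookup_or_insert interface are all assumptions for illustration:

```fortran
module station_hash
   implicit none
   private
   public :: lookup_or_insert

   ! Table size should comfortably exceed nStations to keep probing short.
   integer, parameter :: TABLE_SIZE = 262144

   type :: hash_entry
      character(len=32) :: key = ''      ! station ID / code
      integer           :: rx_index = 0  ! 0 means "slot empty"
   end type hash_entry

   type(hash_entry) :: table(TABLE_SIZE)

contains

   ! Simple string hash, reduced modulo the table size (1-based result).
   integer function hash_key(key) result(h)
      character(len=*), intent(in) :: key
      integer :: i
      h = 5381
      do i = 1, len_trim(key)
         h = mod(h * 33 + ichar(key(i:i)), TABLE_SIZE)
      end do
      h = h + 1
   end function hash_key

   ! Return the rx index for this station ID, inserting new_index if the
   ! station has not been seen before.  Collisions use linear probing.
   integer function lookup_or_insert(key, new_index) result(idx)
      character(len=*), intent(in) :: key
      integer,          intent(in) :: new_index
      integer :: slot
      slot = hash_key(key)
      do
         if (table(slot)%rx_index == 0) then
            table(slot)%key      = key         ! empty slot: insert new station
            table(slot)%rx_index = new_index
            idx = new_index
            return
         else if (trim(table(slot)%key) == trim(key)) then
            idx = table(slot)%rx_index         ! station already in the dictionary
            return
         end if
         slot = mod(slot, TABLE_SIZE) + 1      ! probe the next slot
      end do
   end function lookup_or_insert

end module station_hash
```

With something like this, update_rXDict would call lookup_or_insert once per data line instead of scanning the station list, and keep the extra nStations-sized array mentioned above to map the returned index to the rx object.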
Hash Map For periods
We could also use a hash map for the periods.
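The only wrinkle is that periods are reals, so they need a canonical key (or a tolerance-based comparison) before they can be hashed. A hedged sketch, reusing a second instance of the same kind of table as above (periodKey and nTxCurrent are placeholders):

```fortran
character(len=32) :: periodKey
! Format the period to a fixed number of significant digits so that values
! which should be treated as the same period produce identical keys.
write(periodKey, '(es20.12)') period
iTx = lookup_or_insert(periodKey, nTxCurrent + 1)
```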
Don't use Backspace
Rather than using backspace and re-reading, we can perform the first read into a character variable, and then change the second read, where we read each field (Period, Moment, x, y, z, etc.), into an internal read from that variable. Should be pretty easy.
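A minimal sketch of that change (the subroutine and variable names are placeholders, not the existing ModEM names, and the field list assumes a full-impedance line):

```fortran
! Sketch: read one data line with a single file read and no backspace.
subroutine read_data_line(ioDat, period, code, lat, lon, xyz, comp, zRe, zIm, zErr)
   integer,          intent(in)  :: ioDat
   real(8),          intent(out) :: period, lat, lon, xyz(3), zRe, zIm, zErr
   character(len=*), intent(out) :: code, comp
   character(len=512) :: lineBuf
   integer :: ios

   ! Single read from the file into a character buffer.
   read(ioDat, '(a)', iostat=ios) lineBuf
   ! ... inspect lineBuf here (e.g. count columns) for the azimuth determination ...
   ! Internal read from the buffer replaces the backspace + second file read.
   read(lineBuf, *) period, code, lat, lon, xyz(1), xyz(2), xyz(3), &
                    comp, zRe, zIm, zErr
end subroutine read_data_line
```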
Memory
Calculate how much memory we should actually be storing for nStations, nPeriods, and each specific data type, and check whether that is the case. ModEM should use more memory than the size of the actual file, since we use 8-byte reals and the data files don't use such precision.
Use NetCDF / HDF5
Another option, but a larger lift, would be switching to NetCDF/HDF5 and letting those libraries handle reading efficiently. However, we would need to write tooling to convert datafiles to NetCDF/HDF5 and, of course, implement a NetCDF reader. Perhaps looking more into using the HDF5 code created by Spencer would be worthwhile for this.