
Parallel decompression for detector data #593

Merged
takluyver merged 15 commits into master from parallel-decompress-cleanup
Feb 27, 2025

Conversation

@takluyver
Member

Where detector data is stored compressed, HDF5 normally decompresses it in serial. This uses HDF5 to get the compressed chunk data, and then decompresses it in several worker threads. As I mentioned in Zulip, I experimented with a few ways of doing this, but the differences between them were minor.
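
For anyone following along, here's a minimal sketch of the idea (not the code in this PR) using plain h5py + zlib, assuming a deflate-compressed dataset whose chunk grid divides the dataset shape evenly:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

import h5py
import numpy as np


def parallel_read(dset: h5py.Dataset, n_threads: int = 16) -> np.ndarray:
    """Read a deflate-compressed dataset, inflating chunks in threads."""
    out = np.empty(dset.shape, dtype=dset.dtype)

    def load_chunk(chunk_sel):
        offset = tuple(s.start for s in chunk_sel)
        # read_direct_chunk() hands back the stored bytes with filters
        # still applied, i.e. the compressed chunk.
        _filter_mask, raw = dset.id.read_direct_chunk(offset)
        # zlib.decompress releases the GIL, so worker threads run in parallel
        arr = np.frombuffer(zlib.decompress(raw), dtype=dset.dtype)
        out[chunk_sel] = arr.reshape(dset.chunks)

    with ThreadPoolExecutor(n_threads) as pool:
        # iter_chunks() yields a tuple of slices for each stored chunk
        list(pool.map(load_chunk, dset.iter_chunks()))
    return out
```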

I found some compressed (photonised) AGIPD data in MID proposal 6578. This shows loading 1000 frames across 16 modules. The value for 1 thread is the existing code path, with HDF5 doing the decompression. The others are all the new code path.

[Plot: time to load 1000 frames across 16 modules vs. number of decompression threads]

Questions:

  1. I haven't yet implemented non-zero fill values for gaps. Is that worth doing, or shall we fall back to HDF5's code when we do that?
  2. Make zlib_into (tiny package I made for this) a regular dependency or an optional one?
  3. For now the parallel decompression is opt-in, by passing e.g. .ndarray(decompress_threads=16). Do we want to turn it on by default for compressed data? Or use some heuristics to determine when it's most likely to be useful?
  4. Filling the output array one frame at a time opens up the possibility of making an array shaped like (frames, modules, slow_scan, fast_scan), instead of (modules, frames, ...), i.e. putting the different modules for one frame together in memory, which the current reading code does not allow. You could do this by making a (frames, modules, ...) array and then rearranging the axes to pass it as (modules, frames, ...) in the out= parameter. Do we want to do something to make that easier?
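
To make question 4 concrete, here's a hypothetical sketch of the out= trick, assuming `agipd` is a multi-module detector component and using the decompress_threads keyword from question 3:

```python
import numpy as np

# Illustrative AGIPD-like shape; `agipd` is assumed to be a multi-module
# detector component from extra_data.components.
frames, modules, ss, fs = 1000, 16, 512, 128
arr = np.empty((frames, modules, ss, fs), dtype=np.float32)

# transpose() returns a view onto the same memory, so when the reader fills
# the (modules, frames, ...) view frame by frame, each frame's 16 modules
# end up contiguous in `arr`.
agipd['image.data'].ndarray(out=arr.transpose(1, 0, 2, 3), decompress_threads=16)
```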

I have some ideas about speeding this up further, and I'd also like to get back to parallel reading of uncompressed data (as explored in #340). But I wanted to make some concrete progress on this.

@takluyver
Member Author

takluyver commented Feb 7, 2025

The test failures are because the simplest way to write zlib_into for now only supports Python 3.11 and above. I can figure out how to extend support backwards if needed, but we might instead make it an optional dependency, or even bump our minimum Python version.

Here's a log-log plot (the same test setup as the plot above, but a different run on a different node):

[Log-log plot: same benchmark as above, from a separate run on a different node]

Member

@JamesWrigley left a comment


  1. I haven't yet implemented non-zero fill values for gaps. Is that worth doing, or shall we fall back to HDF5's code when we do that?

Which gaps do you mean?

  2. Make zlib_into (tiny package I made for this) a regular dependency or an optional one?

A regular one so that we can use it by default ⏩

  3. For now the parallel decompression is opt-in, by passing e.g. .ndarray(decompress_threads=16). Do we want to turn it on by default for compressed data? Or use some heuristics to determine when it's most likely to be useful?

Absolutely turn it on by default 😛 Which reminds me that we should advertise the multi-module keydata API since that's the only one getting the improvement.

  4. Filling the output array one frame at a time opens up the possibility of making an array shaped like (frames, modules, slow_scan, fast_scan), instead of (modules, frames, ...), i.e. putting the different modules for one frame together in memory, which the current reading code does not allow. You could do this by making a (frames, modules, ...) array and then rearranging the axes to pass it as (modules, frames, ...) in the out= parameter. Do we want to do something to make that easier?

Could be useful, but I'd say let's cross that bridge when we come to it.

@takluyver
Member Author

Thanks!

we should advertise the multi-module keydata API since that's the only one getting the improvement.

Good point, that's also a question - it wouldn't be particularly difficult to allow this for generic KeyData objects too. Or we could use it as a selling point for the components interface (and maybe add it to generic KeyData later on).

@takluyver
Member Author

Which gaps do you mean?

I actually just realised that filling the gaps already works with the new code, because we set the fill value when creating the output array and then overwrite it.

For completeness, there are two kinds of gaps this applies to.

  • If some modules are missing entirely and you pass module_gaps=True, space is left for the missing modules, so the array will still have length 16 (say) on the module dimension.
  • Where a train is recorded by some modules and not others, there are gaps for the missing modules by default. You can avoid this by setting min_modules=16 (or however many modules there are) when constructing the component.
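
For illustration, roughly how the two options look with the components API (the run number is a placeholder, and the exact keyword placement may differ):

```python
import numpy as np
from extra_data import open_run
from extra_data.components import AGIPD1M

run = open_run(proposal=6578, run=1)  # run number is a placeholder

# Option 1: only assemble trains where all 16 modules recorded data,
# so there are no per-train gaps to fill.
agipd = AGIPD1M(run, min_modules=16)

# Option 2: keep partial trains and leave space for missing modules, so the
# module axis always has length 16; gaps get the fill value.
agipd = AGIPD1M(run, min_modules=1)
data = agipd['image.data'].ndarray(module_gaps=True, fill_value=np.nan)
# (np.nan needs a float dtype; integer data needs an integer fill value)
```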

@JamesWrigley
Member

Good point, that's also a question - it wouldn't be particularly difficult to allow this for generic KeyData objects too. Or we could use it as a selling point for the components interface (and maybe add it to generic KeyData later on).

That would actually be very nice because for some experiments we do indeed just need a single AGIPD module. But I think it can be left for later.

@takluyver force-pushed the parallel-decompress-cleanup branch from 2034734 to 7014d6e on February 10, 2025
@takluyver
Member Author

I've enabled decompression with 16 threads by default for now. But I guess we should probably have some heuristics based on the number of available CPUs, and maybe also some way of imposing a limit externally, like an environment variable. 🤔
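
Something like this, perhaps (a hypothetical sketch; the environment variable name is made up for illustration, not an existing EXtra-data setting):

```python
import os


def default_decompress_threads(cap: int = 16) -> int:
    override = os.environ.get("EXTRA_DATA_DECOMPRESS_THREADS")  # hypothetical
    if override:
        return max(1, int(override))
    # Never spawn more workers than available CPUs, capped to avoid
    # oversubscription / diminishing returns on big nodes.
    return min(cap, os.cpu_count() or 1)
```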

@takluyver added the enhancement label on Feb 12, 2025
@takluyver force-pushed the parallel-decompress-cleanup branch from 7316368 to 2167dcd on February 27, 2025
Member

@JamesWrigley left a comment


LGTM!

@takluyver
Member Author

Thanks James! 😀

@takluyver merged commit 2f1123b into master on Feb 27, 2025
10 checks passed
@takluyver deleted the parallel-decompress-cleanup branch on February 27, 2025
@takluyver added this to the 1.21 milestone on Mar 4, 2025