@Roaimkhan
Description

This PR implements GNU Parallel-based unzipping for 200,000+ *.cif.gz files in the AlphaFold pipeline.

The Problem:

The current unzipping process is strictly serial, which is extremely slow for large datasets. This limits efficiency and delays downstream processing.

The Fix:

Added a check for GNU Parallel availability.

Automatically detects the number of CPU cores, leaving one core free for I/O-bound tasks and using the remaining cores for parallel unzipping.

Falls back to the existing serial method if GNU Parallel is not installed.

Updated README.md to reflect the new parallelization option and usage instructions.
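
The check-and-fallback logic listed above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function name decompress_gz is hypothetical, and the `--version` test is an assumption added here to distinguish GNU Parallel from other tools that install a `parallel` binary (the diff itself uses a plain `command -v parallel` check).

```shell
# Hypothetical helper; the name and the --version check are assumptions.
decompress_gz() {
  gz_dir="$1"
  if command -v parallel >/dev/null 2>&1 &&
     parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # "-j -1" asks GNU Parallel for (number of CPU cores - 1) jobs,
    # which leaves one core free; it never drops below one job.
    find "${gz_dir}/" -type f -iname "*.gz" -print0 | parallel -0 -j -1 gunzip
  else
    # Serial fallback: find passes many files to each gunzip call.
    find "${gz_dir}/" -type f -iname "*.gz" -exec gunzip {} +
  fi
}
```

Calling `decompress_gz "${RAW_DIR}"` would then take whichever path is available on the machine.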

Fixes: #1075

@Vincy1230 left a comment

I ran into the same issue: on some servers the initial download stage can take ten hours or more when only one CPU core is used.

Deeply grateful for your pull request.

@Vincy1230 left a comment

I noticed there is still room for optimization in your changes: running gunzip through parallel makes decompression concurrent, but it loses the ability of find ... {} + to pass as many files as possible to each invocation, so the new script pays the overhead of starting a separate gunzip process for every file.

How about also passing --xargs to parallel, so we get the best of both approaches?
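
To make the per-invocation overhead concrete, here is a small sketch that counts how often a command is started when find hands over files one at a time (`{} \;`) versus in batches (`{} +`). It uses only find, so it runs without GNU Parallel; the scratch directory and the five empty placeholder files are assumptions made up for the demonstration.

```shell
# Scratch directory with five empty placeholder files (illustrative only).
demo_dir="$(mktemp -d)"
for name in a b c d e; do
  : > "${demo_dir}/${name}.gz"
done

# "{} \;" starts the command once per file, so "run" is printed five times.
per_file=$(find "${demo_dir}/" -type f -iname "*.gz" \
  -exec sh -c 'echo run' _ {} \; | wc -l | tr -d '[:space:]')

# "{} +" packs as many file names as fit onto one command line,
# so five short names need a single invocation.
batched=$(find "${demo_dir}/" -type f -iname "*.gz" \
  -exec sh -c 'echo run' _ {} + | wc -l | tr -d '[:space:]')

echo "per-file invocations: ${per_file}"   # 5
echo "batched invocations: ${batched}"     # 1

rm -rf "${demo_dir}"
```

With 200,000+ files, the per-file variant means that many process starts, which is the cost the --xargs suggestion aims to avoid.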

find "${RAW_DIR}/" -type f -iname "*.gz" -exec gunzip {} +
if command -v parallel >/dev/null 2>&1
then
find "${RAW_DIR}/" -type f -iname "*.gz" -print0 | parallel -0 -j -1 gunzip

Suggested change
- find "${RAW_DIR}/" -type f -iname "*.gz" -print0 | parallel -0 -j -1 gunzip
+ find "${RAW_DIR}/" -type f -iname "*.gz" -print0 | parallel -0 -j -1 --xargs gunzip


Development

Successfully merging this pull request may close these issues.

parallelization opportunity
