feat: Implementing parallelization for unzipping files #1114

Roaimkhan · 2026-01-25T16:21:16Z

Description

This PR implements GNU Parallel-based unzipping for 200,000+ *.cif.gz files in the AlphaFold pipeline.

The Problem:

The current unzipping step runs serially in a single gunzip process, so it uses only one CPU core no matter how many are available. With a dataset of this size, that makes the step very slow and delays all downstream processing.

The Fix:

Added a check for GNU Parallel availability.

Automatically detects the number of CPU cores, leaving one core free for I/O-bound tasks and using the remaining cores for parallel unzipping.

Falls back to the existing serial method if GNU Parallel is not installed.

Updated README.md to reflect the new parallelization option and usage instructions.
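The steps listed under "The Fix" can be sketched roughly as follows. This is a minimal illustration, not the PR's exact code: the function name gunzip_tree is hypothetical, and the extra --version check (to tell GNU Parallel apart from other tools that are also named parallel) is an assumption that does not appear in the diff.

```shell
# Hypothetical helper (name not taken from the PR): decompress every *.gz
# file under the given directory, in parallel when GNU Parallel is present.
gunzip_tree() {
  local dir="$1"
  # Assumption: besides checking that a `parallel` binary exists, confirm it
  # is GNU Parallel, since other tools ship a binary with the same name.
  if command -v parallel >/dev/null 2>&1 \
     && [[ "$(parallel --version 2>/dev/null)" == *"GNU parallel"* ]]; then
    # -0 reads NUL-separated names; -j -1 uses all CPU cores minus one.
    find "${dir}/" -type f -iname '*.gz' -print0 | parallel -0 -j -1 gunzip
  else
    # Serial fallback: the original behaviour, one batched gunzip run.
    find "${dir}/" -type f -iname '*.gz' -exec gunzip {} +
  fi
}
```

Calling gunzip_tree "${RAW_DIR}" would then behave the same on machines with and without GNU Parallel, only faster on the former.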

Fixes: #1075

Vincy1230

I ran into the same issue: on some servers the initial download stage can take ten hours or more when only one CPU core is used.

Deeply grateful for your pull request.

Vincy1230

There is still room for optimization in your changes. Piping into parallel lets gunzip run concurrently, but it gives up the batching of find ... {} +, which passes as many files as possible to each gunzip invocation. As written, the new script starts a separate gunzip process for every file and pays that startup and teardown overhead each time.

How about adding a --xargs option as well, so we can get the best of both approaches?
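To make the batching point above concrete, here is a small sketch that counts how many processes get started in each mode. It uses plain find with a stand-in command (sh -c 'echo run') instead of gunzip, so it runs without GNU Parallel installed; the function name count_invocations and the mode names are hypothetical.

```shell
# Hypothetical helper: print how many times the stand-in command is started
# for the *.gz files under a directory, in "per-file" or "batched" mode.
count_invocations() {
  local dir="$1" mode="$2"
  if [ "$mode" = "per-file" ]; then
    # `\;` starts one process per file, like parallel without --xargs.
    find "$dir" -type f -name '*.gz' -exec sh -c 'echo run' _ {} \; \
      | wc -l | tr -d ' '
  else
    # `+` packs as many file names as fit onto each command line.
    find "$dir" -type f -name '*.gz' -exec sh -c 'echo run' _ {} + \
      | wc -l | tr -d ' '
  fi
}
```

For a directory holding a few dozen files, the per-file mode starts one process per file while the batched mode starts a single one. The suggested --xargs option aims for the same kind of batching inside each parallel job.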

Vincy1230 · 2026-02-11T12:40:53Z

scripts/download_pdb_mmcif.sh

-find "${RAW_DIR}/" -type f -iname "*.gz" -exec gunzip {} +
+if command -v parallel >/dev/null 2>&1
+then
+   find "${RAW_DIR}/" -type f -iname "*.gz" -print0 | parallel -0 -j -1 gunzip


Suggested change

-   find "${RAW_DIR}/" -type f -iname "*.gz" -print0 | parallel -0 -j -1 gunzip
+   find "${RAW_DIR}/" -type f -iname "*.gz" -print0 | parallel -0 -j -1 --xargs gunzip

feat: Implementing parallelization for unzipping files

ec68cdb

Vincy1230 approved these changes Feb 11, 2026

Vincy1230 reviewed Feb 11, 2026
