Add only parallel changes #427

Draft
FerriolCalvet wants to merge 45 commits into dev from dev-chunk-only-parallel

@FerriolCalvet
Collaborator

No description provided.

migrau and others added 14 commits July 2, 2025 16:12
Implemented parallel processing of VEP annotation through configurable chunking:

- Added `panel_sites_chunk_size` parameter (default: 0, no chunking)
  - When >0, splits sites file into chunks for parallel VEP annotation
  - Uses bash `split` command for efficient chunking with preserved headers
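
The chunking behaviour described above can be sketched in Python. This is only an illustration of the idea (fixed-size chunks, header repeated in each one, `0` meaning no chunking), not the module's actual implementation, which uses bash `split`. The function name `chunk_with_header` and the `CHROM`/`POS` columns are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunk_with_header(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield chunks of `lines`, each starting with a copy of the header line.

    chunk_size <= 0 mirrors the "0 = no chunking" default: one chunk is returned.
    A header-only input yields no chunks when chunk_size > 0.
    """
    it = iter(lines)
    header = next(it, None)
    if header is None:
        return  # empty input, nothing to yield
    if chunk_size <= 0:
        yield [header, *it]
        return
    while True:
        body = list(islice(it, chunk_size))  # next chunk_size data rows
        if not body:
            break
        yield [header, *body]


# Hypothetical sites table: one header line plus five data rows.
sites = ["CHROM\tPOS", "chr1\t100", "chr1\t200", "chr2\t50", "chr2\t75", "chrX\t10"]
chunks = list(chunk_with_header(sites, 2))
for i, chunk in enumerate(chunks):
    print(i, chunk)
```

With five data rows and a chunk size of 2, this gives three chunks, the last one holding a single row.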

- Modified SITESFROMPOSITIONS module:
  - Outputs multiple chunk files (*.sites4VEP.chunk*.tsv) instead of single file
  - Logs chunk configuration and number of chunks created
  - Chunk size configurable via `ext.chunk_size` in modules.config

- Updated CREATE_PANELS workflow:
  - Flattens chunks with `.transpose()` for parallel processing
  - Each chunk gets unique ID for VEP tracking
  - Merges chunks using `collectFile` with header preservation
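
As a rough illustration of what "header preservation" means when merging, the sketch below keeps the header of the first chunk and drops it from every later one. The workflow itself does this with Nextflow's `collectFile`; the commit message does not say which options are used, so the function `merge_keep_first_header` and the sample rows here are hypothetical.

```python
from typing import Iterable, List


def merge_keep_first_header(chunks: Iterable[List[str]]) -> List[str]:
    """Concatenate chunks, keeping only the header of the first non-empty chunk."""
    merged: List[str] = []
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks entirely
        if not merged:
            merged.extend(chunk)      # first non-empty chunk: keep its header
        else:
            merged.extend(chunk[1:])  # later chunks: drop the repeated header
    return merged


# Hypothetical annotated chunks, each carrying the same header line.
annotated_chunks = [
    ["CHROM\tPOS", "chr1\t100", "chr1\t200"],
    ["CHROM\tPOS", "chr2\t50"],
]
merged = merge_keep_first_header(annotated_chunks)
print(merged)
```

Chunks can arrive in any order after parallel annotation, so the merged table is not guaranteed to be in genomic order.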

- Added SORT_MERGED_PANEL module:
  - Sorts merged panels by chromosome and position (genomic order)
  - Prevents "out of order" errors in downstream BED operations
  - Applied to both compact and rich annotation outputs
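
A minimal sketch of sorting by chromosome and position follows. The commit message does not specify the exact ordering SORT_MERGED_PANEL applies, so the convention here (numeric chromosomes first, then X, Y, M, then any other contigs by name) is an assumption, as are the names `chrom_key` and `sort_panel` and the two-column layout.

```python
from typing import List, Tuple


def chrom_key(chrom: str) -> Tuple[int, int, str]:
    """Sort key: numeric chromosomes, then X/Y/M, then other contigs by name."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    if name.isdigit():
        return (0, int(name), "")
    special = {"X": 0, "Y": 1, "M": 2, "MT": 2}
    if name in special:
        return (1, special[name], "")
    return (2, 0, name)


def sort_panel(lines: List[str]) -> List[str]:
    """Sort data rows by (chromosome, position), leaving the header first."""
    if not lines:
        return []
    header, rows = lines[0], lines[1:]

    def row_key(row: str) -> Tuple[Tuple[int, int, str], int]:
        chrom, pos = row.split("\t")[:2]  # assumes columns 1-2 are chrom, pos
        return (chrom_key(chrom), int(pos))

    return [header] + sorted(rows, key=row_key)


# Hypothetical merged panel whose rows are out of genomic order.
panel = ["CHROM\tPOS", "chr10\t5", "chr2\t300", "chrX\t1", "chr2\t20", "chr1\t999"]
sorted_panel = sort_panel(panel)
print(sorted_panel)
```

Positions are compared as integers, so `chr2:20` sorts before `chr2:300`, and `chr2` sorts before `chr10`, which a plain string sort would get wrong. Downstream BED tools may expect a different chromosome order, in which case the key function would need to match it.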

- Enhanced logging across chunking pipeline:
  - SITESFROMPOSITIONS: reports chunk_size and number of chunks created
  - POSTPROCESS_VEP_ANNOTATION: shows internal chunk_size and expected chunks
  - CUSTOM_ANNOTATION_PROCESSING: displays chr_chunk_size and processing info

Configuration:
  - `panel_sites_chunk_size`: controls file chunking (0=disabled)
  - `panel_postprocessing_chunk_size`: internal memory management
  - `panel_custom_processing_chunk_size`: internal chromosome chunking

Benefits:
  - Parallelizes VEP annotation for large panels
  - Reduces memory footprint per task
  - Maintains genomic sort order for downstream tools
@FerriolCalvet FerriolCalvet changed the title from "Dev chunk only parallel" to "Add only parallel changes" Feb 27, 2026
- Remove all optimization-related updates (to be added later)
@FerriolCalvet FerriolCalvet added this to the Current iteration milestone Feb 27, 2026
@FerriolCalvet FerriolCalvet linked an issue Feb 27, 2026 that may be closed by this pull request

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Large memory usage by panel_postprocessing_annotation.py

2 participants