VCF is a row-oriented text format inside a compressed stream. Parsing is sequential because record N's start depends on record N-1's end. BGZF decompression is already parallel (noodles-bgzf worker threads), but the parse itself is single-threaded.
Tabix-indexed VCFs (.tbi or .csi sidecar) support random access by genomic region. When an index is present we can split the file into non-overlapping chromosome regions and parse them in parallel threads, each writing to its own per-chromosome batch.
Behavior
- On favor ingest, probe for <path>.tbi or <path>.csi alongside each input VCF.
- If an index exists, read the region list from the index header (chromosomes and their block offsets).
- Partition regions into N worker groups based on the memory budget (each worker needs one batch buffer per chromosome it touches).
- Each worker opens an independent BGZF reader, seeks to its region range, parses records, and fills thread-local batch builders.
- A coordinator thread collects full batches from workers and flushes to the per-chromosome parquet writers (single writer per chromosome, fed by multiple parse threads).
- If no index is present, fall back to the current single-threaded sequential parse.
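The flow above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: `probe_index`, `partition`, `run_parallel`, and `Batch` are hypothetical names, the round-robin grouping stands in for budget-aware partitioning, and the `parse_region` callback stands in for "open an independent BGZF reader, seek to the region, parse records". The coordinator here just collects records per chromosome where the real one would feed the per-chromosome parquet writers.

```rust
use std::collections::BTreeMap;
use std::path::{Path, PathBuf};
use std::sync::mpsc;
use std::thread;

/// Return the first index sidecar found next to `vcf`, if any.
/// The extension is appended to the full file name
/// (sample.vcf.gz -> sample.vcf.gz.tbi), not substituted for ".gz".
fn probe_index(vcf: &Path) -> Option<PathBuf> {
    ["tbi", "csi"].iter().find_map(|ext| {
        let mut name = vcf.as_os_str().to_os_string();
        name.push(".");
        name.push(ext);
        let candidate = PathBuf::from(name);
        candidate.exists().then_some(candidate)
    })
}

/// Hypothetical stand-in for one full batch of parsed records.
struct Batch {
    chrom: String,
    records: Vec<String>,
}

/// Round-robin chromosomes into `n` worker groups.
/// Placeholder for the budget-aware partitioning described above.
fn partition(chroms: &[String], n: usize) -> Vec<Vec<String>> {
    let n = n.max(1);
    let mut groups = vec![Vec::new(); n];
    for (i, chrom) in chroms.iter().enumerate() {
        groups[i % n].push(chrom.clone());
    }
    groups
}

/// Spawn one worker per group; the calling thread acts as coordinator.
/// Workers send finished batches over a channel, and the coordinator is
/// the only place that appends to each chromosome's output.
fn run_parallel<F>(groups: Vec<Vec<String>>, parse_region: F) -> BTreeMap<String, Vec<String>>
where
    F: Fn(&str) -> Vec<String> + Send + Copy + 'static,
{
    let (tx, rx) = mpsc::channel::<Batch>();
    let handles: Vec<_> = groups
        .into_iter()
        .map(|group| {
            let tx = tx.clone();
            thread::spawn(move || {
                for chrom in group {
                    let records = parse_region(&chrom);
                    // A send error means the coordinator is gone; stop quietly.
                    if tx.send(Batch { chrom, records }).is_err() {
                        break;
                    }
                }
            })
        })
        .collect();
    // Drop the original sender so `rx` ends once every worker has finished.
    drop(tx);

    let mut per_chrom: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for batch in rx {
        per_chrom.entry(batch.chrom).or_default().extend(batch.records);
    }
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    per_chrom
}
```

A caller would take the parallel path only when `probe_index` returns `Some(..)` and otherwise keep the existing sequential parse. Because a single coordinator owns every writer, the writers need no locking, at the cost of moving each batch through a channel.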
Constraints
- Memory budget must account for N workers x batch buffers. Derive worker count from budget.
- Each worker decompresses independently, so total BGZF throughput scales with the worker count until disk I/O or available cores become the bottleneck.
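- Region boundaries must be seekable in the compressed stream. Tabix and CSI store chunk boundaries as BGZF virtual offsets (a compressed block start plus an offset into the decompressed block), so a worker seeks to the block start and skips forward to its first record; a boundary need not coincide with a block start.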
- The single-pass genotype extraction path (geno_writer) must still work. Either each worker has its own GenotypeWriter and results merge, or genotype extraction stays single-threaded and only variant-site parsing parallelizes.
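Two of the constraints above reduce to small pieces of arithmetic, sketched here under stated assumptions. `worker_count` and `split_virtual_offset` are hypothetical helper names; the budget formula assumes every worker holds the same number of equally sized batch buffers, which is a simplification. The virtual offset layout (upper 48 bits for the compressed block offset, lower 16 bits for the position inside the decompressed block) is the one defined for BGZF.

```rust
/// Derive the worker count from a memory budget.
/// Assumes each worker holds `chroms_per_worker` batch buffers of
/// `batch_bytes` each. Never returns fewer than 1 worker, so a budget
/// too small for parallelism degrades to a single parse thread.
fn worker_count(
    budget_bytes: u64,
    batch_bytes: u64,
    chroms_per_worker: u64,
    max_workers: usize,
) -> usize {
    let per_worker = batch_bytes.saturating_mul(chroms_per_worker).max(1);
    let fit = usize::try_from(budget_bytes / per_worker).unwrap_or(usize::MAX);
    fit.clamp(1, max_workers.max(1))
}

/// Split a BGZF virtual offset into
/// (compressed block offset in the file, offset within the decompressed block).
fn split_virtual_offset(voffset: u64) -> (u64, u16) {
    (voffset >> 16, (voffset & 0xffff) as u16)
}
```

For example, a 1 GiB budget with 64 MiB batches and four chromosomes per worker allows four workers before the cap applies. `max_workers` would typically come from the available core count.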
Depends on