Parallel VCF ingest via tabix region splitting #74

@vineetver

VCF is a row-oriented text format inside a compressed stream. Parsing is sequential because record N's start depends on record N-1's end. BGZF decompression is already parallel (noodles-bgzf worker threads), but the parse itself is single-threaded.

Tabix-indexed VCFs (.tbi or .csi sidecar) support random access by genomic region. When an index is present we can split the file into non-overlapping chromosome regions and parse them in parallel threads, each writing to its own per-chromosome batch.
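A minimal sketch of the sidecar probe, assuming the index sits next to the input with `.tbi` or `.csi` appended to the full file name. The helper names `index_sidecar_candidates` and `find_index` are hypothetical and not part of the existing codebase.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: builds `<path>.tbi` and `<path>.csi` for a VCF path.
/// The suffix is appended to the whole file name, so `a.vcf.gz` becomes
/// `a.vcf.gz.tbi` rather than replacing the `.gz` extension.
fn index_sidecar_candidates(vcf: &Path) -> Vec<PathBuf> {
    [".tbi", ".csi"]
        .iter()
        .map(|suffix| {
            let mut name = vcf.as_os_str().to_owned();
            name.push(suffix);
            PathBuf::from(name)
        })
        .collect()
}

/// Hypothetical helper: returns the first sidecar that exists on disk,
/// preferring `.tbi` over `.csi`. `None` means "fall back to sequential parse".
fn find_index(vcf: &Path) -> Option<PathBuf> {
    index_sidecar_candidates(vcf)
        .into_iter()
        .find(|candidate| candidate.is_file())
}
```

Preferring `.tbi` over `.csi` is an arbitrary choice in this sketch; the issue does not say which should win when both are present.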

Behavior

  1. On favor ingest, probe for <path>.tbi or <path>.csi alongside each input VCF.
  2. If an index exists, read the region list from the index (reference sequence names and the virtual offsets of their chunks).
  3. Partition regions into N worker groups based on the memory budget (each worker needs one batch buffer per chromosome it touches).
  4. Each worker opens an independent BGZF reader, seeks to its region range, parses records, and fills thread-local batch builders.
  5. A coordinator thread collects full batches from workers and flushes to the per-chromosome parquet writers (single writer per chromosome, fed by multiple parse threads).
  6. If no index is present, fall back to the current single-threaded sequential parse.
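The worker and coordinator hand-off in steps 4 and 5 could look roughly like the sketch below. It uses only the standard library: `Batch`, `parse_region`, and `ingest` are hypothetical names, the "regions" are pre-split lines rather than real BGZF ranges, and the coordinator counts rows where a real one would feed the per-chromosome parquet writers.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for one full batch of parsed records.
struct Batch {
    chrom: String,
    rows: Vec<String>,
}

/// Stand-in for step 4. A real worker would open its own BGZF reader and
/// seek to the region; here the region is already a slice of lines.
fn parse_region(chrom: &str, lines: &[String], batch_size: usize, tx: &mpsc::SyncSender<Batch>) {
    for chunk in lines.chunks(batch_size.max(1)) {
        let batch = Batch { chrom: chrom.to_string(), rows: chunk.to_vec() };
        // Blocks when the queue is full; stops if the coordinator is gone.
        if tx.send(batch).is_err() {
            return;
        }
    }
}

/// Spawns one worker per group and returns the rows received per chromosome.
/// `queue_depth` bounds the number of in-flight batches (backpressure).
fn ingest(
    groups: Vec<Vec<(String, Vec<String>)>>,
    batch_size: usize,
    queue_depth: usize,
) -> BTreeMap<String, usize> {
    let (tx, rx) = mpsc::sync_channel::<Batch>(queue_depth);
    let handles: Vec<_> = groups
        .into_iter()
        .map(|group| {
            let tx = tx.clone();
            thread::spawn(move || {
                for (chrom, lines) in &group {
                    parse_region(chrom, lines, batch_size, &tx);
                }
            })
        })
        .collect();
    drop(tx); // the receive loop ends once every worker has dropped its sender

    // Coordinator (step 5): the single consumer, so each chromosome has one writer.
    let mut written = BTreeMap::new();
    for batch in rx {
        *written.entry(batch.chrom).or_insert(0) += batch.rows.len();
    }
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    written
}
```

A bounded `sync_channel` is used here so that slow writers stall the parsers instead of letting batches pile up in memory; whether that trade-off suits the real pipeline is an open design question.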

Constraints

  • Memory budget must account for N workers × the batch buffers each worker holds. Derive the worker count from the budget.
  • Each worker decompresses independently, so aggregate BGZF throughput should scale with worker count until CPU cores or storage bandwidth saturate.
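  • Region boundaries must fall on record starts. Tabix chunk boundaries are BGZF virtual offsets (compressed block offset plus an offset within the decompressed block), so they mark record starts but not necessarily BGZF block starts; each worker must seek by virtual offset.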
  • The single-pass genotype extraction path (geno_writer) must still work. Either each worker has its own GenotypeWriter and results merge, or genotype extraction stays single-threaded and only variant-site parsing parallelizes.
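One way to derive the worker count from the memory budget and then balance regions across workers is sketched below. The sizing rule, the function names `worker_count` and `partition`, and the use of compressed byte size as the load estimate are all assumptions made for illustration, not decisions recorded in this issue.

```rust
/// Hypothetical sizing rule: a worker holds one batch buffer per chromosome
/// it touches, so its footprint is `batch_bytes * chroms_per_worker`.
/// The result is capped by `max_threads` and is never less than 1.
fn worker_count(
    budget_bytes: u64,
    batch_bytes: u64,
    chroms_per_worker: u64,
    max_threads: usize,
) -> usize {
    let per_worker = batch_bytes.saturating_mul(chroms_per_worker).max(1);
    let by_memory = (budget_bytes / per_worker) as usize;
    by_memory.clamp(1, max_threads.max(1))
}

/// Greedy longest-first partitioning: sort regions by estimated size
/// (largest first) and give each one to the currently lightest group.
/// Each region is `(name, estimated_compressed_bytes)`.
fn partition(mut regions: Vec<(String, u64)>, workers: usize) -> Vec<Vec<(String, u64)>> {
    let workers = workers.max(1);
    // Ties are broken by name so the result is deterministic.
    regions.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    let mut groups = vec![Vec::new(); workers];
    let mut loads = vec![0u64; workers];
    for region in regions {
        let lightest = (0..workers)
            .min_by_key(|&i| loads[i])
            .expect("workers is at least 1");
        loads[lightest] += region.1;
        groups[lightest].push(region);
    }
    groups
}
```

The greedy heuristic does not guarantee an optimal split, but it keeps the largest chromosomes on separate workers, which is usually what dominates wall-clock time.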

Depends on

Metadata

Assignees

No one assigned

    Labels

    ingest (VCF/genotype ingest pipeline), performance (Optimization and profiling)
