Preprocessing Failure: KeyError and Coordinate Corruption (Pos 1) during VCF Normalization #222

@lbombini

Description

The pharmcat_pipeline preprocessing step fails during VCF normalization in two cases:

  1. Multiple distinct variants normalizing to the same coordinate and allele, causing a KeyError.
  2. Symbolic alleles (e.g., <DEL>) being shifted to position 1 during normalization, resulting in unsorted VCFs and indexing failures.

KeyError issue

The pipeline crashes in extract_pgx_variants while processing the input VCF.

Traceback:

  File "/pharmcat/pcat/utilities.py", line 1097, in extract_pgx_variants
    ref_pos_dynamic[input_chr_pos].pop(input_ref_alt)
KeyError: ('C', 'CAT')

Root Cause:
The input VCF contains two distinct variants that, after normalization, collide at the same position with the same alleles.

  • Input Variant A: chr2:233760233 C > CAT
  • Input Variant B: chr2:233760234 A > ATA
  • Post-Normalization: Both variants are left-aligned to chr2:233760233 C > CAT.
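
The collision can be reproduced with the usual left-align-and-trim procedure. The sketch below is a simplified version of that procedure, not bcftools' actual code, and it runs against a made-up six-base reference (TTCATG) rather than the real chr2 sequence; only the local C, A context matters. The function name normalize and the toy coordinates (3 and 4 standing in for 233760233 and 233760234) are illustrative assumptions.

```python
def normalize(ref_seq, pos, ref, alt):
    """Left-align and trim one biallelic variant.

    ref_seq is the contig sequence, pos is 1-based, and ref != alt is assumed.
    """
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if not ref or not alt:
            if pos <= 1:
                raise ValueError("cannot extend past the start of the contig")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


TOY_REF = "TTCATG"  # toy sequence: position 3 is C, position 4 is A

print(normalize(TOY_REF, 3, "C", "CAT"))  # variant A -> (3, 'C', 'CAT')
print(normalize(TOY_REF, 4, "A", "ATA"))  # variant B -> (3, 'C', 'CAT')
```

Both inputs end up as the same position and allele pair, which is the many-to-one situation described in this issue.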

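The internal dictionary ref_pos_dynamic appears unable to handle this many-to-one mapping: presumably the .pop() for the first variant removes the ('C', 'CAT') entry, so the .pop() for the second variant finds no such key and raises the KeyError.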
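
A minimal sketch of the suspected failure mode, plus one tolerant alternative. The nested-dict layout used here for ref_pos_dynamic (position mapped to a dict keyed by (ref, alt)) is a guess based on the traceback line, not taken from the PharmCAT source, and make_lookup, consume_strict, and consume_tolerant are hypothetical names.

```python
_MISSING = object()  # sentinel so a stored value of None is not mistaken for "absent"


def make_lookup():
    # Assumed shape: {(chrom, pos): {(ref, alt): payload}}
    return {("chr2", 233760233): {("C", "CAT"): "pgx position"}}


# Two input records that are identical after normalization.
normalized_inputs = [
    (("chr2", 233760233), ("C", "CAT")),  # from input variant A
    (("chr2", 233760233), ("C", "CAT")),  # from input variant B, left-aligned
]


def consume_strict(lookup, records):
    # Mirrors the pattern in the traceback: the second pop has nothing to remove.
    for chr_pos, ref_alt in records:
        lookup[chr_pos].pop(ref_alt)


def consume_tolerant(lookup, records):
    # Pops with a default and reports repeats instead of crashing.
    duplicates = []
    for chr_pos, ref_alt in records:
        if lookup.get(chr_pos, {}).pop(ref_alt, _MISSING) is _MISSING:
            duplicates.append((chr_pos, ref_alt))
    return duplicates


try:
    consume_strict(make_lookup(), normalized_inputs)
except KeyError as err:
    print("KeyError:", err)

print(consume_tolerant(make_lookup(), normalized_inputs))
```

Whether a repeated record should be skipped, merged, or reported as an error is a design decision for the maintainers; the tolerant variant only shows that the collision can be detected without an unhandled exception.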

Coordinate Corruption in Symbolic Alleles

When the VCF contains symbolic alleles, the pipeline produces an invalid (unsorted) VCF, preventing indexing.

Error Message:

[E :: hts_idx_push] Unsorted positions on sequence #1: 97173990 followed by 1
index: failed to create index for "/out/union_sorted.normalized.vcf.bgz"

Root Cause Analysis:
The normalization step handles symbolic alleles (specifically <DEL>) incorrectly. A variant at a valid genomic position is rewritten to position 1 with a reference allele of N.

  • Pre-Normalization: chr1:97175176 T > <DEL>
  • Post-Normalization: chr1:1 N > <DEL>

This suggests the normalization logic is attempting to anchor or left-align a symbolic allele that lacks an explicit sequence, defaulting it to the start of the contig.
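
To confirm which input records are affected, the symbolic alleles can be located before normalization. The sketch below scans plain-text VCF lines and separates records whose ALT column contains a symbolic allele such as <DEL>. It is a diagnostic aid under simple assumptions (tab-separated, uncompressed lines), not a fix for the pipeline, and is_symbolic and split_symbolic are hypothetical helper names.

```python
def is_symbolic(alt_field):
    """True if any comma-separated ALT allele looks like <ID>."""
    return any(a.startswith("<") and a.endswith(">") for a in alt_field.split(","))


def split_symbolic(vcf_lines):
    """Return (kept, symbolic): header and sequence-based records vs. symbolic ones."""
    kept, symbolic = [], []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)  # header lines are always kept
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 5 (index 4) is ALT in the fixed VCF columns.
        if len(fields) >= 5 and is_symbolic(fields[4]):
            symbolic.append(line)
        else:
            kept.append(line)
    return kept, symbolic


example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t97175176\t.\tT\t<DEL>\t.\t.\t.\n",   # record from this report
    "chr2\t233760233\t.\tC\tCAT\t.\t.\t.\n",
]

kept, symbolic = split_symbolic(example)
print(len(kept), "lines kept;", len(symbolic), "symbolic record(s)")
for record in symbolic:
    print(record.rstrip("\n"))
```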


Environment:

pgkb/pharmcat Docker image, PharmCAT v3.1.1
bcftools 1.22
Using htslib 1.22

Command used:

docker run --rm \
    -v ~/data:/input -v ~/out:/out \
    pgkb/pharmcat pharmcat_pipeline -v \
    -reporterCallsOnlyTsv -o /out /input/sample.vcf.gz
