Description
The pharmcat_pipeline preprocessing step fails when encountering specific variant conditions during normalization:
- Multiple distinct variants normalizing to the same coordinate and allele, causing a KeyError.
- Symbolic alleles (e.g., <DEL>) being shifted to position 1 during normalization, resulting in unsorted VCFs and indexing failures.
KeyError issue
When processing data, the pipeline crashes during extract_pgx_variants.
Traceback:
File "/pharmcat/pcat/utilities.py", line 1097, in extract_pgx_variants
ref_pos_dynamic[input_chr_pos].pop(input_ref_alt)
KeyError: ('C', 'CAT')
Root Cause:
The input VCF contains two distinct variants that, after normalization, collide at the same position with the same alleles.
- Input Variant A: chr2:233760233 C > CAT
- Input Variant B: chr2:233760234 A > ATA
- Post-Normalization: Both variants are left-aligned to chr2:233760233 C > CAT.
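To illustrate why the two records collapse, here is a minimal sketch of the standard trim-and-extend left-alignment procedure. It is not PharmCAT or bcftools code: `left_align` and `TOY_CONTIG` are hypothetical, and the toy contig only mimics the reported local context (a `C` followed by an `A`) at small coordinates, since the real reference sequence is not shown in this report.

```python
# Hypothetical toy contig: 'C' at position 3 and 'A' at position 4 (1-based),
# standing in for chr2:233760233 and chr2:233760234.
TOY_CONTIG = "GGCATG"


def left_align(pos, ref, alt, contig):
    """Left-align and trim one biallelic variant (pos is 1-based)."""
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both alleles one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend past the start of the contig")
            pos -= 1
            base = contig[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


variant_a = left_align(3, "C", "CAT", TOY_CONTIG)
variant_b = left_align(4, "A", "ATA", TOY_CONTIG)
print(variant_a, variant_b)  # both print as (3, 'C', 'CAT')
```

Both inputs describe the same resulting sequence (`CATA...`), so a normalizer is expected to emit the same record for each of them.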
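The internal dictionary ref_pos_dynamic appears unable to handle this many-to-one mapping: presumably the first variant's .pop() removes the ('C', 'CAT') entry, so the second variant's .pop() for the same key raises the KeyError.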
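The suspected failure mode can be reproduced with a plain nested dictionary. This is only a sketch: the shape of the table (position key, then (ref, alt) key) is inferred from the traceback, and `consume`, `run`, and the payload string are hypothetical names, not PharmCAT's actual code.

```python
import copy

# Assumed shape, inferred from the traceback:
# {(chrom, pos): {(ref, alt): payload}}
REF_POS_DYNAMIC = {("chr2", 233760233): {("C", "CAT"): "PGx position"}}

# Both input variants arrive with the same key after normalization.
NORMALIZED_INPUTS = [
    (("chr2", 233760233), ("C", "CAT")),  # from input variant A
    (("chr2", 233760233), ("C", "CAT")),  # from input variant B
]


def consume(table, chr_pos, ref_alt, strict=True):
    """Remove and return the entry; strict=True mirrors a bare .pop(key)."""
    alleles = table.get(chr_pos, {})
    if strict:
        return alleles.pop(ref_alt)      # raises KeyError on the second hit
    return alleles.pop(ref_alt, None)    # tolerant variant: returns None instead


def run(strict):
    table = copy.deepcopy(REF_POS_DYNAMIC)
    results = []
    for chr_pos, ref_alt in NORMALIZED_INPUTS:
        try:
            results.append(consume(table, chr_pos, ref_alt, strict))
        except KeyError as exc:
            results.append(f"KeyError: {exc}")
    return results


print(run(strict=True))   # ['PGx position', "KeyError: ('C', 'CAT')"]
print(run(strict=False))  # ['PGx position', None]
```

Passing a default to `.pop()` only avoids the crash. Whether the duplicate record should be dropped, merged, or reported is a separate decision for the maintainers.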
Coordinate Corruption in Symbolic Alleles
When the VCF contains symbolic alleles, the pipeline produces an invalid (unsorted) VCF, preventing indexing.
Error Message:
[E::hts_idx_push] Unsorted positions on sequence #1: 97173990 followed by 1
index: failed to create index for "/out/union_sorted.normalized.vcf.bgz"
Root Cause Analysis:
The normalization step handles symbolic alleles (specifically <DEL>) incorrectly. A variant at a valid genomic position is rewritten to position 1 with a reference allele of N.
- Pre-Normalization: chr1:97175176 T > <DEL>
- Post-Normalization: chr1:1 N > <DEL>
This suggests the normalization logic is attempting to anchor or left-align a symbolic allele that lacks an explicit sequence, defaulting it to the start of the contig.
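One possible mitigation, sketched below, is to set records with symbolic ALT alleles aside before normalization so they are never re-anchored. This is an assumption about a workable approach, not the pipeline's behavior: `is_symbolic` and `partition_records` are hypothetical helpers, and the second sample record (its alleles in particular) is made up for illustration.

```python
def is_symbolic(alt_field):
    """True if any ALT allele is an angle-bracketed symbolic allele, e.g. <DEL>."""
    return any(a.startswith("<") and a.endswith(">") for a in alt_field.split(","))


def partition_records(vcf_lines):
    """Split VCF data lines into (regular, symbolic); header lines are skipped."""
    regular, symbolic = [], []
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 5 (index 4) of a VCF data line is ALT.
        (symbolic if is_symbolic(fields[4]) else regular).append(line)
    return regular, symbolic


sample_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t97173990\t.\tA\tG\t.\tPASS\t.",       # made-up SNV for illustration
    "chr1\t97175176\t.\tT\t<DEL>\t.\tPASS\t.",   # symbolic allele from this report
]

regular, symbolic = partition_records(sample_lines)
print(len(regular), len(symbolic))  # 1 1
```

Only the regular records would then go through normalization. The symbolic ones keep their original coordinates and can be merged back before sorting and indexing.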
The environment:
pgkb/pharmcat Docker image, PharmCAT v3.1.1
bcftools 1.22
Using htslib 1.22
Command used:
docker run --rm \
-v ~/data:/input -v ~/out:/out \
pgkb/pharmcat pharmcat_pipeline -v \
-reporterCallsOnlyTsv -o /out /input/sample.vcf.gz