VCF parsing and summarization project for small-variant review workflows. The repository converts raw variant rows into gene-level and variant-type summaries, and it can emit stable tabular outputs for downstream inspection.
Variant call files are compact but not analyst-friendly. Before deeper interpretation, it is useful to normalize records into a tabular representation and produce quick summaries such as gene hit counts, impact levels, and mutation-type counts.
- Reads plain-text or gzipped VCF input.
- Requires a valid `#CHROM` header and rejects malformed short records.
- Parses `INFO` tags into structured fields such as `GENE`, `IMPACT`, and `TRANSCRIPT`.
- Expands multi-allelic rows so each alternate allele becomes its own record.
- Classifies each variant as `SNP`, `insertion`, `deletion`, or `complex`.
- Exports normalized variants and summaries as `.csv` or `.json`.
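
The INFO parsing, multi-allelic expansion, and classification steps listed above can be sketched as follows. This is a minimal illustration, not the code in `src/parser.py`: the function names (`parse_info`, `classify_variant`, `expand_record`), the output field names, and the length-based classification rule are all assumptions, and the sample coordinates are made up.

```python
def parse_info(info: str) -> dict:
    """Split a raw INFO string such as 'GENE=TP53;IMPACT=HIGH' into a dict."""
    fields = {}
    if info in ("", "."):
        return fields
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        # Flag-style tags (no '=') are stored as True.
        fields[key] = value if sep else True
    return fields


def classify_variant(ref: str, alt: str) -> str:
    """Classify one REF/ALT pair by allele length (assumed rule, simplified).

    Symbolic alleles such as '<DEL>' or '*' are not handled specially here.
    """
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) < len(alt) and alt.startswith(ref):
        return "insertion"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"
    return "complex"


def expand_record(chrom, pos, ref, alt_field, info):
    """Yield one normalized record per alternate allele in a VCF row."""
    tags = parse_info(info)
    for alt in alt_field.split(","):
        yield {
            "chrom": chrom,
            "pos": int(pos),
            "ref": ref,
            "alt": alt,
            "variant_type": classify_variant(ref, alt),
            "gene": tags.get("GENE"),
            "impact": tags.get("IMPACT"),
            "transcript": tags.get("TRANSCRIPT"),
        }


# Illustrative row with two alternate alleles (values are made up).
for record in expand_record("17", "100", "G", "A,GT", "GENE=TP53;IMPACT=HIGH"):
    print(record["alt"], record["variant_type"])
```

Each alternate allele inherits the row-level INFO tags in this sketch; per-allele annotations, if the real parser supports them, would need additional handling.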
VCF / VCF.GZ input
|
v
Header validation + record parsing
|
v
INFO normalization
|
v
Pandas DataFrame
|
+--> normalized variant table
|
+--> gene-level summary
|
+--> variant-type summary
|
+--> machine-readable run report
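
The first two stages of the flow above (input handling and header validation with fail-fast record parsing) might look roughly like the sketch below. It uses only the standard library; `open_vcf`, `read_records`, and `REQUIRED_COLUMNS` are hypothetical names, and the exact checks performed by the real parser may differ.

```python
import gzip
import io

# The eight fixed columns defined by the VCF specification.
REQUIRED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def open_vcf(path):
    """Open a VCF as text, using gzip when the path ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def read_records(handle):
    """Yield the tab-split fields of each data row, raising on bad structure."""
    header_seen = False
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#CHROM"):
            if line.split("\t")[:8] != REQUIRED_COLUMNS:
                raise ValueError(f"line {line_number}: malformed #CHROM header")
            header_seen = True
            continue
        if not header_seen:
            raise ValueError(f"line {line_number}: data row before #CHROM header")
        fields = line.split("\t")
        if len(fields) < 8:
            # Fail fast instead of silently skipping the short record.
            raise ValueError(f"line {line_number}: expected at least 8 columns, got {len(fields)}")
        yield fields
    if not header_seen:
        raise ValueError("missing #CHROM header")


# In-memory demonstration; a real run would pass open_vcf(path) instead.
sample = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "17\t100\t.\tG\tA,GT\t.\tPASS\tGENE=TP53;IMPACT=HIGH\n"
)
rows = list(read_records(sample))
print(len(rows))  # 1
```

The resulting rows could then be normalized and loaded into a pandas DataFrame for the summary outputs shown in the diagram.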
- `src/parser.py`: parsing logic, summarization, validation, and CLI
- `data/example.vcf`: sample input with gene and impact annotations
- `tests/test_parser.py`: tests for parsing, gzip support, output writing, and header validation
- `.github/workflows/ci.yml`: automated test workflow
python src\parser.py --input data\example.vcf --summary-json outputs\summary.json --variants-out outputs\variants.csv --gene-summary-out outputs\gene_summary.csv --type-summary-out outputs\type_summary.json

Observed CLI summary:
Summary by gene:
gene variant_count top_impact
BRCA1 2 MODERATE
EGFR 1 MODERATE
KRAS 1 HIGH
TP53 1 HIGH
Summary by variant type:
variant_type count
SNP 3
insertion 2
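
A `top_impact` column like the one in the gene summary above can be derived by ranking impact levels and taking the maximum per gene. The sketch below is an assumption about how that could work, not the repository's code: `IMPACT_RANK` follows the common HIGH > MODERATE > LOW > MODIFIER convention used by annotation tools, and `summarize_by_gene` plus the `GENE_A`/`GENE_B` records are hypothetical.

```python
from collections import defaultdict

# Assumed severity ordering; unknown or missing impacts rank lowest.
IMPACT_RANK = {"MODIFIER": 0, "LOW": 1, "MODERATE": 2, "HIGH": 3}


def summarize_by_gene(records):
    """Return per-gene variant counts with the most severe observed impact."""
    grouped = defaultdict(list)
    for record in records:
        gene = record.get("gene") or "UNKNOWN"
        grouped[gene].append(record.get("impact"))
    summary = []
    for gene in sorted(grouped):
        impacts = grouped[gene]
        # Most severe impact, not the most frequent one.
        top = max(impacts, key=lambda impact: IMPACT_RANK.get(impact, -1))
        summary.append({"gene": gene, "variant_count": len(impacts), "top_impact": top})
    return summary


# Made-up records: GENE_A's most common impact is LOW, but one variant is HIGH.
example = [
    {"gene": "GENE_A", "impact": "LOW"},
    {"gene": "GENE_A", "impact": "LOW"},
    {"gene": "GENE_A", "impact": "HIGH"},
    {"gene": "GENE_B", "impact": "MODERATE"},
]
for row in summarize_by_gene(example):
    print(row["gene"], row["variant_count"], row["top_impact"])
```

With the mode, `GENE_A` would be reported as `LOW` and the single `HIGH` variant would be hidden from a triage view; taking the maximum keeps it visible.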
- The parser fails fast on invalid structure because silent record skipping undermines trust in genomic pipelines.
- Outputs are suffix-driven (`.csv` or `.json`) so generated artifacts are predictable and automation-friendly.
- Severity summary uses the most severe observed impact rather than the mode, which is a better default for triage workflows.
- This is a targeted parser for simple small-variant summaries, not a full VCF normalization or annotation engine.
- INFO extraction is key-based and currently focuses on a small set of tags.
- The sample dataset is intentionally compact for portability and test speed.
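
The suffix-driven output rule mentioned above can be illustrated with a small writer that picks the format from the file extension. This is a standard-library sketch under stated assumptions: `write_table` is a hypothetical helper, and the real CLI may handle unsupported suffixes or empty tables differently.

```python
import csv
import json
from pathlib import Path


def write_table(rows, path):
    """Write a list of dicts as CSV or JSON, chosen by the path's suffix."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix not in (".csv", ".json"):
        raise ValueError(f"unsupported output suffix {suffix!r}; use .csv or .json")
    # Create the output directory (for example 'outputs') if it is missing.
    path.parent.mkdir(parents=True, exist_ok=True)
    if suffix == ".json":
        path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
        return path
    # Column order follows the keys of the first row.
    fieldnames = list(rows[0]) if rows else []
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path
```

Because the format is implied by the path, a caller only changes the file name to switch formats, for example `write_table(rows, "outputs/type_summary.json")` versus `write_table(rows, "outputs/type_summary.csv")`.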
python -m pip install -r requirements.txt
python src\parser.py --input data\example.vcf --summary-json outputs\summary.json
python -m unittest discover -s tests