Merged
Conversation
Pull VCF reading behind a small VariantReader trait so new formats (GDS, BGEN) can plug in without touching the processing core. Collapses the duplicate record loops in ingest/vcf.rs and staar/genotype.rs into one RecordContext path. FormatHandler::open_reader is the extension point. Invariance golden and the ground-truth-vs-R suite bit-identical. Closes #87.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Pulls VCF reading behind a small trait so the processing core stops
caring about format. Collapses the two near-duplicate record loops
(
ingest/vcf.rs::process_recordandstaar/genotype.rs::process_record_geno)into one
RecordContext::processpath.FormatHandler::open_readeris the extension point for new readers.
Trait uses a
for_eachcallback instead of a lending iterator:noodles 0.73's field wrappers (
AlternateBases,Samples, ...)expose their inner
&'r stronly throughAsRef::as_ref(&self)with elided lifetime, so
next_recordwould have forced either~225 GB of memcpy per UKB-scale cohort or
ouroboros/unsafeself-references. Callback form keeps wrapper locals alive inside
the reader's loop, zero copies.
Invariance golden and all ground-truth-vs-R tests bit-identical.
339/339 cargo test, clippy clean with
-D warnings.Side finding (not fixed here)
Writing the reader unit test surfaced a pre-existing MAF bug:
noodles_vcf::Record::samples()includes the FORMAT column(
"GT\t0/0\t0/1"), andGenotypeWriter::push's memchr loopconsumes it as sample[0]. Slot 0 gets a spurious 0-dosage,
all real samples shift by one, the last sample is dropped.
Invariance is tautologically green (golden regenerated under the
same bug); ground-truth tests feed JSON fixtures so they miss it.
Affects every real VCF→MAF run. Separate ticket coming; fix is
trivial (strip FORMAT via one tab-split) but needs the golden
regenerated in the same commit.
Closes #87.