Fix legacy notebook edge cases exposed by MWE sample sheets, compressed FASTQ names, lookup remapping, and Marchenko k=1 #1317
Merged
gaow merged 1 commit into StatFunGen:main from Apr 24, 2026
Summary
This PR fixes notebook-side edge cases in three legacy `code/` notebooks:
- `code/molecular_phenotypes/calling/RNA_calling.ipynb`
- `code/data_preprocessing/genotype/GWAS_QC.ipynb`
- `code/data_preprocessing/covariate/covariate_hidden_factor.ipynb`
This is intentionally a notebook-only PR. It does not include `renovated_code/`.

1. `RNA_calling.ipynb`

What is fixed
This notebook had two independent brittle assumptions in the `fastqc` path.
Sample-sheet parsing now follows the documented contract.
The notebook documents these columns:
- `ID`
- `fq1`
- `fq2`
- `strand`
- `read_length`
The old code dropped `strand` and `read_length`, then treated every remaining column after `ID` as an input read file.
This PR changes that logic to:
- use `fq1` and optional `fq2` when those columns are present
- fall back to the `ID` + 1/2 read-column layout when `fq1`/`fq2` are absent
FastQC output naming now handles both `fastq` and `fastq.gz`.
The old code expected output basenames using SoS `_input:bn`, which preserved `.fastq` for compressed inputs.
This PR replaces that with explicit suffix stripping for:
- `.fastq.gz`
- `.fq.gz`
- `.fastq`
- `.fq`
- `.bam`
- `.sam`
- `.cram`

Why it failed in the MWE
The first MWE failure was a real parsing error, not just a theoretical concern.
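To make the parsing error concrete, here is a minimal Python sketch of the two column-selection rules described under "What is fixed". It is not the notebook's code: the function names, the sample-sheet contents, and the choice to raise an error when the fallback layout has more than two read columns are all assumptions made for illustration.

```python
import csv
import io

# Assumption: ID is the first column, and these are the only documented
# metadata columns that the old code removed before collecting read files.
METADATA_COLUMNS = {"strand", "read_length"}


def old_read_columns(header):
    """Old rule: every column after ID, minus strand/read_length, is a read file."""
    return [col for col in header[1:] if col not in METADATA_COLUMNS]


def new_read_columns(header):
    """New rule: prefer fq1 (and optional fq2); otherwise use the ID + 1/2 read-column layout."""
    if "fq1" in header:
        return ["fq1", "fq2"] if "fq2" in header else ["fq1"]
    fallback = [col for col in header[1:] if col not in METADATA_COLUMNS]
    if len(fallback) not in (1, 2):
        # Hypothetical guard, not confirmed by the PR text.
        raise ValueError(f"expected 1 or 2 read columns after ID, got {fallback}")
    return fallback


# Hypothetical sample sheet with an extra participant_id column.
SHEET = (
    "ID\tparticipant_id\tfq1\tfq2\tstrand\tread_length\n"
    "id1\tsample1\tsample1_r1.fastq.gz\tsample1_r2.fastq.gz\trf\t100\n"
)

reader = csv.DictReader(io.StringIO(SHEET), delimiter="\t")
header = reader.fieldnames
rows = list(reader)

old_inputs = [rows[0][col] for col in old_read_columns(header)]
new_inputs = [rows[0][col] for col in new_read_columns(header)]
print(old_inputs)  # includes the participant_id value as if it were a read file
print(new_inputs)  # only the fq1/fq2 values
```

Under the old rule the `participant_id` value is handed on as an input path, which is the kind of non-FASTQ target that FastQC rejects.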
In `runs/old_snakemake_20260422_111759`, the old notebook treated the extra `participant_id` column in the MWE sample sheet as if it were a read filename, and generated FastQC commands against:
- `mwe_data/fastq/sample1`
- `mwe_data/fastq/sample2`
FastQC then failed with:
- `SequenceFormatException: ID line didn't start with '@' at line 1`
Concrete files:
- `runs/old_snakemake_20260422_111759/ISSUES.md`
- `runs/old_snakemake_20260422_111759/output/AC/molecular_phenotypes/fastqc/sample1_fastqc.stderr`
- `runs/old_snakemake_20260422_111759/output/AC/molecular_phenotypes/fastqc/sample2_fastqc.stderr`
After that was worked around with a cleaned sample list, the second MWE failure appeared:
- expected by the notebook: `sample1_r1.fastq_fastqc.html`
- produced by FastQC: `sample1_r1_fastqc.html`
That happened because the MWE uses compressed inputs like:
- `sample1_r1.fastq.gz`
- `sample1_r2.fastq.gz`

Why it worked earlier
These issues stayed hidden when older runs matched the notebook’s narrow assumptions:
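- sample sheets had only read-file columns after `ID`
- inputs were uncompressed `.fastq`, so the notebook’s expected FastQC basename happened to match what FastQC produced
So this is not a new FastQC regression. The MWE exposed two stale notebook assumptions:
- “every column after `ID` is a read file”
- the FastQC output basename matches SoS `_input:bn`

Why this should not undermine existing behavior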
This change narrows behavior toward the notebook’s own documented interface.
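For the output-naming side of that interface, the explicit suffix stripping can be sketched roughly as follows. This is an illustrative Python approximation, not the notebook's implementation: the helper names are hypothetical, and `old_style_basename` only approximates the behaviour the PR attributes to `_input:bn` (drop the directory and the last extension).

```python
import os

# Suffixes listed in the PR; compound suffixes are checked before simple ones.
KNOWN_SUFFIXES = (".fastq.gz", ".fq.gz", ".fastq", ".fq", ".bam", ".sam", ".cram")


def old_style_basename(path):
    """Approximate the old expectation: strip only the final extension."""
    return os.path.splitext(os.path.basename(path))[0]


def stripped_basename(path):
    """Strip the first matching known suffix; leave other names unchanged."""
    name = os.path.basename(path)
    for suffix in KNOWN_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def expected_fastqc_html(path):
    """Name of the FastQC HTML report the workflow should look for."""
    return stripped_basename(path) + "_fastqc.html"


for example in ("mwe_data/fastq/sample1_r1.fastq.gz", "mwe_data/fastq/sample1_r1.fastq"):
    print(example, old_style_basename(example), expected_fastqc_html(example))
```

For an uncompressed `.fastq` input both helpers give the same basename, while for `.fastq.gz` only the explicit suffix list yields `sample1_r1`.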
- `fq1` / optional `fq2` sample sheets continue to work
- `ID` + one/two read-column sheets still work
- `.fastq` inputs still resolve to the same FastQC output basenames
- `.fastq.gz` inputs now resolve correctly instead of failing

2. `GWAS_QC.ipynb`

What is fixed
This PR fixes the lookup-based sample remapping logic in `genotype_phenotype_sample_overlap`.
The notebook now:
- `sample_id, genotype_id`
- `sample_id, participant_id`
- `genotype_id`, `sample_id`

Why it failed in the MWE
The old logic was internally inconsistent.
Old behavior:
- rename phenotype columns from `sample1`/`sample2` to `GENO`/`GENO2`
- then check whether the original `sample_id` values are still present in the renamed phenotype header
That can drop all rows even when the lookup is valid.
Concrete MWE example:
- `runs/old_snakemake_20260422_111759/inputs/sample_to_genotype_lookup.tsv`
- `sample1 -> GENO`
- `sample2 -> GENO2`
After rename, phenotype columns become:
- `GENO`
- `GENO2`
But the old filter still tested:
- `sample1/sample2 %in% colnames(phenoFile)`
That is false, so the overlap becomes empty.
This is recorded in:
- `runs/old_snakemake_20260422_111759/ISSUES.md`

Why it worked earlier
This bug only appears when lookup-based renaming is actually used.
Earlier runs would appear fine if:
So the MWE did not invent the bug. It exercised a real remapping path that the notebook had not handled consistently.
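The inconsistency in the rename path can be reproduced with a few lines of Python. This is only a sketch: the dictionary and list stand in for the lookup table and phenotype header from the MWE example, the function names are hypothetical, and `consistent_overlap` shows one self-consistent way to filter (testing the mapped IDs against the renamed header), which is an assumption about the shape of the fix rather than the notebook's R code.

```python
# Stand-ins for the MWE lookup (sample_id -> genotype_id) and phenotype header.
lookup = {"sample1": "GENO", "sample2": "GENO2"}
pheno_columns = ["sample1", "sample2"]


def rename_columns(columns, mapping):
    """Rename phenotype columns through the lookup, leaving unmapped names alone."""
    return [mapping.get(col, col) for col in columns]


def old_overlap(columns, mapping):
    """Old logic: rename first, then test the ORIGINAL sample_id values."""
    renamed = rename_columns(columns, mapping)
    return [sample_id for sample_id in mapping if sample_id in renamed]


def consistent_overlap(columns, mapping):
    """Assumed fix: rename first, then test the MAPPED genotype_id values."""
    renamed = rename_columns(columns, mapping)
    return [geno_id for geno_id in mapping.values() if geno_id in renamed]


print(rename_columns(pheno_columns, lookup))
print(old_overlap(pheno_columns, lookup))
print(consistent_overlap(pheno_columns, lookup))
```

With an identity lookup, where sample and genotype IDs already agree, both functions return the same IDs, so the mismatch only surfaces when the lookup actually changes a name.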
Why this should not undermine existing behavior
This change does not broaden the notebook arbitrarily. It makes the rename path self-consistent.
- `participant_id` lookup naming is still accepted

3. `covariate_hidden_factor.ipynb`

What is fixed
This PR adds `drop = FALSE` to the Marchenko PC extraction step:
- old: `residExpPC$rotated[, 1:MPPCNum]`
- new: `residExpPC$rotated[, 1:MPPCNum, drop = FALSE]`

Why it failed in the MWE
This failure only appears when Marchenko chooses exactly one hidden factor.
In the synthetic HG MWE, the phenotype input was tiny:
- `notebook_compare/input/route3_hg_synthetic.expression.bed.gz`
That input had:
In that shape, Marchenko selected:
- `MPPCNum = 1`
When `MPPCNum == 1`, the old R subset returned a vector instead of a one-column matrix (R drops single-column dimensions by default), and downstream code lost the expected matrix structure and sample labeling.

Why it worked earlier
In more production-like expression matrices, Marchenko usually selects more than one factor.
When `MPPCNum > 1`, the old code stays matrix-shaped and the bug remains hidden.
So this is an MWE-shape-triggered failure, but it exposes a real latent notebook bug.
Why this should not undermine existing behavior
This is the narrowest possible fix.
- no change when `MPPCNum > 1`
- when `MPPCNum == 1`, the code now preserves the expected 2D structure instead of dropping dimensions

Scope
This PR is limited to notebook fixes under `code/`.
It does not include:
- `renovated_code/`