Fix legacy notebook edge cases exposed by MWE sample sheets, compressed FASTQ names, lookup remapping, and Marchenko k=1 by hsun3163 · Pull Request #1317 · StatFunGen/xqtl-protocol

hsun3163 · 2026-04-23T04:00:19Z

Summary

This PR fixes three notebook-side edge cases in the legacy code/ notebooks:

code/molecular_phenotypes/calling/RNA_calling.ipynb
code/data_preprocessing/genotype/GWAS_QC.ipynb
code/data_preprocessing/covariate/covariate_hidden_factor.ipynb

This is intentionally a notebook-only PR. It does not include renovated_code/.

1. `RNA_calling.ipynb`

What is fixed

This notebook had two independent brittle assumptions in the fastqc path.

Sample-sheet parsing now follows the documented contract.
The notebook documents these columns:
- ID
- fq1
- optional fq2
- optional strand
- optional read_length
The old code dropped strand and read_length, then treated every remaining column after ID as an input read file.

This PR changes that logic to:
- use fq1 and optional fq2 when those columns are present
- only fall back to a legacy ID + 1/2 read-column layout when fq1/fq2 are absent
FastQC output naming now handles both fastq and fastq.gz.
The old code expected output basenames using SoS _input:bn, which preserved .fastq for compressed inputs.
This PR replaces that with explicit suffix stripping for:
- .fastq.gz
- .fq.gz
- .fastq
- .fq
- .bam
- .sam
- .cram

Why it failed in the MWE

The first MWE failure was a real parsing error, not just a theoretical concern.

In runs/old_snakemake_20260422_111759, the old notebook treated the extra participant_id column in the MWE sample sheet as if it were a read filename, and generated FastQC
commands against:

mwe_data/fastq/sample1
mwe_data/fastq/sample2

FastQC then failed with:

SequenceFormatException: ID line didn't start with '@' at line 1

Concrete files:

runs/old_snakemake_20260422_111759/ISSUES.md
runs/old_snakemake_20260422_111759/output/AC/molecular_phenotypes/fastqc/sample1_fastqc.stderr
runs/old_snakemake_20260422_111759/output/AC/molecular_phenotypes/fastqc/sample2_fastqc.stderr

After that was worked around with a cleaned sample list, the second MWE failure appeared:

the notebook expected sample1_r1.fastq_fastqc.html
FastQC wrote sample1_r1_fastqc.html

That happened because the MWE uses compressed inputs like:

sample1_r1.fastq.gz
sample1_r2.fastq.gz

Why it worked earlier

These issues stayed hidden when older runs matched the notebook’s narrow assumptions:

sample sheets were minimal and did not include extra metadata columns after ID
input reads were likely plain .fastq, so the notebook’s expected FastQC basename happened to match what FastQC produced

So this is not a new FastQC regression. The MWE exposed two stale notebook assumptions:

“everything after ID is a read file”
“compressed FASTQ basenames behave like uncompressed ones”

Why this should not undermine existing behavior

This change narrows behavior toward the notebook’s own documented interface.

existing documented fq1 / optional fq2 sample sheets continue to work
simple legacy ID + one/two read-column sheets still work
plain .fastq inputs still resolve to the same FastQC output basenames
compressed .fastq.gz inputs now resolve correctly instead of failing

2. `GWAS_QC.ipynb`

What is fixed

This PR fixes the lookup-based sample remapping logic in genotype_phenotype_sample_overlap.

The notebook now:

accepts lookup tables with either:
- sample_id, genotype_id
- sample_id, participant_id
renames phenotype columns through that lookup
filters overlap in genotype-ID space after the rename
keeps the overlap output consistently ordered as genotype_id, sample_id

Why it failed in the MWE

The old logic was internally inconsistent.

Old behavior:

rename phenotype columns from sample1/sample2 to GENO/GENO2
then filter overlap by checking whether the old sample_id values are still present in the renamed phenotype header

That can drop all rows even when the lookup is valid.

Concrete MWE example:

lookup file:
runs/old_snakemake_20260422_111759/inputs/sample_to_genotype_lookup.tsv
mappings:
- sample1 -> GENO
- sample2 -> GENO2

After rename, phenotype columns become:

GENO
GENO2

But the old filter still tested:

sample1/sample2 %in colnames(phenoFile)

That is false, so the overlap becomes empty.

This is recorded in:

runs/old_snakemake_20260422_111759/ISSUES.md

Why it worked earlier

This bug only appears when lookup-based renaming is actually used.

Earlier runs would appear fine if:

phenotype sample IDs already matched genotype IDs
or the lookup branch was effectively bypassed

So the MWE did not invent the bug. It exercised a real remapping path that the notebook had not handled consistently.

Why this should not undermine existing behavior

This change does not broaden the notebook arbitrarily. It makes the rename path self-consistent.

no-lookup runs still behave as before
lookup-based runs now operate in the same ID space before and after rename
the legacy participant_id lookup naming is still accepted

3. `covariate_hidden_factor.ipynb`

What is fixed

This PR adds drop = FALSE to the Marchenko PC extraction step:

from: residExpPC$rotated[, 1:MPPCNum]
to: residExpPC$rotated[, 1:MPPCNum, drop = FALSE]

Why it failed in the MWE

This failure only appears when Marchenko chooses exactly one hidden factor.

In the synthetic HG MWE, the phenotype input was tiny:

notebook_compare/input/route3_hg_synthetic.expression.bed.gz

That input had:

3 phenotype rows
489 samples

In that shape, Marchenko selected:

MPPCNum = 1
“it picked 1 because the Marchenko decision on this 3 x 489 residual matrix yielded only one component above threshold”
When MPPCNum == 1, the old R subset dropped from a matrix to a vector, and downstream code lost the expected matrix structure and sample labeling.

Why it worked earlier

In more production-like expression matrices, Marchenko usually selects more than one factor.

When MPPCNum > 1, the old code stays matrix-shaped and the bug remains hidden.

So this is an MWE-shape-triggered failure, but it exposes a real latent notebook bug.

Why this should not undermine existing behavior

This is the narrowest possible fix.

behavior is unchanged for MPPCNum > 1
when MPPCNum == 1, the code now preserves the expected 2D structure instead of dropping dimensions

Scope

This PR is limited to notebook fixes under code/.

It does not include:

renovated_code/
wrapper-only fixes
Snakemake backend changes
plotting repairs

Fix old notebook edge cases for FastQC, remapping, and Marchenko

e58303f

gaow merged commit 9414a39 into StatFunGen:main Apr 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix legacy notebook edge cases exposed by MWE sample sheets, compressed FASTQ names, lookup remapping, and Marchenko k=1#1317

Fix legacy notebook edge cases exposed by MWE sample sheets, compressed FASTQ names, lookup remapping, and Marchenko k=1#1317
gaow merged 1 commit intoStatFunGen:mainfrom
hsun3163:notebook-only-old-notebook-fixes-upstream

hsun3163 commented Apr 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hsun3163 commented Apr 23, 2026

Summary

1. RNA_calling.ipynb

What is fixed

Why it failed in the MWE

Why it worked earlier

Why this should not undermine existing behavior

2. GWAS_QC.ipynb

What is fixed

Why it failed in the MWE

Why it worked earlier

Why this should not undermine existing behavior

3. covariate_hidden_factor.ipynb

What is fixed

Why it failed in the MWE

Why it worked earlier

Why this should not undermine existing behavior

Scope

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. `RNA_calling.ipynb`

2. `GWAS_QC.ipynb`

3. `covariate_hidden_factor.ipynb`