Skip to content

Fix legacy notebook edge cases exposed by MWE sample sheets, compressed FASTQ names, lookup remapping, and Marchenko k=1#1317

Merged
gaow merged 1 commit intoStatFunGen:mainfrom
hsun3163:notebook-only-old-notebook-fixes-upstream
Apr 24, 2026
Merged

Fix legacy notebook edge cases exposed by MWE sample sheets, compressed FASTQ names, lookup remapping, and Marchenko k=1#1317
gaow merged 1 commit intoStatFunGen:mainfrom
hsun3163:notebook-only-old-notebook-fixes-upstream

Conversation

@hsun3163
Copy link
Copy Markdown
Collaborator

Summary

This PR fixes three notebook-side edge cases in the legacy code/ notebooks:

  • code/molecular_phenotypes/calling/RNA_calling.ipynb
  • code/data_preprocessing/genotype/GWAS_QC.ipynb
  • code/data_preprocessing/covariate/covariate_hidden_factor.ipynb

This is intentionally a notebook-only PR. It does not include renovated_code/.

1. RNA_calling.ipynb

What is fixed

This notebook had two independent brittle assumptions in the fastqc path.

  1. Sample-sheet parsing now follows the documented contract.
    The notebook documents these columns:

    • ID
    • fq1
    • optional fq2
    • optional strand
    • optional read_length

    The old code dropped strand and read_length, then treated every remaining column after ID as an input read file.

    This PR changes that logic to:

    • use fq1 and optional fq2 when those columns are present
    • only fall back to a legacy ID + 1/2 read-column layout when fq1/fq2 are absent
  2. FastQC output naming now handles both fastq and fastq.gz.
    The old code expected output basenames using SoS _input:bn, which preserved .fastq for compressed inputs.
    This PR replaces that with explicit suffix stripping for:

    • .fastq.gz
    • .fq.gz
    • .fastq
    • .fq
    • .bam
    • .sam
    • .cram

Why it failed in the MWE

The first MWE failure was a real parsing error, not just a theoretical concern.

In runs/old_snakemake_20260422_111759, the old notebook treated the extra participant_id column in the MWE sample sheet as if it were a read filename, and generated FastQC
commands against:

  • mwe_data/fastq/sample1
  • mwe_data/fastq/sample2

FastQC then failed with:

  • SequenceFormatException: ID line didn't start with '@' at line 1

Concrete files:

  • runs/old_snakemake_20260422_111759/ISSUES.md
  • runs/old_snakemake_20260422_111759/output/AC/molecular_phenotypes/fastqc/sample1_fastqc.stderr
  • runs/old_snakemake_20260422_111759/output/AC/molecular_phenotypes/fastqc/sample2_fastqc.stderr

After that was worked around with a cleaned sample list, the second MWE failure appeared:

  • the notebook expected sample1_r1.fastq_fastqc.html
  • FastQC wrote sample1_r1_fastqc.html

That happened because the MWE uses compressed inputs like:

  • sample1_r1.fastq.gz
  • sample1_r2.fastq.gz

Why it worked earlier

These issues stayed hidden when older runs matched the notebook’s narrow assumptions:

  • sample sheets were minimal and did not include extra metadata columns after ID
  • input reads were likely plain .fastq, so the notebook’s expected FastQC basename happened to match what FastQC produced

So this is not a new FastQC regression. The MWE exposed two stale notebook assumptions:

  • “everything after ID is a read file”
  • “compressed FASTQ basenames behave like uncompressed ones”

Why this should not undermine existing behavior

This change narrows behavior toward the notebook’s own documented interface.

  • existing documented fq1 / optional fq2 sample sheets continue to work
  • simple legacy ID + one/two read-column sheets still work
  • plain .fastq inputs still resolve to the same FastQC output basenames
  • compressed .fastq.gz inputs now resolve correctly instead of failing

2. GWAS_QC.ipynb

What is fixed

This PR fixes the lookup-based sample remapping logic in genotype_phenotype_sample_overlap.

The notebook now:

  • accepts lookup tables with either:
    • sample_id, genotype_id
    • sample_id, participant_id
  • renames phenotype columns through that lookup
  • filters overlap in genotype-ID space after the rename
  • keeps the overlap output consistently ordered as genotype_id, sample_id

Why it failed in the MWE

The old logic was internally inconsistent.

Old behavior:

  1. rename phenotype columns from sample1/sample2 to GENO/GENO2
  2. then filter overlap by checking whether the old sample_id values are still present in the renamed phenotype header

That can drop all rows even when the lookup is valid.

Concrete MWE example:

  • lookup file:
    runs/old_snakemake_20260422_111759/inputs/sample_to_genotype_lookup.tsv
  • mappings:
    • sample1 -> GENO
    • sample2 -> GENO2

After rename, phenotype columns become:

  • GENO
  • GENO2

But the old filter still tested:

  • sample1/sample2 %in colnames(phenoFile)

That is false, so the overlap becomes empty.

This is recorded in:

  • runs/old_snakemake_20260422_111759/ISSUES.md

Why it worked earlier

This bug only appears when lookup-based renaming is actually used.

Earlier runs would appear fine if:

  • phenotype sample IDs already matched genotype IDs
  • or the lookup branch was effectively bypassed

So the MWE did not invent the bug. It exercised a real remapping path that the notebook had not handled consistently.

Why this should not undermine existing behavior

This change does not broaden the notebook arbitrarily. It makes the rename path self-consistent.

  • no-lookup runs still behave as before
  • lookup-based runs now operate in the same ID space before and after rename
  • the legacy participant_id lookup naming is still accepted

3. covariate_hidden_factor.ipynb

What is fixed

This PR adds drop = FALSE to the Marchenko PC extraction step:

  • from: residExpPC$rotated[, 1:MPPCNum]
  • to: residExpPC$rotated[, 1:MPPCNum, drop = FALSE]

Why it failed in the MWE

This failure only appears when Marchenko chooses exactly one hidden factor.

In the synthetic HG MWE, the phenotype input was tiny:

  • notebook_compare/input/route3_hg_synthetic.expression.bed.gz

That input had:

  • 3 phenotype rows
  • 489 samples

In that shape, Marchenko selected:

  • MPPCNum = 1
  • “it picked 1 because the Marchenko decision on this 3 x 489 residual matrix yielded only one component above threshold”
    When MPPCNum == 1, the old R subset dropped from a matrix to a vector, and downstream code lost the expected matrix structure and sample labeling.

Why it worked earlier

In more production-like expression matrices, Marchenko usually selects more than one factor.

When MPPCNum > 1, the old code stays matrix-shaped and the bug remains hidden.

So this is an MWE-shape-triggered failure, but it exposes a real latent notebook bug.

Why this should not undermine existing behavior

This is the narrowest possible fix.

  • behavior is unchanged for MPPCNum > 1
  • when MPPCNum == 1, the code now preserves the expected 2D structure instead of dropping dimensions

Scope

This PR is limited to notebook fixes under code/.

It does not include:

  • renovated_code/
  • wrapper-only fixes
  • Snakemake backend changes
  • plotting repairs

@gaow gaow merged commit 9414a39 into StatFunGen:main Apr 24, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants