
Fix/allele specific structure testing ratio error #3

Merged
nadjano merged 2 commits into master from fix/allele_specific_structure_testing_ratio_error
Feb 24, 2026

Conversation

nadjano (Collaborator) commented Feb 24, 2026

When matching reference isoforms to transcripts in other haplotypes, each reference isoform searched the full transcript pool independently, so two different reference isoforms could claim the same target transcript. Both would then report identical counts and, in the worst case, an isoform_ratio of 1.0 each, which is mathematically impossible.

The fix replaces this independent matching with a greedy one-to-one assignment: similarity scores are computed upfront for all reference isoform × haplotype transcript pairs, then pairs are assigned in descending order of similarity so each haplotype transcript can be claimed only once. Only the plotting data generation is affected; the statistical testing logic is unchanged.
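The greedy one-to-one assignment described above can be sketched as follows. This is an illustrative outline, not the actual polyase implementation: the function name `greedy_match` and the flat `(ref_iso_id, hap_tx_id) -> score` dict are assumptions for the sake of the example.

```python
def greedy_match(similarities, min_similarity=0.5):
    """Greedy one-to-one assignment of reference isoforms to haplotype transcripts.

    similarities: dict mapping (ref_iso_id, hap_tx_id) -> similarity score.
    Returns a dict mapping each matched ref_iso_id to (hap_tx_id, score).
    """
    # Sort all candidate pairs by similarity, best first.
    pairs = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    used_ref, used_hap = set(), set()
    matches = {}
    for (ref_id, hap_id), score in pairs:
        if score < min_similarity:
            break  # pairs are sorted descending, so the rest cannot pass
        if ref_id in used_ref or hap_id in used_hap:
            continue  # each isoform and each transcript can be claimed once
        matches[ref_id] = (hap_id, score)
        used_ref.add(ref_id)
        used_hap.add(hap_id)
    return matches
```

Because every haplotype transcript is consumed on first use, two reference isoforms can no longer report counts from the same target, which is what produced the impossible duplicate ratios.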

Summary by Sourcery

Fix isoform-to-transcript matching for plotting and enrich ASE data with transcript length metadata and gene-level length summaries.

Bug Fixes:

  • Ensure reference isoforms are matched to haplotype transcripts via a greedy one-to-one similarity-based assignment to prevent multiple isoforms claiming the same transcript and producing invalid isoform ratios.

Enhancements:

  • Add loading of transcript lengths from quantification outputs into the ASE dataset and propagate them into transcript-level metadata.
  • Aggregate transcript length statistics to the gene level and derive a synteny-based length uniformity category for downstream analyses.
  • Lightly tidy aggregation logic and metadata comments in the transcript-to-gene aggregation code.

sourcery-ai bot (Contributor) commented Feb 24, 2026

Reviewer's Guide

Implements a greedy one-to-one isoform-to-transcript matching strategy for plotting to avoid multiple reference isoforms mapping to the same haplotype transcript, and propagates transcript length metadata from quantification files into transcript- and gene-level AnnData, including aggregated length statistics and synteny-based length uniformity flags.

ER diagram for transcript and gene metadata with new length fields

```mermaid
erDiagram
    Transcript }o--|| Gene : belongs_to

    Transcript {
        string transcript_id
        string gene_id
        string haplotype
        int tx_length
        string Synt_id
        string synteny_category
        string CDS_length_category
        float CDS_percent_difference
    }

    Gene {
        string gene_id
        float mean_tx_length
        int min_tx_length
        int max_tx_length
        string synt_length_category
        int n_transcripts
    }

    SyntenyGroup ||--o{ Transcript : groups
    SyntenyGroup ||--o{ Gene : summarizes

    SyntenyGroup {
        string Synt_id
        string synt_length_category
    }
```

File-Level Changes

Replace per-isoform independent best-hit matching with greedy global one-to-one assignment when mapping reference isoforms to haplotype transcripts for plotting.
  • Remove the docstring on _match_all_isoforms_for_plotting (no behavior change).
  • For the reference haplotype, directly map each reference isoform to itself with similarity 1.0 and skip further processing.
  • For non-reference haplotypes, compute structure similarity for all reference-isoform × haplotype-transcript pairs and store in a similarity matrix.
  • Sort all isoform–transcript pairs by similarity descending and greedily assign pairs while preventing reuse of haplotype transcripts and reference isoforms, respecting the min_similarity_for_matching threshold.
  • Stop iterating remaining pairs once similarity falls below the minimum threshold since they are sorted descending.
Files: polyase/stats.py
Load per-transcript length information from quant.sf files when available and attach it to transcript-level AnnData.
  • Extend the per-sample _load_sample_counts result dict to include a tx_lengths field, defaulting to None.
  • When parsing Oarfish- or Salmon-style quant.sf, extract the 'len' column (if present) and store it in tx_lengths alongside em_counts.
  • In load_ase_data, scan sample_results for the first sample that has tx_lengths, log whether lengths were found, and keep that Series for later use.
  • Map the collected tx_lengths onto isoform_var (transcript-level .var) as a new tx_length column and log how many transcripts received length annotations.
Files: polyase/ase_data_loader.py
Aggregate transcript length information and synteny-based length uniformity at the gene level when building gene-level AnnData.
  • During aggregate_transcripts_to_genes, if tx_length exists in transcript .var, compute mean, min, and max transcript lengths per gene and store them in gene_var.
  • Compute per-Synt_id whether all transcripts have identical length vs variable length using tx_length, then map this to genes as synt_length_category and log category counts.
  • Add defensive logging paths when Synt_id or tx_length are missing to explain why length-based synteny annotations are skipped.
  • Refactor the layer aggregation code slightly for readability while preserving behavior, including a minor comment adjustment about groupby-based metadata aggregation.
Files: polyase/ase_data_loader.py
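The gene-level aggregation described above can be sketched with pandas. The column names (`gene_id`, `Synt_id`, `tx_length`) follow the PR, but the surrounding code and the toy data are assumptions for illustration:

```python
import pandas as pd

# Toy transcript-level .var table; in polyase this would be adata_tx.var.
tx_var = pd.DataFrame({
    "gene_id":   ["g1", "g1", "g2", "g2"],
    "Synt_id":   ["s1", "s1", "s1", "s1"],
    "tx_length": [500, 500, 500, 620],
})

# Mean/min/max transcript length per gene.
gene_var = tx_var.groupby("gene_id")["tx_length"].agg(
    mean_tx_length="mean",
    min_tx_length="min",
    max_tx_length="max",
)

# Per-Synt_id: uniform vs. variable transcript length across the group.
nuniq = (
    tx_var.dropna(subset=["Synt_id", "tx_length"])
    .groupby("Synt_id")["tx_length"]
    .nunique()
)
synt_cat = nuniq.map(lambda n: "uniform_length" if n == 1 else "variable_length")

# Map the per-Synt_id category back onto genes via each gene's Synt_id.
gene_synt = tx_var.drop_duplicates("gene_id").set_index("gene_id")["Synt_id"]
gene_var["synt_length_category"] = gene_synt.map(synt_cat)
```

Note that because g2 carries a 620 bp transcript, the shared synteny group s1 is variable-length, so both genes inherit `variable_length` even though g1's own transcripts are identical in length.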


sourcery-ai bot left a comment

Hey - I've found 2 issues and left some high-level feedback:

  • In the greedy assignment loop in _match_all_isoforms_for_plotting, the condition if similarity < min_similarity_for_matching: continue # remaining pairs will only be worse should likely break instead of continue, since pairs are sorted descending and anything after the first below-threshold similarity can’t pass the filter.
  • The new greedy matching builds a full similarity_matrix dict of all ref-isoform × hap-transcript pairs; if these sets are large, consider using a more memory-efficient structure (e.g. a list of (ref_iso_id, idx, similarity) tuples) or computing similarities incrementally to avoid materializing the full dense matrix.
  • In _load_sample_counts, the indentation of result’s keys and the if 'len' in em_df.columns: block for the Oarfish branch looks inconsistent with the surrounding code and may cause readability or execution issues; align it with the Salmon branch pattern.

## Individual Comments

### Comment 1
<location path="polyase/ase_data_loader.py" line_range="85-83" />
<code_context>
                 result['em_counts'] = em_df['num_reads']
+                if 'len' in em_df.columns:              
+                        result['tx_lengths'] = em_df['len'] 
             elif 'Name' in em_df.columns and 'NumReads' in em_df.columns:
                 print("Salmon quant.sf format detected")
                 em_df = em_df.set_index('Name')
                 result['em_counts'] = em_df['NumReads']
+                if 'len' in em_df.columns:              
+                    result['tx_lengths'] = em_df['len'] 
             else:
</code_context>
<issue_to_address>
**suggestion:** Length column detection for Salmon quant.sf may not match typical column naming.

Salmon `quant.sf` typically uses a `Length` column (capital L), so checking only for `len` may leave `tx_lengths` unset for common outputs. Please handle both (e.g., use `len` if present, otherwise fall back to `Length`).

Suggested implementation:

```python
                em_df = em_df.set_index('tname')
                result['em_counts'] = em_df['num_reads']
                if 'len' in em_df.columns:
                        result['tx_lengths'] = em_df['len']
                elif 'Length' in em_df.columns:
                        result['tx_lengths'] = em_df['Length']

```

```python
                em_df = em_df.set_index('Name')
                result['em_counts'] = em_df['NumReads']
                if 'len' in em_df.columns:
                    result['tx_lengths'] = em_df['len']
                elif 'Length' in em_df.columns:
                    result['tx_lengths'] = em_df['Length']

```
</issue_to_address>

### Comment 2
<location path="polyase/ase_data_loader.py" line_range="546-551" />
<code_context>
+        print("No 'tx_length' column found in transcript .var, skipping length aggregation")
+
+    # --- Per-Synt_id: flag whether all transcripts share the same length ---
+    if 'Synt_id' in adata_tx.var.columns and 'tx_length' in adata_tx.var.columns:
+        synt_tx = adata_tx.var[['gene_id', 'Synt_id', 'tx_length']].dropna(subset=['Synt_id', 'tx_length'])
+        synt_length_nunique = synt_tx.groupby('Synt_id')['tx_length'].nunique()
+        synt_uniform = synt_length_nunique.map(lambda n: 'uniform_length' if n == 1 else 'variable_length')
+        gene_synt = tx_var_valid[['gene_id', 'Synt_id']].drop_duplicates('gene_id').set_index('gene_id')
+        gene_var['synt_length_category'] = (
+            gene_synt['Synt_id']
+            .map(synt_uniform)
</code_context>
<issue_to_address>
**question:** The synt_length_category computation ignores transcripts without lengths, which may misclassify some Synt_id groups.

Since `synt_tx` drops rows with missing `tx_length`, `nunique()` only reflects transcripts that have a length. If some transcripts in a `Synt_id` lack length, a genuinely variable-length group could be marked `uniform_length` based on the incomplete subset. If this distinction is important downstream, consider marking any `Synt_id` with partially missing lengths as `variable_length` or introducing a separate category instead of ignoring those transcripts.
</issue_to_address>
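One way to implement the reviewer's suggestion is to classify each Synt_id in a single pass that treats missing lengths as their own category rather than silently dropping them. The helper below is hypothetical (its name and the `incomplete_length` label are not part of the PR):

```python
import pandas as pd

def synt_length_category(tx_var):
    """Classify each Synt_id by transcript-length uniformity.

    Groups with any missing tx_length get a distinct 'incomplete_length'
    category instead of being judged on the incomplete subset.
    """
    def classify(lengths):
        if lengths.isna().any():
            return "incomplete_length"
        return "uniform_length" if lengths.nunique() == 1 else "variable_length"

    return tx_var.groupby("Synt_id")["tx_length"].apply(classify)
```

This avoids the misclassification the reviewer describes: a group whose only known lengths happen to agree is no longer reported as `uniform_length` when other members lack lengths entirely.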


nadjano merged commit e697a53 into master on Feb 24, 2026
2 checks passed
nadjano deleted the fix/allele_specific_structure_testing_ratio_error branch on February 24, 2026 at 15:20