feat: add transcript length extraction and aggregation to ASE data lo… #4

Merged
nadjano merged 2 commits into master from feature/add_tx_length_check
Feb 24, 2026
Conversation

Collaborator

@nadjano nadjano commented Feb 24, 2026

For quality control, also add the length of each transcript from the oarfish output to the AnnData object.

Summary by Sourcery

Add transcript length extraction from quantification files and propagate it into transcript- and gene-level annotations for ASE data.

New Features:

  • Store per-transcript lengths from quant.sf outputs when available and attach them to the transcript-level AnnData .var.
  • Aggregate transcript lengths to gene level by computing mean, minimum, and maximum length per gene.
  • Classify genes by whether their associated transcripts within each Synt_id have uniform or variable transcript lengths.

Enhancements:

  • Improve aggregation logic to include transcript length-derived metadata alongside existing gene-level annotations.

Contributor

sourcery-ai bot commented Feb 24, 2026

Reviewer's Guide

Adds extraction of transcript lengths from quant.sf files into the transcript-level AnnData, then aggregates transcript length statistics and a synteny-based length-uniformity flag at the gene level, along with minor formatting/cleanup changes.

Sequence diagram for loading and aggregating transcript length information

sequenceDiagram
    participant Q as QuantSfFile
    participant L as _load_sample_counts
    participant LD as load_ase_data
    participant ATG as aggregate_transcripts_to_genes
    participant ATV as adata_tx.var
    participant AGV as adata_gene.var

    Q->>L: read quant.sf
    L->>L: detect format and set index
    alt has len column
        L->>L: set result.em_counts
        L->>L: set result.tx_lengths
    else no len column
        L->>L: set result.em_counts only
    end
    L-->>LD: result with tx_lengths optional

    LD->>LD: iterate sample_results to find first tx_lengths
    alt tx_lengths found
        LD->>ATV: set tx_length by mapping transcript_id to tx_lengths
    else none found
        LD->>LD: log no transcript length data
    end

    LD-->>ATG: adata_tx

    ATG->>ATV: read tx_length and gene_id
    alt tx_length present
        ATG->>AGV: mean_tx_length per gene_id
        ATG->>AGV: min_tx_length per gene_id
        ATG->>AGV: max_tx_length per gene_id
    else tx_length absent
        ATG->>ATG: skip length stats
    end

    alt Synt_id and tx_length present
        ATG->>ATV: read Synt_id, tx_length
        ATG->>AGV: synt_length_category per gene_id
    else missing Synt_id or tx_length
        ATG->>ATG: skip synt_length_category
    end

    ATG-->>LD: adata_gene with length annotations

Class diagram for updated transcript and gene AnnData .var structures

classDiagram
    class TranscriptVar {
        +Index transcript_id
        +Index gene_id
        +Index Synt_id
        +Index haplotype
        +Index CDS_length_category
        +Index CDS_percent_difference
        +Index feature_type
        +Index tx_length
    }

    class GeneVar {
        +Index gene_id
        +Index feature_type
        +Index transcript_id
        +Index Synt_id
        +Index synteny_category
        +Index syntenic_genes
        +Index haplotype
        +Index CDS_length_category
        +Index CDS_percent_difference
        +Index n_transcripts
        +Index mean_tx_length
        +Index min_tx_length
        +Index max_tx_length
        +Index synt_length_category
    }

    TranscriptVar "*" --> "1" GeneVar : aggregates_to

    class LoadAseData {
        +load_ase_data(isoform_counts_dir, quant_dir, filter_dict, samples, conditions)
    }

    class AggregateTranscriptsToGenes {
        +aggregate_transcripts_to_genes(adata_tx)
    }

    LoadAseData ..> TranscriptVar : populates
    AggregateTranscriptsToGenes ..> TranscriptVar : reads
    AggregateTranscriptsToGenes ..> GeneVar : writes

Flow diagram for transcript length extraction and aggregation

flowchart LR
    subgraph Load_sample_counts
        A[quant.sf file] --> B{Format detection}
        B -->|Oarfish format with len| C[Set index tname]
        B -->|Salmon format with len| D[Set index Name]
        C --> E[em_counts from num_reads]
        C --> F[tx_lengths from len]
        D --> E
        D --> F
        E --> G[result.em_counts]
        F --> H[result.tx_lengths]
        G --> I[sample_results]
        H --> I
    end

    subgraph Load_ase_data
        I --> J[Iterate sample_results to find first tx_lengths]
        J -->|found| K[tx_lengths series]
        J -->|not found| L[Log no transcript length data]
        K --> M[isoform_var from all_transcript_ids]
        M --> N[isoform_var.tx_length = map index to tx_lengths]
    end

    subgraph Aggregate_transcripts_to_genes
        N --> O[adata_tx.var contains tx_length]
        O --> P[Group by gene_id to compute mean_tx_length]
        O --> Q[Compute min_tx_length]
        O --> R[Compute max_tx_length]
        P --> S[gene_var.mean_tx_length]
        Q --> T[gene_var.min_tx_length]
        R --> U[gene_var.max_tx_length]

        O --> V{Synt_id present?}
        V -->|yes| W[Build synt_tx with gene_id, Synt_id, tx_length]
        W --> X[Group by Synt_id, count unique tx_length]
        X --> Y[Map to uniform_length or variable_length]
        Y --> Z[Map Synt_id to gene_id]
        Z --> AA[gene_var.synt_length_category]
        V -->|no| AB[Skip synt_length_category]
    end

    AA --> AC[Return gene-level AnnData]
    S --> AC
    T --> AC
    U --> AC

File-Level Changes

Change | Details | Files
Capture transcript lengths from quant.sf and store them per-sample for later use.
  • Extend the per-sample result dict to include a tx_lengths field initialized to None.
  • When parsing Oarfish-formatted quant.sf, if a len column exists, store it as tx_lengths indexed by transcript name.
  • When parsing Salmon-formatted quant.sf, if a len column exists, store it as tx_lengths indexed by transcript name.
polyase/ase_data_loader.py
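The per-sample extraction described above can be sketched as follows. This is an illustrative reconstruction, not the committed polyase code: the column names (`tname`, `len`, `num_reads` for oarfish; `Name`, `NumReads` for Salmon) and the result-dict shape are taken from the PR description and may differ from the actual implementation.

```python
import io

import pandas as pd


def load_sample_counts(quant_sf_text: str) -> dict:
    """Parse one quant.sf table, capturing EM counts and (optionally) lengths."""
    df = pd.read_csv(io.StringIO(quant_sf_text), sep="\t")
    # tx_lengths starts as None so downstream code can detect its absence
    result = {"em_counts": None, "tx_lengths": None}
    if "tname" in df.columns:  # oarfish-style output
        df = df.set_index("tname")
        result["em_counts"] = df["num_reads"]
    else:  # Salmon-style output
        df = df.set_index("Name")
        result["em_counts"] = df["NumReads"]
    if "len" in df.columns:  # lengths are optional; keep None if missing
        result["tx_lengths"] = df["len"]
    return result
```

A file without a `len` column simply leaves `tx_lengths` as `None`, which is what the later "first sample containing tx_lengths" scan relies on.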
Populate transcript-level metadata with transcript lengths in the transcript AnnData object.
  • After collecting all transcript IDs, scan sample_results to find the first sample containing tx_lengths and store it as the global length reference.
  • Log whether transcript lengths were found across samples.
  • If tx_lengths is available, map transcript IDs to lengths and add a tx_length column to isoform_var, logging how many transcripts received length annotations.
polyase/ase_data_loader.py
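The ID-to-length mapping step can be illustrated on a toy table. The variable names (`all_transcript_ids`, `tx_lengths`, `isoform_var`) mirror the PR text but are assumptions about the internals, not the library's public API:

```python
import pandas as pd

all_transcript_ids = ["tx1", "tx2", "tx3"]
# First sample that carried lengths; tx3 is deliberately missing
tx_lengths = pd.Series({"tx1": 500, "tx2": 750})

isoform_var = pd.DataFrame(index=pd.Index(all_transcript_ids, name="transcript_id"))
# Index.map leaves NaN for transcripts with no known length
isoform_var["tx_length"] = isoform_var.index.map(tx_lengths)

n_annotated = isoform_var["tx_length"].notna().sum()
print(f"{n_annotated}/{len(isoform_var)} transcripts received a length annotation")
```

Transcripts absent from the reference sample end up with NaN rather than raising, which keeps the column safe to add unconditionally once any sample has lengths.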
Aggregate transcript length statistics and synteny-based length uniformity from transcript-level to gene-level annotations.
  • During gene-level aggregation, if tx_length is present, compute mean, min, and max transcript lengths per gene and add them to gene_var.
  • If tx_length is missing, log that length aggregation is skipped.
  • If both Synt_id and tx_length exist, compute per-Synt_id length uniqueness to classify each Synt_id as uniform_length or variable_length, map this to genes as synt_length_category, and log category counts.
  • If Synt_id or tx_length are missing, log that synt_length_category computation is skipped.
polyase/ase_data_loader.py
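The gene-level aggregation above can be sketched on a toy `.var` table. Column names follow the PR text; the data and everything else are illustrative:

```python
import pandas as pd

tx_var = pd.DataFrame({
    "gene_id":   ["g1", "g1", "g2", "g2"],
    "Synt_id":   ["s1", "s1", "s1", "s1"],
    "tx_length": [500, 700, 600, 600],
})

# Mean, min, and max transcript length per gene (named aggregation)
length_stats = tx_var.groupby("gene_id")["tx_length"].agg(
    mean_tx_length="mean", min_tx_length="min", max_tx_length="max"
)

# Per-Synt_id: one unique length => uniform, otherwise variable
n_unique = tx_var.groupby("Synt_id")["tx_length"].nunique()
synt_category = n_unique.map(
    lambda n: "uniform_length" if n == 1 else "variable_length"
)
```

Here `s1` spans lengths {500, 600, 700}, so it is classified `variable_length`; a Synt_id whose transcripts all share one length would be `uniform_length`.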
Minor style and readability cleanups around aggregation logic.
  • Reformat the layers_to_aggregate list and csr_matrix conversion expression for readability and consistent indentation.
  • Simplify a comment about metadata aggregation to remove the performance note while keeping intent clear.
  • Remove an unnecessary whitespace-only change near the return statement.
polyase/ase_data_loader.py


Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue and left some high-level feedback:

  • The updated result dict in _load_sample_counts has lost its indentation relative to the function body; re-indent its keys/values to match surrounding code style and avoid confusing diffs.
  • In the Salmon branch you check for 'len' in em_df.columns, but Salmon quant.sf typically uses 'Length'; confirm and adjust the column name so transcript lengths are actually captured for Salmon outputs.
  • The new length-handling and aggregation paths add several print statements inside library code; consider switching these to a logger or making them optional to avoid noisy stdout in downstream use.
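One way to address the Salmon column-name concern is to probe for either spelling rather than hard-coding `'len'`. This is a sketch of a possible fix, not the change actually committed; the helper name is hypothetical, while the Salmon column set (`Name`, `Length`, `EffectiveLength`, `TPM`, `NumReads`) is the standard quant.sf header:

```python
import pandas as pd


def find_length_column(df: pd.DataFrame):
    """Return the transcript-length column name, or None if absent."""
    for candidate in ("len", "Length"):  # oarfish vs. Salmon spelling
        if candidate in df.columns:
            return candidate
    return None


salmon_df = pd.DataFrame(columns=["Name", "Length", "EffectiveLength", "TPM", "NumReads"])
oarfish_df = pd.DataFrame(columns=["tname", "len", "num_reads"])
```

With this, the Salmon branch would pick up `Length` instead of silently skipping length capture.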
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The updated `result` dict in `_load_sample_counts` has lost its indentation relative to the function body; re-indent its keys/values to match surrounding code style and avoid confusing diffs.
- In the Salmon branch you check for `'len'` in `em_df.columns`, but Salmon quant.sf typically uses `'Length'`; confirm and adjust the column name so transcript lengths are actually captured for Salmon outputs.
- The new length-handling and aggregation paths add several `print` statements inside library code; consider switching these to a logger or making them optional to avoid noisy stdout in downstream use.

## Individual Comments

### Comment 1
<location path="polyase/ase_data_loader.py" line_range="545-557" />
<code_context>
+
+    # --- Per-Synt_id: flag whether all transcripts share the same length ---
+    if 'Synt_id' in adata_tx.var.columns and 'tx_length' in adata_tx.var.columns:
+        synt_tx = adata_tx.var[['gene_id', 'Synt_id', 'tx_length']].dropna(subset=['Synt_id', 'tx_length'])
+        synt_length_nunique = synt_tx.groupby('Synt_id')['tx_length'].nunique()
+        synt_uniform = synt_length_nunique.map(lambda n: 'uniform_length' if n == 1 else 'variable_length')
+        gene_synt = tx_var_valid[['gene_id', 'Synt_id']].drop_duplicates('gene_id').set_index('gene_id')
+        gene_var['synt_length_category'] = (
+            gene_synt['Synt_id']
+            .map(synt_uniform)
+            .reindex(unique_genes)
+        )
+        counts = gene_var['synt_length_category'].value_counts()
+        print(f"synt_length_category: {counts.to_dict()}")
+    elif 'Synt_id' not in adata_tx.var.columns:
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The Synt_id/length aggregation mixes all transcripts with a Synt_id, not just those in `tx_var_valid`, which could lead to subtle inconsistencies.

Since `synt_tx` is derived from all of `adata_tx.var`, but other aggregations use `tx_var_valid`, transcripts excluded by `unique_tx_mask` can still affect `synt_length_category`. To keep gene-level metrics consistent, derive `synt_tx` from `tx_var_valid` (or the same filtered subset) before grouping by `Synt_id`.

```suggestion
    # --- Per-Synt_id: flag whether all transcripts share the same length ---
    if 'Synt_id' in adata_tx.var.columns and 'tx_length' in adata_tx.var.columns:
        # Use the same filtered subset (tx_var_valid) used for other gene-level aggregations
        synt_tx = tx_var_valid[['gene_id', 'Synt_id', 'tx_length']].dropna(subset=['Synt_id', 'tx_length'])
        synt_length_nunique = synt_tx.groupby('Synt_id')['tx_length'].nunique()
        synt_uniform = synt_length_nunique.map(lambda n: 'uniform_length' if n == 1 else 'variable_length')
        gene_synt = tx_var_valid[['gene_id', 'Synt_id']].drop_duplicates('gene_id').set_index('gene_id')
        gene_var['synt_length_category'] = (
            gene_synt['Synt_id']
            .map(synt_uniform)
            .reindex(unique_genes)
        )
        counts = gene_var['synt_length_category'].value_counts()
        print(f"synt_length_category: {counts.to_dict()}")
```
</issue_to_address>


Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
@nadjano nadjano merged commit 790ed6a into master Feb 24, 2026
2 checks passed
@nadjano nadjano deleted the feature/add_tx_length_check branch February 24, 2026 15:17