feat: add transcript length extraction and aggregation to ASE data lo… by nadjano · Pull Request #4 · NIB-SI/polyase

nadjano · 2026-02-24T15:06:01Z

For quality control, also add the length of each transcript from the oarfish output to the AnnData object.

Summary by Sourcery

Add transcript length extraction from quantification files and propagate it into transcript- and gene-level annotations for ASE data.

New Features:

Store per-transcript lengths from quant.sf outputs when available and attach them to the transcript-level AnnData .var.
Aggregate transcript lengths to gene level by computing mean, minimum, and maximum length per gene.
Classify genes by whether their associated transcripts within each Synt_id have uniform or variable transcript lengths.

Enhancements:

Improve aggregation logic to include transcript length-derived metadata alongside existing gene-level annotations.

…ading

sourcery-ai · 2026-02-24T15:06:07Z

Reviewer's Guide

Adds extraction of transcript lengths from quant.sf files into the transcript-level AnnData, then aggregates transcript length statistics and a synteny-based length-uniformity flag at the gene level, along with minor formatting/cleanup changes.

Sequence diagram for loading and aggregating transcript length information

sequenceDiagram
    participant Q as QuantSfFile
    participant L as _load_sample_counts
    participant LD as load_ase_data
    participant ATG as aggregate_transcripts_to_genes
    participant ATV as adata_tx.var
    participant AGV as adata_gene.var

    Q->>L: read quant.sf
    L->>L: detect format and set index
    alt has len column
        L->>L: set result.em_counts
        L->>L: set result.tx_lengths
    else no len column
        L->>L: set result.em_counts only
    end
    L-->>LD: result with tx_lengths optional

    LD->>LD: iterate sample_results to find first tx_lengths
    alt tx_lengths found
        LD->>ATV: set tx_length by mapping transcript_id to tx_lengths
    else none found
        LD->>LD: log no transcript length data
    end

    LD-->>ATG: adata_tx

    ATG->>ATV: read tx_length and gene_id
    alt tx_length present
        ATG->>AGV: mean_tx_length per gene_id
        ATG->>AGV: min_tx_length per gene_id
        ATG->>AGV: max_tx_length per gene_id
    else tx_length absent
        ATG->>ATG: skip length stats
    end

    alt Synt_id and tx_length present
        ATG->>ATV: read Synt_id, tx_length
        ATG->>AGV: synt_length_category per gene_id
    else missing Synt_id or tx_length
        ATG->>ATG: skip synt_length_category
    end

    ATG-->>LD: adata_gene with length annotations
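
The "find first tx_lengths, then map onto transcript IDs" step in the diagram above could look roughly like the following. This is a minimal sketch, not the code from the PR: it assumes `sample_results` is a dict keyed by sample name whose values are dicts with an optional `tx_lengths` pandas Series, and the helper name `first_tx_lengths` is hypothetical.

```python
import pandas as pd


def first_tx_lengths(sample_results):
    """Return the first non-None 'tx_lengths' Series found, else None.

    Hypothetical helper; the container shape is an assumption.
    """
    for result in sample_results.values():
        lengths = result.get("tx_lengths")
        if lengths is not None:
            return lengths
    return None


# Toy per-sample results: only the second sample carries lengths.
sample_results = {
    "s1": {"em_counts": pd.Series({"tx1": 5.0, "tx2": 1.0}), "tx_lengths": None},
    "s2": {
        "em_counts": pd.Series({"tx1": 3.0, "tx2": 2.0}),
        "tx_lengths": pd.Series({"tx1": 1200, "tx2": 800}),
    },
}

isoform_var = pd.DataFrame(
    index=pd.Index(["tx1", "tx2", "tx3"], name="transcript_id")
)

tx_lengths = first_tx_lengths(sample_results)
if tx_lengths is not None:
    # Transcripts missing from the length reference (tx3 here) become NaN.
    isoform_var["tx_length"] = isoform_var.index.map(tx_lengths)
```

Taking the first sample that has lengths relies on every sample being quantified against the same reference, so a given transcript ID has the same length in all of them.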

Class diagram for updated transcript and gene AnnData .var structures

classDiagram
    class TranscriptVar {
        +Index transcript_id
        +Index gene_id
        +Index Synt_id
        +Index haplotype
        +Index CDS_length_category
        +Index CDS_percent_difference
        +Index feature_type
        +Index tx_length
    }

    class GeneVar {
        +Index gene_id
        +Index feature_type
        +Index transcript_id
        +Index Synt_id
        +Index synteny_category
        +Index syntenic_genes
        +Index haplotype
        +Index CDS_length_category
        +Index CDS_percent_difference
        +Index n_transcripts
        +Index mean_tx_length
        +Index min_tx_length
        +Index max_tx_length
        +Index synt_length_category
    }

    TranscriptVar "*" --> "1" GeneVar : aggregates_to

    class LoadAseData {
        +load_ase_data(isoform_counts_dir, quant_dir, filter_dict, samples, conditions)
    }

    class AggregateTranscriptsToGenes {
        +aggregate_transcripts_to_genes(adata_tx)
    }

    LoadAseData ..> TranscriptVar : populates
    AggregateTranscriptsToGenes ..> TranscriptVar : reads
    AggregateTranscriptsToGenes ..> GeneVar : writes

Flow diagram for transcript length extraction and aggregation

flowchart LR
    subgraph Load_sample_counts
        A[quant.sf file] --> B{Format detection}
        B -->|Oarfish format with len| C[Set index tname]
        B -->|Salmon format with len| D[Set index Name]
        C --> E[em_counts from num_reads]
        C --> F[tx_lengths from len]
        D --> E
        D --> F
        E --> G[result.em_counts]
        F --> H[result.tx_lengths]
        G --> I[sample_results]
        H --> I
    end

    subgraph Load_ase_data
        I --> J[Iterate sample_results to find first tx_lengths]
        J -->|found| K[tx_lengths series]
        J -->|not found| L[Log no transcript length data]
        K --> M[isoform_var from all_transcript_ids]
        M --> N[isoform_var.tx_length = map index to tx_lengths]
    end

    subgraph Aggregate_transcripts_to_genes
        N --> O[adata_tx.var contains tx_length]
        O --> P[Group by gene_id to compute mean_tx_length]
        O --> Q[Compute min_tx_length]
        O --> R[Compute max_tx_length]
        P --> S[gene_var.mean_tx_length]
        Q --> T[gene_var.min_tx_length]
        R --> U[gene_var.max_tx_length]

        O --> V{Synt_id present?}
        V -->|yes| W[Build synt_tx with gene_id, Synt_id, tx_length]
        W --> X[Group by Synt_id, count unique tx_length]
        X --> Y[Map to uniform_length or variable_length]
        Y --> Z[Map Synt_id to gene_id]
        Z --> AA[gene_var.synt_length_category]
        V -->|no| AB[Skip synt_length_category]
    end

    AA --> AC[Return gene-level AnnData]
    S --> AC
    T --> AC
    U --> AC

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Capture transcript lengths from quant.sf and store them per-sample for later use. | Extend the per-sample result dict to include a tx_lengths field initialized to None.<br>When parsing Oarfish-formatted quant.sf, if a len column exists, store it as tx_lengths indexed by transcript name.<br>When parsing Salmon-formatted quant.sf, if a len column exists, store it as tx_lengths indexed by transcript name. | `polyase/ase_data_loader.py` |
| Populate transcript-level metadata with transcript lengths in the transcript AnnData object. | After collecting all transcript IDs, scan sample_results to find the first sample containing tx_lengths and store it as the global length reference.<br>Log whether transcript lengths were found across samples.<br>If tx_lengths is available, map transcript IDs to lengths and add a tx_length column to isoform_var, logging how many transcripts received length annotations. | `polyase/ase_data_loader.py` |
| Aggregate transcript length statistics and synteny-based length uniformity from transcript-level to gene-level annotations. | During gene-level aggregation, if tx_length is present, compute mean, min, and max transcript lengths per gene and add them to gene_var.<br>If tx_length is missing, log that length aggregation is skipped.<br>If both Synt_id and tx_length exist, compute per-Synt_id length uniqueness to classify each Synt_id as uniform_length or variable_length, map this to genes as synt_length_category, and log category counts.<br>If Synt_id or tx_length is missing, log that synt_length_category computation is skipped. | `polyase/ase_data_loader.py` |
| Minor style and readability cleanups around aggregation logic. | Reformat the layers_to_aggregate list and csr_matrix conversion expression for readability and consistent indentation.<br>Simplify a comment about metadata aggregation to remove the performance note while keeping intent clear.<br>Remove an unnecessary whitespace-only change near the return statement. | `polyase/ase_data_loader.py` |


sourcery-ai

Hey - I've found 1 issue, and left some high level feedback:

The updated result dict in _load_sample_counts has lost its indentation relative to the function body; re-indent its keys/values to match surrounding code style and avoid confusing diffs.
In the Salmon branch you check for 'len' in em_df.columns, but Salmon quant.sf typically uses 'Length'; confirm and adjust the column name so transcript lengths are actually captured for Salmon outputs.
The new length-handling and aggregation paths add several print statements inside library code; consider switching these to a logger or making them optional to avoid noisy stdout in downstream use.
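
One way to address the second and third points together is sketched below, under stated assumptions. It accepts either `len` (the column checked in this PR) or `Length` (the column in Salmon's quant.sf) and reports through the standard `logging` module instead of `print`. The helper name `extract_tx_lengths` and the logger name are hypothetical, not part of polyase.

```python
import io
import logging

import pandas as pd

logger = logging.getLogger("polyase_example")  # hypothetical logger name

# Candidate length columns: 'len' (Oarfish) and 'Length' (Salmon quant.sf).
LENGTH_COLUMNS = ("len", "Length")


def extract_tx_lengths(em_df):
    """Return a Series of transcript lengths, or None if no length column exists."""
    for col in LENGTH_COLUMNS:
        if col in em_df.columns:
            return em_df[col]
    logger.info("No transcript length column found; skipping tx_lengths")
    return None


# Tiny in-memory stand-ins for the two quantification formats.
salmon_text = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1200\t1000.0\t10.0\t5.0\n"
)
oarfish_text = "tname\tlen\tnum_reads\ntx1\t1200\t5.0\n"

salmon_df = pd.read_csv(io.StringIO(salmon_text), sep="\t").set_index("Name")
oarfish_df = pd.read_csv(io.StringIO(oarfish_text), sep="\t").set_index("tname")

salmon_lengths = extract_tx_lengths(salmon_df)
oarfish_lengths = extract_tx_lengths(oarfish_df)
```

With a module-level logger, downstream users decide whether these messages appear by configuring logging themselves, so importing the library stays silent by default.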

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The updated `result` dict in `_load_sample_counts` has lost its indentation relative to the function body; re-indent its keys/values to match surrounding code style and avoid confusing diffs.
- In the Salmon branch you check for `'len'` in `em_df.columns`, but Salmon quant.sf typically uses `'Length'`; confirm and adjust the column name so transcript lengths are actually captured for Salmon outputs.
- The new length-handling and aggregation paths add several `print` statements inside library code; consider switching these to a logger or making them optional to avoid noisy stdout in downstream use.

## Individual Comments

### Comment 1
<location path="polyase/ase_data_loader.py" line_range="545-557" />
<code_context>
+
+    # --- Per-Synt_id: flag whether all transcripts share the same length ---
+    if 'Synt_id' in adata_tx.var.columns and 'tx_length' in adata_tx.var.columns:
+        synt_tx = adata_tx.var[['gene_id', 'Synt_id', 'tx_length']].dropna(subset=['Synt_id', 'tx_length'])
+        synt_length_nunique = synt_tx.groupby('Synt_id')['tx_length'].nunique()
+        synt_uniform = synt_length_nunique.map(lambda n: 'uniform_length' if n == 1 else 'variable_length')
+        gene_synt = tx_var_valid[['gene_id', 'Synt_id']].drop_duplicates('gene_id').set_index('gene_id')
+        gene_var['synt_length_category'] = (
+            gene_synt['Synt_id']
+            .map(synt_uniform)
+            .reindex(unique_genes)
+        )
+        counts = gene_var['synt_length_category'].value_counts()
+        print(f"synt_length_category: {counts.to_dict()}")
+    elif 'Synt_id' not in adata_tx.var.columns:
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The Synt_id/length aggregation mixes all transcripts with a Synt_id, not just those in `tx_var_valid`, which could lead to subtle inconsistencies.

Since `synt_tx` is derived from all of `adata_tx.var`, but other aggregations use `tx_var_valid`, transcripts excluded by `unique_tx_mask` can still affect `synt_length_category`. To keep gene-level metrics consistent, derive `synt_tx` from `tx_var_valid` (or the same filtered subset) before grouping by `Synt_id`.

```suggestion
    # --- Per-Synt_id: flag whether all transcripts share the same length ---
    if 'Synt_id' in adata_tx.var.columns and 'tx_length' in adata_tx.var.columns:
        # Use the same filtered subset (tx_var_valid) used for other gene-level aggregations
        synt_tx = tx_var_valid[['gene_id', 'Synt_id', 'tx_length']].dropna(subset=['Synt_id', 'tx_length'])
        synt_length_nunique = synt_tx.groupby('Synt_id')['tx_length'].nunique()
        synt_uniform = synt_length_nunique.map(lambda n: 'uniform_length' if n == 1 else 'variable_length')
        gene_synt = tx_var_valid[['gene_id', 'Synt_id']].drop_duplicates('gene_id').set_index('gene_id')
        gene_var['synt_length_category'] = (
            gene_synt['Synt_id']
            .map(synt_uniform)
            .reindex(unique_genes)
        )
        counts = gene_var['synt_length_category'].value_counts()
        print(f"synt_length_category: {counts.to_dict()}")
```
</issue_to_address>
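
A toy illustration of why the source of `synt_tx` matters is given below. This is a sketch only: the `keep` mask is a hypothetical stand-in for `unique_tx_mask`, and the `classify` helper is not a function from the PR.

```python
import pandas as pd

# Toy transcript annotations: Synt_id s2 has two different lengths.
tx_var = pd.DataFrame(
    {
        "gene_id": ["g1", "g2", "g3", "g3"],
        "Synt_id": ["s1", "s1", "s2", "s2"],
        "tx_length": [1000, 1000, 900, 950],
    },
    index=["tx1", "tx2", "tx3", "tx4"],
)

# Hypothetical stand-in for unique_tx_mask: tx4 is filtered out.
keep = pd.Series([True, True, True, False], index=tx_var.index)
tx_var_valid = tx_var[keep]


def classify(df):
    """Label each Synt_id by whether its transcripts share a single length."""
    n_lengths = (
        df.dropna(subset=["Synt_id", "tx_length"])
        .groupby("Synt_id")["tx_length"]
        .nunique()
    )
    return n_lengths.map(
        lambda n: "uniform_length" if n == 1 else "variable_length"
    )


unfiltered = classify(tx_var)        # s2 is labelled variable_length
filtered = classify(tx_var_valid)    # s2 is labelled uniform_length
```

In this example the label for `s2` depends on whether the excluded transcript is counted, which is the inconsistency the comment describes.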


polyase/ase_data_loader.py

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

feat: add transcript length extraction and aggregation to ASE data lo…

397d348

…ading

sourcery-ai bot reviewed Feb 24, 2026


Update polyase/ase_data_loader.py

c75e56f

Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>

nadjano merged commit 790ed6a into master Feb 24, 2026
2 checks passed

nadjano deleted the feature/add_tx_length_check branch February 24, 2026 15:17

nadjano merged 2 commits into master from feature/add_tx_length_check
