feat: add transcript length extraction and aggregation to ASE data lo…#4
Merged
feat: add transcript length extraction and aggregation to ASE data lo…#4
Conversation
Contributor
Reviewer's GuideAdds extraction of transcript lengths from quant.sf files into the transcript-level AnnData, then aggregates transcript length statistics and a synteny-based length-uniformity flag at the gene level, along with minor formatting/cleanup changes. Sequence diagram for loading and aggregating transcript length informationsequenceDiagram
participant Q as QuantSfFile
participant L as _load_sample_counts
participant LD as load_ase_data
participant ATG as aggregate_transcripts_to_genes
participant ATV as adata_tx.var
participant AGV as adata_gene.var
Q->>L: read quant.sf
L->>L: detect format and set index
alt has len column
L->>L: set result.em_counts
L->>L: set result.tx_lengths
else no len column
L->>L: set result.em_counts only
end
L-->>LD: result with tx_lengths optional
LD->>LD: iterate sample_results to find first tx_lengths
alt tx_lengths found
LD->>ATV: set tx_length by mapping transcript_id to tx_lengths
else none found
LD->>LD: log no transcript length data
end
LD-->>ATG: adata_tx
ATG->>ATV: read tx_length and gene_id
alt tx_length present
ATG->>AGV: mean_tx_length per gene_id
ATG->>AGV: min_tx_length per gene_id
ATG->>AGV: max_tx_length per gene_id
else tx_length absent
ATG->>ATG: skip length stats
end
alt Synt_id and tx_length present
ATG->>ATV: read Synt_id, tx_length
ATG->>AGV: synt_length_category per gene_id
else missing Synt_id or tx_length
ATG->>ATG: skip synt_length_category
end
ATG-->>LD: adata_gene with length annotations
Class diagram for updated transcript and gene AnnData .var structuresclassDiagram
class TranscriptVar {
+Index transcript_id
+Index gene_id
+Index Synt_id
+Index haplotype
+Index CDS_length_category
+Index CDS_percent_difference
+Index feature_type
+Index tx_length
}
class GeneVar {
+Index gene_id
+Index feature_type
+Index transcript_id
+Index Synt_id
+Index synteny_category
+Index syntenic_genes
+Index haplotype
+Index CDS_length_category
+Index CDS_percent_difference
+Index n_transcripts
+Index mean_tx_length
+Index min_tx_length
+Index max_tx_length
+Index synt_length_category
}
TranscriptVar "*" --> "1" GeneVar : aggregates_to
class LoadAseData {
+load_ase_data(isoform_counts_dir, quant_dir, filter_dict, samples, conditions)
}
class AggregateTranscriptsToGenes {
+aggregate_transcripts_to_genes(adata_tx)
}
LoadAseData ..> TranscriptVar : populates
AggregateTranscriptsToGenes ..> TranscriptVar : reads
AggregateTranscriptsToGenes ..> GeneVar : writes
Flow diagram for transcript length extraction and aggregationflowchart LR
subgraph Load_sample_counts
A[quant.sf file] --> B{Format detection}
B -->|Oarfish format with len| C[Set index tname]
B -->|Salmon format with len| D[Set index Name]
C --> E[em_counts from num_reads]
C --> F[tx_lengths from len]
D --> E
D --> F
E --> G[result.em_counts]
F --> H[result.tx_lengths]
G --> I[sample_results]
H --> I
end
subgraph Load_ase_data
I --> J[Iterate sample_results to find first tx_lengths]
J -->|found| K[tx_lengths series]
J -->|not found| L[Log no transcript length data]
K --> M[isoform_var from all_transcript_ids]
M --> N[isoform_var.tx_length = map index to tx_lengths]
end
subgraph Aggregate_transcripts_to_genes
N --> O[adata_tx.var contains tx_length]
O --> P[Group by gene_id to compute mean_tx_length]
O --> Q[Compute min_tx_length]
O --> R[Compute max_tx_length]
P --> S[gene_var.mean_tx_length]
Q --> T[gene_var.min_tx_length]
R --> U[gene_var.max_tx_length]
O --> V{Synt_id present?}
V -->|yes| W[Build synt_tx with gene_id, Synt_id, tx_length]
W --> X[Group by Synt_id, count unique tx_length]
X --> Y[Map to uniform_length or variable_length]
Y --> Z[Map Synt_id to gene_id]
Z --> AA[gene_var.synt_length_category]
V -->|no| AB[Skip synt_length_category]
end
AA --> AC[Return gene-level AnnData]
S --> AC
T --> AC
U --> AC
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
Contributor
There was a problem hiding this comment.
Hey - I've found 1 issue, and left some high level feedback:
- The updated
resultdict in_load_sample_countshas lost its indentation relative to the function body; re-indent its keys/values to match surrounding code style and avoid confusing diffs. - In the Salmon branch you check for
'len'inem_df.columns, but Salmon quant.sf typically uses'Length'; confirm and adjust the column name so transcript lengths are actually captured for Salmon outputs. - The new length-handling and aggregation paths add several
printstatements inside library code; consider switching these to a logger or making them optional to avoid noisy stdout in downstream use.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The updated `result` dict in `_load_sample_counts` has lost its indentation relative to the function body; re-indent its keys/values to match surrounding code style and avoid confusing diffs.
- In the Salmon branch you check for `'len'` in `em_df.columns`, but Salmon quant.sf typically uses `'Length'`; confirm and adjust the column name so transcript lengths are actually captured for Salmon outputs.
- The new length-handling and aggregation paths add several `print` statements inside library code; consider switching these to a logger or making them optional to avoid noisy stdout in downstream use.
## Individual Comments
### Comment 1
<location path="polyase/ase_data_loader.py" line_range="545-557" />
<code_context>
+
+ # --- Per-Synt_id: flag whether all transcripts share the same length ---
+ if 'Synt_id' in adata_tx.var.columns and 'tx_length' in adata_tx.var.columns:
+ synt_tx = adata_tx.var[['gene_id', 'Synt_id', 'tx_length']].dropna(subset=['Synt_id', 'tx_length'])
+ synt_length_nunique = synt_tx.groupby('Synt_id')['tx_length'].nunique()
+ synt_uniform = synt_length_nunique.map(lambda n: 'uniform_length' if n == 1 else 'variable_length')
+ gene_synt = tx_var_valid[['gene_id', 'Synt_id']].drop_duplicates('gene_id').set_index('gene_id')
+ gene_var['synt_length_category'] = (
+ gene_synt['Synt_id']
+ .map(synt_uniform)
+ .reindex(unique_genes)
+ )
+ counts = gene_var['synt_length_category'].value_counts()
+ print(f"synt_length_category: {counts.to_dict()}")
+ elif 'Synt_id' not in adata_tx.var.columns:
</code_context>
<issue_to_address>
**suggestion (bug_risk):** The Synt_id/length aggregation mixes all transcripts with a Synt_id, not just those in `tx_var_valid`, which could lead to subtle inconsistencies.
Since `synt_tx` is derived from all of `adata_tx.var`, but other aggregations use `tx_var_valid`, transcripts excluded by `unique_tx_mask` can still affect `synt_length_category`. To keep gene-level metrics consistent, derive `synt_tx` from `tx_var_valid` (or the same filtered subset) before grouping by `Synt_id`.
```suggestion
# --- Per-Synt_id: flag whether all transcripts share the same length ---
if 'Synt_id' in adata_tx.var.columns and 'tx_length' in adata_tx.var.columns:
# Use the same filtered subset (tx_var_valid) used for other gene-level aggregations
synt_tx = tx_var_valid[['gene_id', 'Synt_id', 'tx_length']].dropna(subset=['Synt_id', 'tx_length'])
synt_length_nunique = synt_tx.groupby('Synt_id')['tx_length'].nunique()
synt_uniform = synt_length_nunique.map(lambda n: 'uniform_length' if n == 1 else 'variable_length')
gene_synt = tx_var_valid[['gene_id', 'Synt_id']].drop_duplicates('gene_id').set_index('gene_id')
gene_var['synt_length_category'] = (
gene_synt['Synt_id']
.map(synt_uniform)
.reindex(unique_genes)
)
counts = gene_var['synt_length_category'].value_counts()
print(f"synt_length_category: {counts.to_dict()}")
```
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
Co-authored-by: sourcery-ai[bot] <58596630+sourcery-ai[bot]@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
For quality control also add the length per each transcript form oarfish output to anndata object
Summary by Sourcery
Add transcript length extraction from quantification files and propagate it into transcript- and gene-level annotations for ASE data.
New Features:
Enhancements: