Fix/allele specific structure testing ratio error #3
Reviewer's Guide

Implements a greedy one-to-one isoform-to-transcript matching strategy for plotting to avoid multiple reference isoforms mapping to the same haplotype transcript, and propagates transcript length metadata from quantification files into transcript- and gene-level AnnData, including aggregated length statistics and synteny-based length uniformity flags.

ER diagram for transcript and gene metadata with new length fields

```mermaid
erDiagram
    Transcript }o--|| Gene : belongs_to
    Transcript {
        string transcript_id
        string gene_id
        string haplotype
        int tx_length
        string Synt_id
        string synteny_category
        string CDS_length_category
        float CDS_percent_difference
    }
    Gene {
        string gene_id
        float mean_tx_length
        int min_tx_length
        int max_tx_length
        string synt_length_category
        int n_transcripts
    }
    SyntenyGroup ||--o{ Transcript : groups
    SyntenyGroup ||--o{ Gene : summarizes
    SyntenyGroup {
        string Synt_id
        string synt_length_category
    }
```
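The gene-level length fields in the diagram can be derived from the transcript table with a groupby aggregation. A sketch with toy data (pandas assumed; the column names mirror the diagram, not the project's actual code):

```python
import pandas as pd

# Toy transcript-level table with the fields from the diagram above.
tx_var = pd.DataFrame({
    "transcript_id": ["t1", "t2", "t3"],
    "gene_id":       ["g1", "g1", "g2"],
    "tx_length":     [1000, 1400, 800],
})

# Aggregate per gene: mean/min/max length and transcript count.
gene_var = tx_var.groupby("gene_id")["tx_length"].agg(
    mean_tx_length="mean",
    min_tx_length="min",
    max_tx_length="max",
    n_transcripts="count",
)
print(gene_var)
```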
Hey - I've found 2 issues, and left some high-level feedback:

- In the greedy assignment loop in `_match_all_isoforms_for_plotting`, the condition `if similarity < min_similarity_for_matching: continue  # remaining pairs will only be worse` should likely `break` instead of `continue`, since pairs are sorted descending and anything after the first below-threshold similarity can't pass the filter.
- The new greedy matching builds a full `similarity_matrix` dict of all ref-isoform × hap-transcript pairs; if these sets are large, consider using a more memory-efficient structure (e.g. a list of `(ref_iso_id, idx, similarity)` tuples) or computing similarities incrementally to avoid materializing the full dense matrix.
- In `_load_sample_counts`, the indentation of `result`'s keys and the `if 'len' in em_df.columns:` block for the Oarfish branch looks inconsistent with the surrounding code and may cause readability or execution issues; align it with the Salmon branch pattern.
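The `break`-vs-`continue` point can be illustrated with a minimal sketch (toy data; variable names are hypothetical stand-ins for those in the review):

```python
# Greedy one-to-one matching over pairs pre-sorted by similarity, descending.
# Because the list is sorted, the first pair below the threshold guarantees
# every later pair is also below it, so `break` is safe and skips the tail
# that `continue` would pointlessly walk through.
min_similarity_for_matching = 0.5

pairs = [  # (ref_iso_id, hap_tx_idx, similarity), sorted descending
    ("iso_a", 0, 0.95),
    ("iso_b", 0, 0.90),
    ("iso_b", 1, 0.70),
    ("iso_a", 1, 0.40),  # below threshold: loop can stop here
]

assigned_refs, assigned_haps, matches = set(), set(), {}
for ref_iso_id, idx, similarity in pairs:
    if similarity < min_similarity_for_matching:
        break  # sorted descending: remaining pairs can only be worse
    if ref_iso_id in assigned_refs or idx in assigned_haps:
        continue  # one-to-one: each side may be claimed only once
    matches[ref_iso_id] = idx
    assigned_refs.add(ref_iso_id)
    assigned_haps.add(idx)

print(matches)  # {'iso_a': 0, 'iso_b': 1}
```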
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In the greedy assignment loop in `_match_all_isoforms_for_plotting`, the condition `if similarity < min_similarity_for_matching: continue # remaining pairs will only be worse` should likely `break` instead of `continue`, since pairs are sorted descending and anything after the first below-threshold similarity can’t pass the filter.
- The new greedy matching builds a full `similarity_matrix` dict of all ref-isoform × hap-transcript pairs; if these sets are large, consider using a more memory-efficient structure (e.g. a list of `(ref_iso_id, idx, similarity)` tuples) or computing similarities incrementally to avoid materializing the full dense matrix.
- In `_load_sample_counts`, the indentation of `result`’s keys and the `if 'len' in em_df.columns:` block for the Oarfish branch looks inconsistent with the surrounding code and may cause readability or execution issues; align it with the Salmon branch pattern.
## Individual Comments
### Comment 1
<location path="polyase/ase_data_loader.py" line_range="85-83" />
<code_context>
result['em_counts'] = em_df['num_reads']
+ if 'len' in em_df.columns:
+ result['tx_lengths'] = em_df['len']
elif 'Name' in em_df.columns and 'NumReads' in em_df.columns:
print("Salmon quant.sf format detected")
em_df = em_df.set_index('Name')
result['em_counts'] = em_df['NumReads']
+ if 'len' in em_df.columns:
+ result['tx_lengths'] = em_df['len']
else:
</code_context>
<issue_to_address>
**suggestion:** Length column detection for Salmon quant.sf may not match typical column naming.
Salmon `quant.sf` typically uses a `Length` column (capital L), so checking only for `len` may leave `tx_lengths` unset for common outputs. Please handle both (e.g., use `len` if present, otherwise fall back to `Length`).
Suggested implementation:
```python
em_df = em_df.set_index('tname')
result['em_counts'] = em_df['num_reads']
if 'len' in em_df.columns:
result['tx_lengths'] = em_df['len']
elif 'Length' in em_df.columns:
result['tx_lengths'] = em_df['Length']
```
```python
em_df = em_df.set_index('Name')
result['em_counts'] = em_df['NumReads']
if 'len' in em_df.columns:
result['tx_lengths'] = em_df['len']
elif 'Length' in em_df.columns:
result['tx_lengths'] = em_df['Length']
```
</issue_to_address>
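The suggested fallback can be exercised on a toy frame (pandas assumed; column names follow Salmon's `quant.sf` convention, as in the snippet above):

```python
import pandas as pd

# Toy Salmon-style quant.sf frame: 'Length' (capital L), no 'len' column.
em_df = pd.DataFrame(
    {"Name": ["tx1", "tx2"], "Length": [1500, 2100], "NumReads": [10.0, 3.5]}
).set_index("Name")

result = {"em_counts": em_df["NumReads"]}
# Prefer 'len' if present, otherwise fall back to Salmon's 'Length'.
if "len" in em_df.columns:
    result["tx_lengths"] = em_df["len"]
elif "Length" in em_df.columns:
    result["tx_lengths"] = em_df["Length"]

print(result["tx_lengths"].to_dict())  # {'tx1': 1500, 'tx2': 2100}
```

With only the `'len'` check, `tx_lengths` would silently stay unset here; the `elif` branch is what rescues standard Salmon output.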
### Comment 2
<location path="polyase/ase_data_loader.py" line_range="546-551" />
<code_context>
+ print("No 'tx_length' column found in transcript .var, skipping length aggregation")
+
+ # --- Per-Synt_id: flag whether all transcripts share the same length ---
+ if 'Synt_id' in adata_tx.var.columns and 'tx_length' in adata_tx.var.columns:
+ synt_tx = adata_tx.var[['gene_id', 'Synt_id', 'tx_length']].dropna(subset=['Synt_id', 'tx_length'])
+ synt_length_nunique = synt_tx.groupby('Synt_id')['tx_length'].nunique()
+ synt_uniform = synt_length_nunique.map(lambda n: 'uniform_length' if n == 1 else 'variable_length')
+ gene_synt = tx_var_valid[['gene_id', 'Synt_id']].drop_duplicates('gene_id').set_index('gene_id')
+ gene_var['synt_length_category'] = (
+ gene_synt['Synt_id']
+ .map(synt_uniform)
</code_context>
<issue_to_address>
**question:** The synt_length_category computation ignores transcripts without lengths, which may misclassify some Synt_id groups.
Since `synt_tx` drops rows with missing `tx_length`, `nunique()` only reflects transcripts that have a length. If some transcripts in a `Synt_id` lack length, a genuinely variable-length group could be marked `uniform_length` based on the incomplete subset. If this distinction is important downstream, consider marking any `Synt_id` with partially missing lengths as `variable_length` or introducing a separate category instead of ignoring those transcripts.
</issue_to_address>
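One way to implement the stricter rule suggested in Comment 2 (a sketch with toy data; column names follow the snippet above, the helper is hypothetical):

```python
import pandas as pd

# Toy transcript table: group S1 is uniform, S2 is variable, and S3 has a
# missing length, so its uniformity cannot be judged from the partial subset.
tx = pd.DataFrame({
    "Synt_id":   ["S1", "S1", "S2", "S2", "S3", "S3"],
    "tx_length": [500, 500, 400, 450, 300, None],
})

def classify(lengths: pd.Series) -> str:
    # Any missing length makes the group's uniformity unknowable;
    # conservatively flag it rather than trusting the incomplete subset.
    if lengths.isna().any():
        return "variable_length"
    return "uniform_length" if lengths.nunique() == 1 else "variable_length"

synt_uniform = tx.groupby("Synt_id")["tx_length"].apply(classify)
print(synt_uniform.to_dict())
# {'S1': 'uniform_length', 'S2': 'variable_length', 'S3': 'variable_length'}
```

Dropping the `dropna` and classifying inside one function keeps the "partially missing" case visible instead of silently collapsing it to `uniform_length`.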
When matching reference isoforms to transcripts in other haplotypes, each reference isoform searched the full transcript pool independently, meaning two different reference isoforms could claim the same target transcript — causing both to report identical counts and, in the worst case, isoform_ratio = 1.0 for both, which is mathematically impossible. The fix replaces this independent matching with a greedy one-to-one assignment: similarity scores are computed for all reference isoform × haplotype transcript pairs upfront, then assigned in descending order of similarity so each haplotype transcript can only be claimed once. Only the plotting data generation is affected — the statistical testing logic is unchanged.
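The fix described above can be sketched end to end, using a flat list of scored tuples rather than a dense matrix (toy similarity function and identifiers; not the project's actual code):

```python
from difflib import SequenceMatcher

# Toy stand-in for the real isoform-structure similarity score.
def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

ref_isoforms = {"ref1": "ABCDE", "ref2": "ABCDF"}
hap_transcripts = {"hapA": "ABCDE", "hapB": "ABCDX"}

# Score every ref × hap pair up front as a flat list of tuples
# (lighter than a dict keyed by both ids), then sort descending.
pairs = sorted(
    ((r, h, similarity(rs, hs))
     for r, rs in ref_isoforms.items()
     for h, hs in hap_transcripts.items()),
    key=lambda t: t[2], reverse=True,
)

# Assign greedily: each haplotype transcript may be claimed only once,
# so two reference isoforms can no longer report identical counts.
matches, used_refs, used_haps = {}, set(), set()
for r, h, s in pairs:
    if r in used_refs or h in used_haps:
        continue
    matches[r] = h
    used_refs.add(r)
    used_haps.add(h)

print(matches)  # {'ref1': 'hapA', 'ref2': 'hapB'}
```

With independent matching, both `ref1` and `ref2` could have claimed `hapA`; the shared `used_haps` set is what enforces the one-to-one property.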
Summary by Sourcery
Fix isoform-to-transcript matching for plotting and enrich ASE data with transcript length metadata and gene-level length summaries.