Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
191 changes: 191 additions & 0 deletions .github-issue-primer-trimming-tags.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,191 @@
# Add Primer/Adapter Trimming Tags to BAM Files

## Background

The aa-tRNA-seq pipeline aligns reads to reference sequences that include both 5' and 3' adapters flanking the tRNA sequences:
- **5' adapter**: 24 bp (`CCTAAGAGCAAGAAGAAGCCTGGN`)
- **3' adapter**: 40 bp (discriminating region starts at position 23 from 3' end)
- **tRNA sequence**: Variable length in the middle

Currently, the pipeline retains full alignment coordinates including adapters, which is appropriate for the Remora classification step (which analyzes signal over the CCA-adapter junction). However, for downstream analysis, it would be valuable to have explicit tags indicating the boundaries between adapters and tRNA sequences.

## Current Pipeline Behavior

1. Reads are aligned to references containing adapters (`bwa_align` rule)
2. Classification happens on these full-length alignments (`classify_charging` rule)
3. Tags are transferred to create final BAM (`transfer_bam_tags` rule)
4. **No explicit primer/adapter coordinate information is stored in BAM tags**

The `filter_reads.py` script exists with logic to identify full-length reads based on adapter presence, but this script is not currently integrated into the pipeline workflow.

## Proposed Approaches

After analyzing several options, I recommend **Option 5** as the most robust solution:

### ❌ Option 1: Hard/Soft Clipping (NOT RECOMMENDED)
Modify CIGAR strings to clip adapter sequences.
- **Pros**: Standard SAM/BAM approach
- **Cons**:
- Destructive for hard clipping (loses sequence)
- Complicates Remora analysis (needs CCA-adapter signal)
- Breaks current reference-based filtering logic
- Would require re-alignment or complex CIGAR manipulation

### ❌ Option 2: Separate Trimmed BAM (NOT RECOMMENDED)
Create a parallel trimmed BAM output.
- **Pros**: Preserves both versions
- **Cons**:
- Storage overhead
- Workflow complexity
- Potential user confusion about which file to use

### ✅ Option 5: Add Custom Coordinate Tags (RECOMMENDED)

Add custom BAM tags indicating tRNA boundaries while preserving original alignments.

**Proposed tag schema:**

```
ts:i:X # tRNA start - reference coordinate where tRNA begins (0-based)
te:i:X # tRNA end - reference coordinate where tRNA ends (0-based)
a5:i:X # actual 5' adapter length found in this alignment
a3:i:X # actual 3' adapter length found in this alignment
```

**Example for a full-length read:**
```
Read aligned to reference: positions 0-150
Reference structure: [24bp 5' adapter][~80bp tRNA][40bp 3' adapter]

Tags added:
ts:i:24 # tRNA starts at position 24
te:i:110 # tRNA ends at position 110
a5:i:24 # Full 5' adapter present
a3:i:40 # Full 3' adapter present
```

**Example for a truncated read:**
```
Read aligned to reference: positions 5-150 (missing first 5bp of 5' adapter)
Reference structure: [19bp 5' adapter][~80bp tRNA][40bp 3' adapter]

Tags added:
ts:i:24 # tRNA starts at reference position 24 (standard)
te:i:110 # tRNA ends at reference position 110
a5:i:19 # Only 19bp of 5' adapter present
a3:i:40 # Full 3' adapter present
```

### Why This Approach is Most Robust

1. **Non-destructive**: Preserves original alignment coordinates needed for Remora
2. **Handles truncation**: Explicitly stores actual adapter lengths found
3. **Clear semantics**: Unambiguous meaning for downstream analysis
4. **Flexible**: Can be used to extract trimmed sequences or calculate lengths
5. **Compatible**: Doesn't interfere with existing pipeline logic
6. **Informative**: Provides both boundaries and actual adapter presence

## Implementation Strategy

### Approach A: New Snakemake Rule (Recommended)
Add a new rule between `transfer_bam_tags` and downstream analysis:

```python
rule add_adapter_tags:
"""
Add tags indicating tRNA boundaries and adapter lengths
"""
input:
bam=rules.transfer_bam_tags.output.classified_bam,
bai=rules.transfer_bam_tags.output.classified_bam_bai,
output:
tagged_bam=os.path.join(outdir, "bam", "final", "{sample}.tagged.bam"),
tagged_bai=os.path.join(outdir, "bam", "final", "{sample}.tagged.bam.bai"),
params:
src=SCRIPT_DIR,
adapter_5p_len=24, # From config
adapter_3p_len=40, # From config
shell:
"""
python {params.src}/add_adapter_tags.py \
--input {input.bam} \
--output {output.tagged_bam} \
--adapter-5p {params.adapter_5p_len} \
--adapter-3p {params.adapter_3p_len}

samtools index {output.tagged_bam}
"""
```

**New script: `workflow/scripts/add_adapter_tags.py`**

This script would:
1. Read each aligned read
2. Calculate adapter boundaries based on `reference_start`, `reference_end`, and reference length
3. Determine actual adapter lengths (accounting for truncation)
4. Add the four custom tags (`ts`, `te`, `a5`, `a3`)
5. Write modified BAM

### Approach B: Integrate into transfer_tags.py
Modify the existing `transfer_tags.py` to also calculate and add adapter tags.

- **Pros**: Single pass through BAM
- **Cons**: Mixes responsibilities (tag transfer vs tag creation)

**Recommendation**: Use Approach A for cleaner separation of concerns.

## Configuration Changes

Add adapter length parameters to `config/config-base.yml`:

```yaml
# Adapter lengths for primer trimming tags
adapter_5p_length: 24 # 5' adapter length in reference
adapter_3p_length: 40 # 3' adapter length in reference
```

## Edge Cases to Handle

1. **Unmapped reads**: Skip tag addition
2. **Secondary/supplementary alignments**: Skip or apply same logic as primary
3. **Truncated adapters**: Store actual lengths found (see example above)
4. **Reads shorter than adapters**: Flag appropriately (could add `tf:i` tag for "truncation flags")
5. **Reverse strand**: Calculate coordinates appropriately (though current pipeline filters for positive strand)

## Validation Strategy

1. **Unit tests**: Test `add_adapter_tags.py` with synthetic alignments
2. **Integration test**: Run on test dataset and verify:
- All passing reads have expected tags
- Tag values match expected adapter boundaries
- Truncated reads have reduced `a5`/`a3` values
3. **Downstream compatibility**: Verify tags don't break existing analysis scripts

## Documentation Updates Needed

1. Update `CLAUDE.md` with new tags and their meanings
2. Update `README.md` to mention primer trimming tags in output
3. Add docstrings to new script explaining tag format
4. Update workflow DAG if rule structure changes

## Alternative: Standardized Tag Names

If there's an emerging standard for primer/adapter tags in the ONT community, we should consider using those. However, as of now, custom tags seem to be the norm for this type of metadata.

Possible alternatives:
- `pt:Z:24-110` (single string format: "primer trim coordinates")
- `BC:Z:...` and `QT:Z:...` (barcode/quality trim - but semantically different)
- Store in SAM optional fields with better namespace (e.g., `ad:Z:5=24,3=40`)

## References

- SAM/BAM format specification: https://samtools.github.io/hts-specs/SAMv1.pdf
- Current pipeline filtering logic: `workflow/scripts/filter_reads.py`
- Remora classification documentation: Uses 6-mer kmer at CCA-adapter junction

## Questions for Discussion

1. Should we integrate `filter_reads.py` into the pipeline as part of this work?
2. Should adapter lengths be reference-specific (stored per-sequence) or global config?
3. Do we want to add a "trimmed view" output format for tools that can't parse custom tags?
4. Should we version the tag format in case it needs to change later?
67 changes: 67 additions & 0 deletions HOW_TO_FILE_ISSUE.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
# How to File the Primer Trimming Tags Issue on GitHub

Since the GitHub CLI is not available in this environment, please file the issue manually using the content in `ISSUE_PRIMER_TRIMMING_TAGS.md`.

## Steps to File Issue

1. **Navigate to the repository**: https://github.com/rnabioco/aa-tRNA-seq-pipeline/issues

2. **Click "New Issue"**

3. **Title**: `Add Primer/Adapter Trimming Tags to BAM Files`

4. **Labels to add** (if available):
- `enhancement`
- `pipeline`
- `documentation`

5. **Body**: Copy the entire content of `ISSUE_PRIMER_TRIMMING_TAGS.md`

## Quick Summary for Issue Description

If you prefer a shorter issue, here's a condensed version:

---

### Title
Add Primer/Adapter Trimming Tags to BAM Files

### Description

**Problem**: The pipeline aligns reads to references containing adapters, but adapter boundaries are not explicitly stored in BAM tags, requiring downstream tools to infer them from reference structure.

**Proposed Solution**: Add four custom BAM tags to store tRNA boundaries and adapter information:
- `ts:i` - tRNA start position (reference coordinate)
- `te:i` - tRNA end position (reference coordinate)
- `a5:i` - actual 5' adapter length in alignment
- `a3:i` - actual 3' adapter length in alignment

**Benefits**:
- Non-destructive (preserves alignments needed for Remora)
- Explicitly stores adapter boundaries
- Handles truncated adapters correctly
- Enables easier downstream analysis

**Implementation**: See `ISSUE_PRIMER_TRIMMING_TAGS.md` for detailed implementation plan including:
- New script: `workflow/scripts/add_adapter_tags.py`
- New rule: `add_adapter_tags`
- Config updates: Add `adapter_5p_length` and `adapter_3p_length` parameters
- Update downstream rules to use tagged BAM

See `ISSUE_PRIMER_TRIMMING_TAGS.md` and `.github-issue-primer-trimming-tags.md` for complete analysis and implementation details.

---

## Files in This Branch

- **ISSUE_PRIMER_TRIMMING_TAGS.md**: Complete issue description with implementation plan (use this as the issue body)
- **.github-issue-primer-trimming-tags.md**: Detailed technical analysis of different approaches
- **HOW_TO_FILE_ISSUE.md**: This file

## Next Steps After Filing Issue

1. Get feedback from maintainers on the proposed approach
2. Discuss any concerns about tag naming or format
3. Confirm adapter length configuration approach
4. Implement the solution once approved
5. Add tests and documentation
Loading