rnabioco · jayhesselberth · Nov 5, 2025 · Nov 5, 2025
diff --git a/.github-issue-primer-trimming-tags.md b/.github-issue-primer-trimming-tags.md
@@ -0,0 +1,191 @@
+# Add Primer/Adapter Trimming Tags to BAM Files
+
+## Background
+
+The aa-tRNA-seq pipeline aligns reads to reference sequences that include both 5' and 3' adapters flanking the tRNA sequences:
+- **5' adapter**: 24 bp (`CCTAAGAGCAAGAAGAAGCCTGGN`)
+- **3' adapter**: 40 bp (discriminating region starts at position 23 from 3' end)
+- **tRNA sequence**: Variable length in the middle
+
+Currently, the pipeline retains full alignment coordinates including adapters, which is appropriate for the Remora classification step (which analyzes signal over the CCA-adapter junction). However, for downstream analysis, it would be valuable to have explicit tags indicating the boundaries between adapters and tRNA sequences.
+
+## Current Pipeline Behavior
+
+1. Reads are aligned to references containing adapters (`bwa_align` rule)
+2. Classification happens on these full-length alignments (`classify_charging` rule)
+3. Tags are transferred to create final BAM (`transfer_bam_tags` rule)
+4. **No explicit primer/adapter coordinate information is stored in BAM tags**
+
+The `filter_reads.py` script exists with logic to identify full-length reads based on adapter presence, but this script is not currently integrated into the pipeline workflow.
+
+## Proposed Approaches
+
+After analyzing several options, I recommend **Option 5** as the most robust solution:
+
+### ❌ Option 1: Hard/Soft Clipping (NOT RECOMMENDED)
+Modify CIGAR strings to clip adapter sequences.
+- **Pros**: Standard SAM/BAM approach
+- **Cons**:
+  - Destructive for hard clipping (loses sequence)
+  - Complicates Remora analysis (needs CCA-adapter signal)
+  - Breaks current reference-based filtering logic
+  - Would require re-alignment or complex CIGAR manipulation
+
+### ❌ Option 2: Separate Trimmed BAM (NOT RECOMMENDED)
+Create a parallel trimmed BAM output.
+- **Pros**: Preserves both versions
+- **Cons**:
+  - Storage overhead
+  - Workflow complexity
+  - Potential user confusion about which file to use
+
+### ✅ Option 5: Add Custom Coordinate Tags (RECOMMENDED)
+
+Add custom BAM tags indicating tRNA boundaries while preserving original alignments.
+
+**Proposed tag schema:**
+
+```
+ts:i:X    # tRNA start - reference coordinate where tRNA begins (0-based)
+te:i:X    # tRNA end - reference coordinate where tRNA ends (0-based)
+a5:i:X    # actual 5' adapter length found in this alignment
+a3:i:X    # actual 3' adapter length found in this alignment
+```
+
+**Example for a full-length read:**
+```
+Read aligned to reference: positions 0-150
+Reference structure: [24bp 5' adapter][~80bp tRNA][40bp 3' adapter]
+
+Tags added:
+ts:i:24    # tRNA starts at position 24
+te:i:110   # tRNA ends at position 110
+a5:i:24    # Full 5' adapter present
+a3:i:40    # Full 3' adapter present
+```
+
+**Example for a truncated read:**
+```
+Read aligned to reference: positions 5-150 (missing first 5bp of 5' adapter)
+Reference structure: [19bp 5' adapter][~80bp tRNA][40bp 3' adapter]
+
+Tags added:
+ts:i:24    # tRNA starts at reference position 24 (standard)
+te:i:110   # tRNA ends at reference position 110
+a5:i:19    # Only 19bp of 5' adapter present
+a3:i:40    # Full 3' adapter present
+```
+
+### Why This Approach is Most Robust
+
+1. **Non-destructive**: Preserves original alignment coordinates needed for Remora
+2. **Handles truncation**: Explicitly stores actual adapter lengths found
+3. **Clear semantics**: Unambiguous meaning for downstream analysis
+4. **Flexible**: Can be used to extract trimmed sequences or calculate lengths
+5. **Compatible**: Doesn't interfere with existing pipeline logic
+6. **Informative**: Provides both boundaries and actual adapter presence
+
+## Implementation Strategy
+
+### Approach A: New Snakemake Rule (Recommended)
+Add a new rule between `transfer_bam_tags` and downstream analysis:
+
+```python
+rule add_adapter_tags:
+    """
+    Add tags indicating tRNA boundaries and adapter lengths
+    """
+    input:
+        bam=rules.transfer_bam_tags.output.classified_bam,
+        bai=rules.transfer_bam_tags.output.classified_bam_bai,
+    output:
+        tagged_bam=os.path.join(outdir, "bam", "final", "{sample}.tagged.bam"),
+        tagged_bai=os.path.join(outdir, "bam", "final", "{sample}.tagged.bam.bai"),
+    params:
+        src=SCRIPT_DIR,
+        adapter_5p_len=24,  # From config
+        adapter_3p_len=40,  # From config
+    shell:
+        """
+        python {params.src}/add_adapter_tags.py \
+          --input {input.bam} \
+          --output {output.tagged_bam} \
+          --adapter-5p {params.adapter_5p_len} \
+          --adapter-3p {params.adapter_3p_len}
+
+        samtools index {output.tagged_bam}
+        """
+```
+
+**New script: `workflow/scripts/add_adapter_tags.py`**
+
+This script would:
+1. Read each aligned read
+2. Calculate adapter boundaries based on `reference_start`, `reference_end`, and reference length
+3. Determine actual adapter lengths (accounting for truncation)
+4. Add the four custom tags (`ts`, `te`, `a5`, `a3`)
+5. Write modified BAM
+
+### Approach B: Integrate into transfer_tags.py
+Modify the existing `transfer_tags.py` to also calculate and add adapter tags.
+
+- **Pros**: Single pass through BAM
+- **Cons**: Mixes responsibilities (tag transfer vs tag creation)
+
+**Recommendation**: Use Approach A for cleaner separation of concerns.
+
+## Configuration Changes
+
+Add adapter length parameters to `config/config-base.yml`:
+
+```yaml
+# Adapter lengths for primer trimming tags
+adapter_5p_length: 24  # 5' adapter length in reference
+adapter_3p_length: 40  # 3' adapter length in reference
+```
+
+## Edge Cases to Handle
+
+1. **Unmapped reads**: Skip tag addition
+2. **Secondary/supplementary alignments**: Skip or apply same logic as primary
+3. **Truncated adapters**: Store actual lengths found (see example above)
+4. **Reads shorter than adapters**: Flag appropriately (could add `tf:i` tag for "truncation flags")
+5. **Reverse strand**: Calculate coordinates appropriately (though current pipeline filters for positive strand)
+
+## Validation Strategy
+
+1. **Unit tests**: Test `add_adapter_tags.py` with synthetic alignments
+2. **Integration test**: Run on test dataset and verify:
+   - All passing reads have expected tags
+   - Tag values match expected adapter boundaries
+   - Truncated reads have reduced `a5`/`a3` values
+3. **Downstream compatibility**: Verify tags don't break existing analysis scripts
+
+## Documentation Updates Needed
+
+1. Update `CLAUDE.md` with new tags and their meanings
+2. Update `README.md` to mention primer trimming tags in output
+3. Add docstrings to new script explaining tag format
+4. Update workflow DAG if rule structure changes
+
+## Alternative: Standardized Tag Names
+
+If there's an emerging standard for primer/adapter tags in the ONT community, we should consider using those. However, as of now, custom tags seem to be the norm for this type of metadata.
+
+Possible alternatives:
+- `pt:Z:24-110` (single string format: "primer trim coordinates")
+- `BC:Z:...` and `QT:Z:...` (barcode/quality trim - but semantically different)
+- Store in SAM optional fields with better namespace (e.g., `ad:Z:5=24,3=40`)
+
+## References
+
+- SAM/BAM format specification: https://samtools.github.io/hts-specs/SAMv1.pdf
+- Current pipeline filtering logic: `workflow/scripts/filter_reads.py`
+- Remora classification documentation: Uses 6-mer kmer at CCA-adapter junction
+
+## Questions for Discussion
+
+1. Should we integrate `filter_reads.py` into the pipeline as part of this work?
+2. Should adapter lengths be reference-specific (stored per-sequence) or global config?
+3. Do we want to add a "trimmed view" output format for tools that can't parse custom tags?
+4. Should we version the tag format in case it needs to change later?
diff --git a/HOW_TO_FILE_ISSUE.md b/HOW_TO_FILE_ISSUE.md
@@ -0,0 +1,67 @@
+# How to File the Primer Trimming Tags Issue on GitHub
+
+Since the GitHub CLI is not available in this environment, please file the issue manually using the content in `ISSUE_PRIMER_TRIMMING_TAGS.md`.
+
+## Steps to File Issue
+
+1. **Navigate to the repository**: https://github.com/rnabioco/aa-tRNA-seq-pipeline/issues
+
+2. **Click "New Issue"**
+
+3. **Title**: `Add Primer/Adapter Trimming Tags to BAM Files`
+
+4. **Labels to add** (if available):
+   - `enhancement`
+   - `pipeline`
+   - `documentation`
+
+5. **Body**: Copy the entire content of `ISSUE_PRIMER_TRIMMING_TAGS.md`
+
+## Quick Summary for Issue Description
+
+If you prefer a shorter issue, here's a condensed version:
+
+---
+
+### Title
+Add Primer/Adapter Trimming Tags to BAM Files
+
+### Description
+
+**Problem**: The pipeline aligns reads to references containing adapters, but adapter boundaries are not explicitly stored in BAM tags, requiring downstream tools to infer them from reference structure.
+
+**Proposed Solution**: Add four custom BAM tags to store tRNA boundaries and adapter information:
+- `ts:i` - tRNA start position (reference coordinate)
+- `te:i` - tRNA end position (reference coordinate)
+- `a5:i` - actual 5' adapter length in alignment
+- `a3:i` - actual 3' adapter length in alignment
+
+**Benefits**:
+- Non-destructive (preserves alignments needed for Remora)
+- Explicitly stores adapter boundaries
+- Handles truncated adapters correctly
+- Enables easier downstream analysis
+
+**Implementation**: See `ISSUE_PRIMER_TRIMMING_TAGS.md` for detailed implementation plan including:
+- New script: `workflow/scripts/add_adapter_tags.py`
+- New rule: `add_adapter_tags`
+- Config updates: Add `adapter_5p_length` and `adapter_3p_length` parameters
+- Update downstream rules to use tagged BAM
+
+See `ISSUE_PRIMER_TRIMMING_TAGS.md` and `.github-issue-primer-trimming-tags.md` for complete analysis and implementation details.
+
+---
+
+## Files in This Branch
+
+- **ISSUE_PRIMER_TRIMMING_TAGS.md**: Complete issue description with implementation plan (use this as the issue body)
+- **.github-issue-primer-trimming-tags.md**: Detailed technical analysis of different approaches
+- **HOW_TO_FILE_ISSUE.md**: This file
+
+## Next Steps After Filing Issue
+
+1. Get feedback from maintainers on the proposed approach
+2. Discuss any concerns about tag naming or format
+3. Confirm adapter length configuration approach
+4. Implement the solution once approved
+5. Add tests and documentation