From d9cb2bf5bae8edece0f62359b249c425596176f6 Mon Sep 17 00:00:00 2001
From: Claude <noreply@anthropic.com>
Date: Wed, 5 Nov 2025 18:19:37 +0000
Subject: [PATCH] Add comprehensive plan for improved sample ingestion

Document detailed proposal for flexible sample sheet format that supports:
- Dorado demux workflow integration
- WarpDemuX barcode demultiplexing
- Multiple input types (raw pod5, basecalled BAM, merged pod5)
- Backward compatibility with existing TSV format
- Metadata tracking and validation

Includes implementation plan, code examples, and testing strategy.
---
 ISSUE_IMPROVED_SAMPLE_INGESTION.md | 290 +++++++++++++++++++++++++++++
 1 file changed, 290 insertions(+)
 create mode 100644 ISSUE_IMPROVED_SAMPLE_INGESTION.md

diff --git a/ISSUE_IMPROVED_SAMPLE_INGESTION.md b/ISSUE_IMPROVED_SAMPLE_INGESTION.md
new file mode 100644
index 0000000..b223fac
--- /dev/null
+++ b/ISSUE_IMPROVED_SAMPLE_INGESTION.md
@@ -0,0 +1,290 @@
+# Issue: Improve sample ingestion to support dorado demux and WarpDemuX workflows
+
+## Problem Statement
+
+The current sample ingestion approach accepts only raw pod5 files in a rigid directory structure. It cannot accommodate demultiplexing workflows such as dorado demux and WarpDemuX (especially WarpDemuX-tRNA), which multiplexed tRNA sequencing experiments increasingly rely on.
+
+## Current Limitations
+
+The existing implementation (`workflow/rules/common.smk:10-87`) has several constraints:
+
+1. **Rigid TSV format**: Only accepts 2 columns (sample_id, data_path)
+2. **Raw pod5 only**: Expects specific directory structure (pod5_pass/pod5_fail/pod5 subdirectories)
+3. **No demux support**: Cannot handle pre-demultiplexed data from dorado or WarpDemuX
+4. **Always rebasecalls**: Cannot skip rebasecalling even when data is already basecalled
+5. **No barcode awareness**: Cannot selectively process specific barcodes from multiplexed runs
+6. **No metadata capture**: Cannot track experimental conditions or sample relationships
+
+## Proposed Solution
+
+### 1. Flexible CSV Sample Sheet Format
+
+Supersede the 2-column TSV, which remains accepted for backward compatibility, with a richer CSV format:
+
+```csv
+sample_id,input_type,input_path,barcode,flow_cell_id,experiment_id,metadata
+sample1,raw_pod5,/path/to/run1,,,231212_exp1,
+sample2,raw_pod5,/path/to/run2,,,231212_exp2,condition=treated
+sample3,dorado_demux,/path/to/demux_output,SQK-NBD114-96_barcode01,PAO12345,231212_exp3,
+sample4,dorado_demux,/path/to/demux_output,SQK-NBD114-96_barcode02,PAO12345,231212_exp3,
+sample5,warpdemux,/path/to/warpdemux_output,1,,,
+sample6,merged_pod5,/path/to/merged.pod5,,,231212_exp4,
+sample7,basecalled_bam,/path/to/basecalled.bam,,,231212_exp5,
+```
+
+**Column Definitions:**
+
+- **sample_id** (required): Unique sample identifier
+- **input_type** (required): One of:
+  - `raw_pod5`: Raw sequencing run directory (current behavior)
+  - `dorado_demux`: Pre-demultiplexed BAM files from dorado demux
+  - `warpdemux`: WarpDemuX prediction CSV output
+  - `merged_pod5`: Pre-merged single pod5 file
+  - `basecalled_bam`: Pre-basecalled BAM file with move tables
+- **input_path** (required): Path to input data
+- **barcode** (optional): Barcode identifier; required when `input_type` is `dorado_demux` or `warpdemux`
+- **flow_cell_id** (optional): Flow cell ID for tracking
+- **experiment_id** (optional): Experiment identifier
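+- **metadata** (optional): `key=value` pairs for additional metadata (e.g. `condition=treated`); the separator for multiple pairs is still to be defined and cannot be an unquoted comma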
+
+### 2. Implementation Architecture
+
+#### A. Enhanced Sample Parsing
+
+```python
+def parse_samples_v2(sample_sheet):
+    """
+    Parse flexible CSV sample sheet supporting multiple input types.
+
+    Returns:
+        Dict structure:
+        {
+            'sample1': {
+                'input_type': 'raw_pod5',
+                'input_path': '/path/to/run1',
+                'barcode': None,
+                'flow_cell_id': None,
+                'experiment_id': '231212_exp1',
+                'metadata': {},
+                'raw_files': [...],
+                'start_rule': 'merge_pods'  # Entry point in workflow
+            }
+        }
+    """
+```
+
+#### B. Input Type Handlers
+
+Each input type needs a specific handler:
+
+```python
+def find_raw_inputs_v2(sample_dict):
+    """Route to appropriate handler based on input_type"""
+    handlers = {
+        'raw_pod5': handle_raw_pod5,
+        'dorado_demux': handle_dorado_demux,
+        'warpdemux': handle_warpdemux,
+        'merged_pod5': handle_merged_pod5,
+        'basecalled_bam': handle_basecalled_bam
+    }
+
+    for sample, info in sample_dict.items():
+        handler = handlers[info['input_type']]
+        handler(sample, info)  # handlers update info in place
+
+    return sample_dict
+```
+
+**Handler Functions:**
+
+- `handle_raw_pod5()`: Current behavior (find pod5 files in subdirectories)
+- `handle_dorado_demux()`: Find demuxed BAM file matching barcode, skip rebasecalling
+- `handle_warpdemux()`: Parse WarpDemuX predictions CSV and filter pod5 by barcode
+- `handle_merged_pod5()`: Single pre-merged pod5 file, start at rebasecall
+- `handle_basecalled_bam()`: Pre-basecalled BAM, start at ubam_to_fastq
+
+#### C. Workflow Entry Point Logic
+
+Modify rules to support conditional entry points:
+
+```python
+def get_fastq_input(wildcards):
+    """
+    Determine input for ubam_to_fastq based on input_type.
+    """
+    info = samples[wildcards.sample]
+    if info['input_type'] in ('raw_pod5', 'merged_pod5', 'warpdemux'):
+        return rules.rebasecall.output
+    if info['input_type'] == 'dorado_demux':
+        return info['demux_bam']  # per-barcode BAM set by handle_dorado_demux(), not the demux directory
+    return info['input_path']  # basecalled_bam: input_path is the BAM itself
+```
+
+### 3. New Rules for Demux Workflows
+
+#### A. WarpDemuX Integration
+
+```python
+rule filter_warpdemux_pods:
+    """
+    Filter pod5 files based on WarpDemuX barcode predictions.
+    """
+    input:
+        predictions=lambda wc: samples[wc.sample]['predictions_file'],
+        pod5_dir=lambda wc: samples[wc.sample]['pod5_dir']
+    output:
+        os.path.join(outdir, "pod5", "{sample}", "{sample}.filtered.pod5")
+    params:
+        barcode=lambda wc: samples[wc.sample]['barcode']
+    shell:
+        """
+        python {SCRIPT_DIR}/filter_warpdemux_reads.py \
+            --predictions {input.predictions} \
+            --pod5-dir {input.pod5_dir} \
+            --barcode {params.barcode} \
+            --output {output}
+        """
+```
+
+#### B. Dorado Demux Input
+
+```python
+rule link_dorado_demux:
+    """
+    Create symbolic link to dorado demuxed BAM.
+    """
+    input:
+        lambda wc: samples[wc.sample]['demux_bam']
+    output:
+        os.path.join(outdir, "bam", "rebasecall", "{sample}", "{sample}.rbc.bam")
+    shell:
+        'ln -s "$(realpath {input})" "{output}"'
+```
+
+### 4. Backward Compatibility
+
+Auto-detect sample sheet format:
+
+```python
+def parse_samples(sample_file):
+    """
+    Auto-detect sample sheet format and parse accordingly.
+    - .csv extension: use parse_samples_v2()
+    - any other extension (e.g. .tsv): use legacy format
+    """
+    ext = os.path.splitext(sample_file)[1].lower()
+
+    if ext == '.csv':
+        return parse_samples_v2(sample_file)
+    else:
+        # Legacy TSV: convert to v2 structure with defaults
+        samples = parse_samples_legacy(sample_file)
+        for sample, info in samples.items():
+            info.update({
+                'input_type': 'raw_pod5', 'barcode': None,
+                'flow_cell_id': None, 'experiment_id': None, 'metadata': {},
+                'start_rule': 'merge_pods'
+            })
+        return samples
+```
+
+### 5. Helper Scripts
+
+**New scripts to implement:**
+
+1. **`workflow/scripts/filter_warpdemux_reads.py`**
+   - Parse WarpDemuX predictions CSV
+   - Filter read IDs by predicted barcode and confidence threshold
+   - Extract matching reads from pod5 files using pod5 Python API
+
+2. **`workflow/scripts/validate_sample_sheet.py`**
+   - Validate sample sheet format
+   - Check required columns
+   - Verify file/directory paths exist
+   - Check for duplicate sample_ids
+   - Validate barcode formats
+
+3. **`workflow/scripts/convert_dorado_samplesheet.py`**
+   - Convert dorado demux sample sheet to pipeline format
+   - Map aliases to sample_ids
+   - Auto-detect demuxed BAM file paths
+
+### 6. Configuration Updates
+
+Add to `config/config-base.yml`:
+
+```yaml
+# Sample sheet format version
+sample_sheet_version: 2
+
+# Demux-related options
+demux:
+  # WarpDemuX confidence threshold (0-1)
+  warpdemux_confidence: 0.99
+
+  # Skip rebasecalling for pre-basecalled inputs
+  skip_rebasecall_when_possible: true
+
+  # Dorado demux file naming pattern
+  dorado_demux_pattern: "{barcode}.bam"
+```
+
+## Implementation Plan
+
+### Phase 1: Core Infrastructure
+- [ ] Implement `parse_samples_v2()` with CSV parsing
+- [ ] Add input type handler framework
+- [ ] Implement backward compatibility layer
+- [ ] Add unit tests for sample parsing
+
+### Phase 2: Input Type Handlers
+- [ ] Implement `handle_raw_pod5()` (refactor existing code)
+- [ ] Implement `handle_merged_pod5()`
+- [ ] Implement `handle_basecalled_bam()`
+- [ ] Add conditional input functions to rules
+
+### Phase 3: Demux Support
+- [ ] Implement `handle_dorado_demux()`
+- [ ] Add `link_dorado_demux` rule
+- [ ] Create `convert_dorado_samplesheet.py` script
+- [ ] Test with dorado demux data
+
+### Phase 4: WarpDemuX Support
+- [ ] Implement `handle_warpdemux()`
+- [ ] Add `filter_warpdemux_pods` rule
+- [ ] Create `filter_warpdemux_reads.py` script
+- [ ] Test with WarpDemuX-tRNA data
+
+### Phase 5: Validation & Documentation
+- [ ] Create `validate_sample_sheet.py` script
+- [ ] Add validation to workflow startup
+- [ ] Update CLAUDE.md with new sample sheet format
+- [ ] Create example sample sheets for each input type
+- [ ] Add migration guide from TSV to CSV format
+
+## Benefits
+
+1. **Flexibility**: Supports multiple workflow entry points
+2. **Efficiency**: Skips unnecessary steps (e.g., rebasecalling when already done)
+3. **Dorado Integration**: Reads dorado demux output directly, with a converter for dorado sample sheets
+4. **WarpDemuX Support**: Handles tRNA-optimized demultiplexing
+5. **Backward Compatible**: Existing TSV files continue to work
+6. **Extensible**: Easy to add new input_type handlers
+7. **Metadata**: Captures experimental metadata for downstream analysis
+8. **Validation**: Sample sheets validated before workflow execution
+9. **Resource Efficiency**: Can process only specific barcodes from large multiplexed runs
+
+## Testing Strategy
+
+Create test sample sheets covering the new and legacy inputs:
+- `config/samples-test-raw.csv` (current behavior)
+- `config/samples-test-dorado-demux.csv`
+- `config/samples-test-warpdemux.csv`
+- `config/samples-test-mixed.csv` (multiple input types in one run)
+- `config/samples-test-legacy.tsv` (backward compatibility)
+
+## References
+
+- [Dorado Sample Sheets Documentation](https://software-docs.nanoporetech.com/dorado/latest/barcoding/sample_sheet/)
+- [WarpDemuX GitHub](https://github.com/KleistLab/WarpDemuX)
+- [WarpDemuX-tRNA Preprint](https://www.biorxiv.org/content/10.1101/2025.03.21.644602v1)