Containing code for splitting out concatenated fastq files

cribbslab/split_fastqcats

split_fastqcats

split_fastqcats is an advanced tool for processing and splitting concatenated FastQ reads, specifically designed for long-read RNA sequencing data (such as Oxford Nanopore). It is uniquely tailored to handle the complex structure of cDNA reads generated by modern single-cell and full-length RNA sequencing protocols, where multiple transcript copies can be concatenated in a single read.

What does split_fastqcats do?

  • De-concatenates long-read cDNA FastQ reads that contain multiple transcript copies, using primer and/or barcode (index) sequences.
  • Identifies and validates primer sequences using robust Smith-Waterman alignment, allowing for sequencing errors and configurable mismatch tolerance.
  • Processes polyA tails and polyTs to ensure accurate splitting and downstream analysis.
  • Splits concatenated reads into individual, high-quality segments, even when reads contain multiple full-length transcripts.
  • Provides detailed processing statistics for quality control and troubleshooting.
  • Supports both primer-based and barcode/index-based splitting, with flexible batch pipelines for both local and cluster (e.g., Slurm) environments.

Typical read structure handled:

TSO---cDNA---polyA---UMI---revRTprimer

Example:

AAGCAGTGGTATCAACGCAGAGTGAAT---cDNA---polyA--N---UMI---GTACTCTGCGTTGATACCACTGCTT
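To make the read structure above concrete, here is a minimal sketch of de-concatenation on a toy read. It splits at every exact occurrence of the forward (TSO) primer, which is a deliberate simplification: split_fastqcats itself uses error-tolerant alignment, and the function name `split_concatemer` and the toy cDNA/UMI sequences are assumptions made for illustration only.

```python
from typing import List


def split_concatemer(read: str, forward_primer: str) -> List[str]:
    """Split a read at every exact occurrence of the forward primer."""
    starts = []
    pos = read.find(forward_primer)
    while pos != -1:
        starts.append(pos)
        pos = read.find(forward_primer, pos + 1)
    # Each segment runs from one primer start to the next (or to the read end).
    ends = starts[1:] + [len(read)]
    return [read[s:e] for s, e in zip(starts, ends)]


TSO = "AAGCAGTGGTATCAACGCAGAGTGAAT"      # forward primer from the example
REV_RT = "GTACTCTGCGTTGATACCACTGCTT"     # reverse primer from the example

# Two toy transcript copies: TSO + cDNA + polyA + UMI + reverse primer
# (the cDNA and UMI sequences are made up).
copy_a = TSO + "ACGTACGTAC" + "A" * 12 + "CCGGTTAA" + REV_RT
copy_b = TSO + "TTGGCCAATT" + "A" * 12 + "GGAACCTT" + REV_RT

segments = split_concatemer(copy_a + copy_b, TSO)
print(len(segments))  # number of transcript copies recovered
```

Exact matching would miss primers that carry sequencing errors, which is why an alignment-based search is the more realistic choice for Nanopore data.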

How is split_fastqcats different from other tools?

While several tools exist for demultiplexing, barcode splitting, or primer trimming (such as Porechop, cutadapt, or UMI-tools), split_fastqcats offers unique capabilities for long-read RNA sequencing workflows:

  • Specifically designed for long-read, concatenated cDNA data (e.g., Nanopore), not just short-read or simple amplicon data.
  • De-concatenation: Accurately splits reads containing multiple full-length transcript copies (concatemers), a scenario common in long-read library preps but poorly handled by most other tools.
  • Flexible, error-tolerant primer and index matching using Smith-Waterman alignment, not just exact or Hamming distance matching. This enables robust handling of real-world sequencing errors and protocol variability.
  • Configurable mismatch tolerance and quality control thresholds, allowing adaptation to different protocols and data quality.
  • Detailed output statistics and QC at every processing step, enabling transparency and troubleshooting.
  • Batch processing pipelines (CGAT-style) for high-throughput analysis on clusters or locally, with easy integration into existing workflows.
  • Supports both CLI and Python API usage, making it suitable for both pipeline automation and interactive analysis.

If your workflow involves long-read RNA sequencing with concatenated reads and you need robust, flexible, and transparent splitting based on complex primer/index structures, split_fastqcats is purpose-built for your needs.

Installation

Using pip

pip install split_fastqcats

From source

git clone https://github.com/cribbslab/split_fastqcats.git
cd split_fastqcats
pip install -e .

Using conda/mamba (Linux)

Note: parasail is only available on Linux via bioconda. The environment will not resolve on macOS ARM (Apple Silicon).

mamba env create -f environment.yml
conda activate split_fastqcats
pip install -e .

Troubleshooting for Apple Silicon/macOS users

  • parasail cannot be installed natively on macOS ARM. For development or testing on Apple Silicon, use the provided test mocks, or run in a Linux Docker container.

Usage

Command Line Interface

Basic usage:

# Primer Pair Split (note the primer input orientation for -rp)
split-fastqcats primer_pair_split \
    -i input.fastq.gz \
    --processed-output processed.fastq.gz \
    --lowqual-output lowqual.fastq.gz \
    --bin-output bin.fastq.gz \
    --stats-output stats.csv \
    -fp AAGCAGTGGTATCAACGCAGAGTGAAT \
    -rp GTACTCTGCGTTGATACCACTGCTT \
    --error 0.3 \
    --chunk-size 1000 \
    --num_workers 4 \
    -v

# Barcode Split
split-fastqcats barcode_split \
    -i input.fastq.gz \
    --indexes AAATTTGGGCCC TTTCCCAAAGGG \
    --processed-output processed.fastq.gz \
    --lowqual-output lowqual.fastq.gz \
    --bin-output bin.fastq.gz \
    --stats-output stats.csv \
    -fp AAGCAGTGGTATCAACGCAGAGT \
    --error 3 \
    --chunk-size 1000 \
    --num_workers 4 \
    -v
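The orientation of the reverse primer in the primer pair example can be checked by hand. The sketch below reverse-complements the example -rp value; the result shares its first 23 bases with the forward primer, which suggests that -rp is supplied as it appears at the 3' end of the read. That reading is an inference from the example sequences, not a documented statement, and `reverse_complement` is a helper written for this illustration.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


forward_primer = "AAGCAGTGGTATCAACGCAGAGTGAAT"  # -fp from the example
reverse_primer = "GTACTCTGCGTTGATACCACTGCTT"    # -rp from the example

rc = reverse_complement(reverse_primer)
print(rc)                                 # AAGCAGTGGTATCAACGCAGAGTAC
print(forward_primer.startswith(rc[:23]))  # True
```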

Arguments

Primer Pair Split

  • -i, --input_file: Input FASTQ file (gzipped)
  • -res, --results-dir: Output directory (default is based on input filename)
  • --processed-output: Output file for correctly processed reads
  • --lowqual-output: Output file for low-quality reads
  • --bin-output: Output file for binned reads
  • --stats-output: Output statistics file (CSV)
  • -fp, --forward-primer: Forward primer sequence (default: AAGCAGTGGTATCAACGCAGAGTGAAT)
  • -rp, --reverse-primer: Reverse primer sequence (default: GTACTCTGCGTTGATACCACTGCTT)
  • -e, --error: Allowed mismatches, given as a fraction of the pattern length or as an absolute count (default: 0.3)
  • --chunk-size: Number of reads per chunk (default: 1000)
  • --num_workers: Number of parallel workers (default: 4)
  • -v, --verbose: Enable detailed logging

Barcode/Index Split

  • -i, --input_file: Input FASTQ file (gzipped)
  • -res, --results-dir: Output directory (default is based on input filename)
  • --processed-output: Output file for processed reads, one per index
  • --lowqual-output: Output file for low-quality reads
  • --bin-output: Output file for binned reads (no barcode)
  • --stats-output: Output statistics file (CSV)
  • -fp, --forward-primer: Forward primer sequence (default: AAGCAGTGGTATCAACGCAGAGT)
  • --indexes: List of index/barcode sequences
  • -e, --error: Allowed mismatches, given as a fraction of the pattern length or as an absolute count (default: 3)
  • --chunk-size: Number of reads per chunk (default: 1000)
  • --num_workers: Number of parallel workers (default: 4)
  • -v, --verbose: Enable detailed logging

Note: The -e argument can be given either as a float between 0 and 1, the fraction of pattern bases allowed to mismatch (rounded down), or as an integer, an absolute mismatch threshold. For example, "3" allows at most 3 mismatches, while "0.3" on a 25 bp pattern gives 0.3 * 25 = 7.5, so 7 mismatches are allowed.
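The rule in the note can be written down as a small function. This is a sketch of the arithmetic only: the name `allowed_mismatches` is hypothetical, and the way the tool itself tells fractions from counts (here, by Python type) is an assumption.

```python
import math


def allowed_mismatches(error, pattern_length: int) -> int:
    """Translate an -e style value into an absolute mismatch count."""
    if isinstance(error, int):
        # Integer: absolute threshold, independent of pattern length.
        return error
    if 0 <= error <= 1:
        # Float in [0, 1]: fraction of the pattern length, rounded down.
        return math.floor(error * pattern_length)
    raise ValueError("error must be an integer or a float between 0 and 1")


print(allowed_mismatches(0.3, 25))  # 7
print(allowed_mismatches(3, 23))    # 3
```

With the default of 0.3, at most 30% of primer bases may mismatch, which corresponds to the 70% minimum match score listed under Quality Control Parameters.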


Outputs

Primer Pair Split Outputs

  1. processed.fastq.gz: Contains reads that:

    • Have valid primer pairs
    • Meet quality thresholds
    • Have correct index sequences (if specified)
  2. lowqual.fastq.gz: Contains reads that:

    • Have only one valid primer
    • Have mismatched indexes
    • Meet basic quality requirements but fail stricter criteria (e.g., length, polyA)
  3. bin.fastq.gz: Contains reads that:

    • Lack valid primers
    • Have too many primer matches (>10)
    • Fail basic quality requirements
  4. stats.csv: Contains processing statistics:

    • Total sequences processed
    • Number of processed reads
    • Number of low-quality reads
    • Number of binned reads
    • Full-length vs low-quality segment breakdown
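The routing rules listed above can be summarised as a decision function. This is a sketch of the described behaviour, not the tool's code: `route_read`, its parameters, and the order of the checks are assumptions, and index mismatches are left out for brevity.

```python
def route_read(forward_ok: bool, reverse_ok: bool, primer_hits: int,
               passes_basic_qc: bool, passes_strict_qc: bool) -> str:
    """Return the output bucket for a read: 'processed', 'lowqual' or 'bin'."""
    # Binned: no valid primer, too many primer matches, or basic QC failure.
    if primer_hits > 10 or not passes_basic_qc or not (forward_ok or reverse_ok):
        return "bin"
    # Low quality: only one valid primer, or stricter criteria
    # (e.g. length, polyA) not met.
    if not (forward_ok and reverse_ok) or not passes_strict_qc:
        return "lowqual"
    # Processed: valid primer pair and all thresholds met.
    return "processed"


print(route_read(True, True, 2, True, True))    # processed
print(route_read(True, False, 1, True, True))   # lowqual
print(route_read(False, False, 0, True, True))  # bin
```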

Barcode/Index Split Outputs

  1. processed_index_<barcode>.fastq.gz: One output file per index containing processed reads for that barcode.

  2. lowqual.fastq.gz: Contains low-quality reads that fail length filters.

  3. bin.fastq.gz: Contains reads with no barcode hits.

  4. stats.csv: Contains processing statistics:

    • Total sequences processed
    • Number of processed reads
    • Number of binned reads (no barcode)
    • Total segments identified by de-concatenation
    • Breakdown of processed vs low-quality segments failing on length filter
    • Barcode Count Table: A separate table giving a count of segments for each barcode in the list.

The tool supports CGAT-style pipelines for batch processing. Place the raw fastq.gz files in a directory called data.dir inside the working directory.

# To de-concatenate and de-multiplex reads by index/barcode on a cluster (e.g. Slurm)
split-fastqcats split_by_index config         # Generate pipeline.yml
split-fastqcats split_by_index make full -v5  # Run the pipeline

# To identify full-length reads on a cluster (e.g. Slurm)
split-fastqcats fl_rna config         # Generate pipeline.yml
split-fastqcats fl_rna make full -v5  # Run the pipeline

# To run the pipelines locally without a cluster
split-fastqcats split_by_index make full -v5 --local
split-fastqcats fl_rna make full -v5 --local

Outputs for each sample are written to merged_results.dir.

Python API

# To de-concatenate and identify full length reads
from split_fastqcats import PrimerSplitter

splitter = PrimerSplitter(
    forward_primer="AAGCAGTGGTATCAACGCAGAGTGAAT",
    reverse_primer="GTACTCTGCGTTGATACCACTGCTT",
    error=0.3
)

splitter.parallel_split_reads(
    input_file="input.fastq.gz",
    processed_output="processed.fastq.gz",
    lowqual_output="lowqual.fastq.gz",
    bin_output="bin.fastq.gz",
    stats_output="stats.csv",
    num_workers=4,
    chunk_size=1000
)

# To de-concatenate and de-multiplex by barcodes
from split_fastqcats import IndexSplitter

index_dict = {
    '1': 'AAATTTGGGCCC',
    '2': 'TTTCCCAAAGGG'
}

splitter = IndexSplitter(
    forward_primer="AAGCAGTGGTATCAACGCAGAGT",
    index_dict=index_dict,
    error=3
)

splitter.parallel_split_reads(
    input_file="input.fastq.gz",
    processed_output="processed.fastq.gz",
    lowqual_output="lowqual.fastq.gz",
    bin_output="bin.fastq.gz",
    stats_output="stats.csv",
    num_workers=4,
    chunk_size=1000
)
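After a run, a quick sanity check is to count the records in each output file. The sketch below uses only the standard library and assumes standard four-line FASTQ records; `count_fastq_records` is a helper written for this example, not part of the split_fastqcats API. It demonstrates itself on a tiny temporary file so that it runs without any real data.

```python
import gzip
import os
import tempfile


def count_fastq_records(path: str) -> int:
    """Count records in a gzipped FASTQ file, assuming 4 lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


# Demo on a two-record file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.fastq.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
    demo_count = count_fastq_records(demo_path)

print(demo_count)  # 2
```

The same function can be pointed at processed.fastq.gz, lowqual.fastq.gz and bin.fastq.gz and the totals compared with the figures reported in stats.csv.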

Quality Control Parameters

The primer-pair de-concatenation tool, used to identify full-length mRNA, applies several QC measures:

  • Smith-Waterman alignment for primer detection
  • Minimum 70% match score required for primer identification (using default parameters)
  • Minimum sequence length 300 bp, maximum length 50,000 bp
  • Checking for polyA tails at the ends of the sequences

The barcode-based de-concatenation tool uses:

  • Index validation with 80% minimum match score (using default parameters)
  • Position validation for index sequences, i.e. they should flank the primers
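To illustrate what a minimum match score means for primer detection, here is a minimal pure-Python Smith-Waterman sketch. It is not the tool's implementation: the scoring scheme (match +1, mismatch -1, gap -1), the function names, and the choice to normalise the score by primer length are all assumptions made for illustration.

```python
def smith_waterman_score(pattern: str, text: str,
                         match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between pattern and text (linear gap cost)."""
    best = 0
    prev = [0] * (len(text) + 1)
    for i in range(1, len(pattern) + 1):
        curr = [0] * (len(text) + 1)
        for j in range(1, len(text) + 1):
            diag = prev[j - 1] + (match if pattern[i - 1] == text[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def primer_found(primer: str, read: str, min_fraction: float = 0.7) -> bool:
    """True if the local alignment score reaches min_fraction of a perfect match."""
    return smith_waterman_score(primer, read) >= min_fraction * len(primer)


TSO = "AAGCAGTGGTATCAACGCAGAGTGAAT"

# Introduce two substitutions to mimic sequencing errors.
noisy = list(TSO)
noisy[8] = "C"
noisy[20] = "T"
read_with_errors = "CCCC" + "".join(noisy) + "ACGTACGT"

print(primer_found(TSO, read_with_errors))  # primer with two errors
print(primer_found(TSO, "T" * 60))          # unrelated sequence
```

Local alignment tolerates substitutions and small indels, so a primer carrying a few errors still clears the threshold while unrelated sequence does not.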

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this tool in your research, please cite:

@software{split_fastqcats2024,
  author = {Cribbs, Adam},
  title = {split_fastqcats: A tool for processing concatenated FastQ reads},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/cribbslab/split_fastqcats}
}
