split_fastqcats is an advanced tool for processing and splitting concatenated FastQ reads, specifically designed for long-read RNA sequencing data (such as Oxford Nanopore). It is uniquely tailored to handle the complex structure of cDNA reads generated by modern single-cell and full-length RNA sequencing protocols, where multiple transcript copies can be concatenated in a single read.
- De-concatenates long-read cDNA FastQ reads that contain multiple transcript copies, using primer and/or barcode (index) sequences.
- Identifies and validates primer sequences using robust Smith-Waterman alignment, allowing for sequencing errors and configurable mismatch tolerance.
- Processes polyA tails and polyTs to ensure accurate splitting and downstream analysis.
- Splits concatenated reads into individual, high-quality segments, even when reads contain multiple full-length transcripts.
- Provides detailed processing statistics for quality control and troubleshooting.
- Supports both primer-based and barcode/index-based splitting, with flexible batch pipelines for both local and cluster (e.g., Slurm) environments.
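To illustrate why local alignment tolerates sequencing errors where exact matching does not, here is a minimal pure-Python Smith-Waterman sketch. It is not the tool's implementation (the installation notes indicate that parasail is used for alignment). The scoring values (match +1, mismatch -1, gap -1), the function name `smith_waterman`, and the normalisation by primer length are assumptions made for this example only.

```python
def smith_waterman(pattern, text, match=1, mismatch=-1, gap=-1):
    """Return (best_score, end_index_in_text) of the best local alignment.

    Scoring values are illustrative assumptions, not the tool's defaults.
    """
    prev = [0] * (len(text) + 1)
    best_score, best_end = 0, 0
    for i in range(1, len(pattern) + 1):
        curr = [0] * (len(text) + 1)
        for j in range(1, len(text) + 1):
            same = pattern[i - 1] == text[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best_score:
                best_score, best_end = curr[j], j
        prev = curr
    return best_score, best_end


primer = "AAGCAGTGGTATCAACGCAGAGTGAAT"
# Hypothetical read: the primer with one substitution (A -> T at index 18),
# preceded by four junk bases and followed by some cDNA.
mutated = primer[:18] + "T" + primer[19:]
read = "TTTT" + mutated + "ACGTACGTACGT"

score, end = smith_waterman(primer, read)
fraction = score / len(primer)  # assumed normalisation: score / primer length
print(score, end, round(fraction, 3))
```

An exact search (`primer in read`) fails on this read, while the local alignment still locates the primer and reports where it ends, which is the information needed to cut the read at that position.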
Typical read structure handled:
TSO---cDNA---polyA---UMI---revRTprimer
Example:
AAGCAGTGGTATCAACGCAGAGTGAAT---cDNA---polyA--N---UMI---GTACTCTGCGTTGATACCACTGCTT
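The relationship between the two example primers can be checked with a short reverse-complement sketch. The sequences are taken from the example above; the helper name `reverse_complement` is hypothetical and not part of the package.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


tso = "AAGCAGTGGTATCAACGCAGAGTGAAT"
rev_rt_primer = "GTACTCTGCGTTGATACCACTGCTT"

rc = reverse_complement(rev_rt_primer)
print(rc)                    # AAGCAGTGGTATCAACGCAGAGTAC
print(rc[:23] == tso[:23])   # True: both share the same 23-base core
```

The 3' sequence shown in the read structure is therefore the reverse complement of a primer that shares its first 23 bases with the TSO, which is why the orientation in which primers are supplied matters.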
While several tools exist for demultiplexing, barcode splitting, or primer trimming (such as Porechop, cutadapt, or UMI-tools), split_fastqcats offers unique capabilities for long-read RNA sequencing workflows:
- Specifically designed for long-read, concatenated cDNA data (e.g., Nanopore), not just short-read or simple amplicon data.
- De-concatenation: Accurately splits reads containing multiple full-length transcript copies (concatemers), a scenario common in long-read library preps but poorly handled by most other tools.
- Flexible, error-tolerant primer and index matching using Smith-Waterman alignment, not just exact or Hamming distance matching. This enables robust handling of real-world sequencing errors and protocol variability.
- Configurable mismatch tolerance and quality control thresholds, allowing adaptation to different protocols and data quality.
- Detailed output statistics and QC at every processing step, enabling transparency and troubleshooting.
- Batch processing pipelines (CGAT-style) for high-throughput analysis on clusters or locally, with easy integration into existing workflows.
- Supports both CLI and Python API usage, making it suitable for both pipeline automation and interactive analysis.
If your workflow involves long-read RNA sequencing with concatenated reads and you need robust, flexible, and transparent splitting based on complex primer/index structures, split_fastqcats is purpose-built for your needs.
Install with pip:
pip install split_fastqcats
Install from source:
git clone https://github.com/cribbslab/split_fastqcats.git
cd split_fastqcats
pip install -e .
Note: parasail is only available on Linux via bioconda. The environment will not resolve on macOS ARM (Apple Silicon).
Install in a conda environment:
mamba env create -f environment.yml
conda activate split_fastqcats
pip install -e .
parasail cannot be installed natively on macOS ARM. For development or testing on Apple Silicon, use the provided test mocks, or run in a Linux Docker container.
Basic usage:
# Primer Pair Split
# Note the orientation in which the primers (-fp, -rp) are supplied
split-fastqcats primer_pair_split \
-i input.fastq.gz \
--processed-output processed.fastq.gz \
--lowqual-output lowqual.fastq.gz \
--bin-output bin.fastq.gz \
--stats-output stats.csv \
-fp AAGCAGTGGTATCAACGCAGAGTGAAT \
-rp GTACTCTGCGTTGATACCACTGCTT \
--error 0.3 \
--chunk-size 1000 \
--num_workers 4 \
-v
# Barcode Split
split-fastqcats barcode_split \
-i input.fastq.gz \
--indexes AAATTTGGGCCC TTTCCCAAAGGG \
--processed-output processed.fastq.gz \
--lowqual-output lowqual.fastq.gz \
--bin-output bin.fastq.gz \
--stats-output stats.csv \
-fp AAGCAGTGGTATCAACGCAGAGT \
--error 3 \
--chunk-size 1000 \
--num_workers 4 \
-v
primer_pair_split options:
- -i, --input_file: Input FASTQ file (gzipped)
- -res, --results-dir: Output directory (default is based on input filename)
- --processed-output: Output file for correctly processed reads
- --lowqual-output: Output file for low-quality reads
- --bin-output: Output file for binned reads
- --stats-output: Output statistics file (CSV)
- -fp, --forward-primer: Forward primer sequence (default: AAGCAGTGGTATCAACGCAGAGTGAAT)
- -rp, --reverse-primer: Reverse primer sequence (default: GTACTCTGCGTTGATACCACTGCTT)
- -e, --error: Allowed mismatches, as a fraction or an absolute count (default: 0.3)
- --chunk-size: Number of reads per chunk (default: 1000)
- --num_workers: Number of parallel workers (default: 4)
- -v, --verbose: Enable detailed logging
barcode_split options:
- -i, --input_file: Input FASTQ file (gzipped)
- -res, --results-dir: Output directory (default is based on input filename)
- --processed-output: Output file for processed reads, one per index
- --lowqual-output: Output file for low-quality reads
- --bin-output: Output file for binned reads (no barcode)
- --stats-output: Output statistics file (CSV)
- -fp, --forward-primer: Forward primer sequence (default: AAGCAGTGGTATCAACGCAGAGT)
- --indexes: List of index/barcode sequences
- -e, --error: Allowed mismatches, as a fraction or an absolute count (default: 3)
- --chunk-size: Number of reads per chunk (default: 1000)
- --num_workers: Number of parallel workers (default: 4)
- -v, --verbose: Enable detailed logging
Note: The -e argument can be given either as a float between 0 and 1, indicating the fraction of bases that may mismatch (rounded down), or as an integer giving an absolute mismatch threshold. For example, "3" allows a maximum of 3 mismatches, while "0.3" with a 25 bp pattern gives 0.3 * 25 = 7.5, so 7 mismatches are allowed.
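The rule in the note can be written out as a short sketch. This is an illustration of the described behaviour, not the tool's own code: the function name `allowed_mismatches` is hypothetical, and treating a value of exactly 1 as an absolute count is an assumption, since the note does not say how that boundary is handled.

```python
import math


def allowed_mismatches(error, pattern_length):
    """Convert an -e style value into an absolute mismatch count.

    Values below 1 are read as a fraction of the pattern length (rounded
    down); values of 1 or more are read as an absolute count (assumption
    for the boundary case of exactly 1).
    """
    value = float(error)
    if value < 0:
        raise ValueError("error must be non-negative")
    if value < 1:
        return math.floor(value * pattern_length)
    return int(value)


print(allowed_mismatches(0.3, 25))  # 7
print(allowed_mismatches(3, 25))    # 3
```

With the default 27 bp forward primer and the default fraction of 0.3, the same rule gives floor(8.1) = 8 allowed mismatches.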
Primer pair split output files:
- processed.fastq.gz: Contains reads that:
  - Have valid primer pairs
  - Meet quality thresholds
  - Have correct index sequences (if specified)
- lowqual.fastq.gz: Contains reads that:
  - Have only one valid primer
  - Have mismatched indexes
  - Meet basic quality requirements but fail stricter criteria (e.g., length, polyA)
- bin.fastq.gz: Contains reads that:
  - Lack valid primers
  - Have too many primer matches (>10)
  - Fail basic quality requirements
- stats.csv: Contains processing statistics:
  - Total sequences processed
  - Number of processed reads
  - Number of low-quality reads
  - Number of binned reads
  - Full-length vs low-quality segment breakdown
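The routing of reads into the three FASTQ outputs described above can be summarised as a small decision function. This is a simplified sketch under stated assumptions, not the tool's logic: the function name, its arguments, and the choice to apply the ">10" limit to the combined number of forward and reverse primer hits are all hypothetical.

```python
def classify_read(forward_hits, reverse_hits, passes_basic_qc,
                  passes_strict_qc, max_hits=10):
    """Return 'processed', 'lowqual' or 'bin' for one read.

    Hypothetical inputs: primer hit counts plus two booleans standing in
    for the basic and stricter (e.g. length, polyA) quality checks.
    """
    total_hits = forward_hits + reverse_hits
    # No primers, too many primers, or failing basic QC: binned.
    if not passes_basic_qc or total_hits == 0 or total_hits > max_hits:
        return "bin"
    # A valid primer pair that also passes the stricter criteria.
    if forward_hits > 0 and reverse_hits > 0 and passes_strict_qc:
        return "processed"
    # Only one primer, or a pair that fails the stricter criteria.
    return "lowqual"


print(classify_read(1, 1, True, True))    # processed
print(classify_read(1, 0, True, True))    # lowqual
print(classify_read(0, 0, True, True))    # bin
```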
Barcode split output files:
- processed_index_<barcode>.fastq.gz: One output file per index containing processed reads for that barcode.
- lowqual.fastq.gz: Contains low-quality reads that fail length filters.
- bin.fastq.gz: Contains reads with no barcode hits.
- stats.csv: Contains processing statistics:
  - Total sequences processed
  - Number of processed reads
  - Number of binned reads (no barcode)
  - Total segments identified by de-concatenation
  - Breakdown of processed vs low-quality segments failing on length filter
  - Barcode Count Table: A separate table giving a count of segments for each barcode in the list.
The tool supports CGAT-style pipelines for batch processing. Place the raw fastq.gz files in a directory called data.dir inside the working directory.
# To de-concatenate and de-multiplex reads by index/barcode on a cluster (e.g. Slurm).
split-fastqcats split_by_index config # Generate pipeline.yml
split-fastqcats split_by_index make full -v5 # Run the pipeline
## To identify full length reads on a cluster (e.g. Slurm)
split-fastqcats fl_rna config # Generate pipeline.yml
split-fastqcats fl_rna make full -v5 # Run the pipeline
## To run the pipelines locally without a cluster:
split-fastqcats split_by_index make full -v5 --local
split-fastqcats fl_rna make full -v5 --local
# To de-concatenate and identify full length reads
from split_fastqcats import PrimerSplitter
splitter = PrimerSplitter(
forward_primer="AAGCAGTGGTATCAACGCAGAGTGAAT",
reverse_primer="GTACTCTGCGTTGATACCACTGCTT",
error=0.3
)
splitter.parallel_split_reads(
input_file="input.fastq.gz",
processed_output="processed.fastq.gz",
lowqual_output="lowqual.fastq.gz",
bin_output="bin.fastq.gz",
stats_output="stats.csv",
num_workers=4,
chunk_size=1000
)
# To de-concatenate and de-multiplex by barcodes
from split_fastqcats import IndexSplitter
index_dict = {
'1': 'AAATTTGGGCCC',
'2': 'TTTCCCAAAGGG'
}
splitter = IndexSplitter(
forward_primer="AAGCAGTGGTATCAACGCAGAGT",
index_dict=index_dict,
error=3
)
splitter.parallel_split_reads(
input_file="input.fastq.gz",
processed_output="processed.fastq.gz",
lowqual_output="lowqual.fastq.gz",
bin_output="bin.fastq.gz",
stats_output="stats.csv",
num_workers=4,
chunk_size=1000
)
- processed.fastq.gz: Contains reads that:
  - Have valid primer pairs
  - Meet quality thresholds
  - Have correct index sequences (if specified)
- lowqual.fastq.gz: Contains reads that:
  - Have only one valid primer
  - Have mismatched indexes
  - Meet basic quality requirements but fail stricter criteria
- bin.fastq.gz: Contains reads that:
  - Lack valid primers
  - Have too many primer matches (>10)
  - Fail basic quality requirements
- stats.csv: Contains processing statistics:
  - Total sequences processed
  - Number of processed reads
  - Number of low-quality reads
  - Number of binned reads
The primer-pair de-concatenation tool, used to identify full-length mRNA, applies several QC measures:
- Smith-Waterman alignment for primer detection
- Minimum 70% match score required for primer identification (using default parameters)
- Minimum sequence length of 300 bp and maximum length of 50,000 bp
- Checking for polyA tails at the ends of the sequences
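The length and polyA checks listed above can be sketched as follows. Only the 300 bp and 50,000 bp limits come from this document; the 20-base window, the 80% homopolymer fraction, and both function names are assumptions chosen for illustration and may differ from what the tool actually does.

```python
def passes_length_filter(segment, min_length=300, max_length=50_000):
    """Length limits as stated in the QC list above."""
    return min_length <= len(segment) <= max_length


def has_poly_a_signal(segment, window=20, min_fraction=0.8):
    """Look for a polyA run at the 3' end or a polyT run at the 5' end.

    window and min_fraction are hypothetical parameters for this sketch.
    """
    if not segment:
        return False
    tail = segment[-window:]
    head = segment[:window]
    tail_a = tail.count("A") / len(tail)
    head_t = head.count("T") / len(head)
    return tail_a >= min_fraction or head_t >= min_fraction


# Hypothetical segments for demonstration.
forward_like = "ACGT" * 100 + "A" * 25
reverse_like = "T" * 25 + "ACGT" * 100
no_tail = "ACGT" * 100

print(passes_length_filter(forward_like), has_poly_a_signal(forward_like))
print(passes_length_filter(no_tail), has_poly_a_signal(no_tail))
```

Checking both ends reflects that a read may be sequenced from either strand, so the tail can appear as polyA at the 3' end or as polyT at the 5' end.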
The barcode matching de-concatenating tool uses:
- Index validation with 80% minimum match score (using default parameters)
- Position validation for index sequences - i.e. they should flank the primers
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.
If you use this tool in your research, please cite:
@software{split_fastqcats2024,
author = {Cribbs, Adam},
title = {split_fastqcats: A tool for processing concatenated FastQ reads},
year = {2024},
publisher = {GitHub},
url = {https://github.com/cribbslab/split_fastqcats}
}