split_fastqcats

split_fastqcats is an advanced tool for processing and splitting concatenated FastQ reads, specifically designed for long-read RNA sequencing data (such as Oxford Nanopore). It is uniquely tailored to handle the complex structure of cDNA reads generated by modern single-cell and full-length RNA sequencing protocols, where multiple transcript copies can be concatenated in a single read.

What does split_fastqcats do?

De-concatenates long-read cDNA FastQ reads that contain multiple transcript copies, using primer and/or barcode (index) sequences.
Identifies and validates primer sequences using robust Smith-Waterman alignment, allowing for sequencing errors and configurable mismatch tolerance.
Processes polyA tails and polyTs to ensure accurate splitting and downstream analysis.
Splits concatenated reads into individual, high-quality segments, even when reads contain multiple full-length transcripts.
Provides detailed processing statistics for quality control and troubleshooting.
Supports both primer-based and barcode/index-based splitting, with flexible batch pipelines for both local and cluster (e.g., Slurm) environments.

Typical read structure handled:

TSO---cDNA---polyA---UMI---revRTprimer

Example:

AAGCAGTGGTATCAACGCAGAGTGAAT---cDNA---polyA--N---UMI---GTACTCTGCGTTGATACCACTGCTT
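The orientation of the two example primers can be checked with a few lines of Python. This is a minimal sketch and not part of split_fastqcats: `reverse_complement` is a hypothetical helper introduced only for illustration.

```python
# Hypothetical helper, not part of the split_fastqcats API.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# Sequence found at the 3' end of the example read above.
read_3prime = "GTACTCTGCGTTGATACCACTGCTT"

# The same site as seen from the opposite strand.
print(reverse_complement(read_3prime))  # AAGCAGTGGTATCAACGCAGAGTAC
```

The reverse complement shares its first 23 bases (AAGCAGTGGTATCAACGCAGAGT) with the forward primer, so the two primer sites carry a common core on opposite strands. This is why the orientation in which each primer is supplied matters.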

How is split_fastqcats different from other tools?

While several tools exist for demultiplexing, barcode splitting, or primer trimming (such as Porechop, cutadapt, or UMI-tools), split_fastqcats offers unique capabilities for long-read RNA sequencing workflows:

Specifically designed for long-read, concatenated cDNA data (e.g., Nanopore), not just short-read or simple amplicon data.
De-concatenation: Accurately splits reads containing multiple full-length transcript copies (concatemers), a scenario common in long-read library preps but poorly handled by most other tools.
Flexible, error-tolerant primer and index matching using Smith-Waterman alignment, not just exact or Hamming distance matching. This enables robust handling of real-world sequencing errors and protocol variability.
Configurable mismatch tolerance and quality control thresholds, allowing adaptation to different protocols and data quality.
Detailed output statistics and QC at every processing step, enabling transparency and troubleshooting.
Batch processing pipelines (CGAT-style) for high-throughput analysis on clusters or locally, with easy integration into existing workflows.
Supports both CLI and Python API usage, making it suitable for both pipeline automation and interactive analysis.

If your workflow involves long-read RNA sequencing with concatenated reads and you need robust, flexible, and transparent splitting based on complex primer/index structures, split_fastqcats is purpose-built for your needs.

Installation

Using pip

pip install split_fastqcats

From source

git clone https://github.com/cribbslab/split_fastqcats.git
cd split_fastqcats
pip install -e .

Using conda/mamba (Linux)

Note: parasail is only available on Linux via bioconda. The environment will not resolve on macOS ARM (Apple Silicon).

mamba env create -f environment.yml
conda activate split_fastqcats
pip install -e .

Troubleshooting for Apple Silicon/macOS users

parasail cannot be installed natively on macOS ARM. For development or testing on Apple Silicon, use the provided test mocks, or run in a Linux Docker container.

Usage

Command Line Interface

Basic usage:

# Primer Pair Split
# Note the primer input orientation: both primers are given as they appear in the read structure above.
split-fastqcats primer_pair_split \
    -i input.fastq.gz \
    --processed-output processed.fastq.gz \
    --lowqual-output lowqual.fastq.gz \
    --bin-output bin.fastq.gz \
    --stats-output stats.csv \
    -fp AAGCAGTGGTATCAACGCAGAGTGAAT \
    -rp GTACTCTGCGTTGATACCACTGCTT \
    --error 0.3 \
    --chunk-size 1000 \
    --num_workers 4 \
    -v

# Barcode Split
split-fastqcats barcode_split \
    -i input.fastq.gz \
    --indexes AAATTTGGGCCC TTTCCCAAAGGG \
    --processed-output processed.fastq.gz \
    --lowqual-output lowqual.fastq.gz \
    --bin-output bin.fastq.gz \
    --stats-output stats.csv \
    -fp AAGCAGTGGTATCAACGCAGAGT \
    --error 3 \
    --chunk-size 1000 \
    --num_workers 4 \
    -v

Arguments

Primer Pair Split

-i, --input_file: Input FASTQ file (gzipped)
-res, --results-dir: Output directory (default is based on input filename)
--processed-output: Output file for correctly processed reads
--lowqual-output: Output file for low-quality reads
--bin-output: Output file for binned reads
--stats-output: Output statistics file (CSV)
-fp, --forward-primer: Forward primer sequence (default: AAGCAGTGGTATCAACGCAGAGTGAAT)
-rp, --reverse-primer: Reverse primer sequence (default: GTACTCTGCGTTGATACCACTGCTT)
-e, --error: Allowed mismatches, given as a fraction of the primer length or as an absolute count (default: 0.3)
--chunk-size: Number of reads per chunk (default: 1000)
--num_workers: Number of parallel workers (default: 4)
-v, --verbose: Enable detailed logging

Barcode/Index Split

-i, --input_file: Input FASTQ file (gzipped)
-res, --results-dir: Output directory (default is based on input filename)
--processed-output: Output file for processed reads, one per index
--lowqual-output: Output file for low-quality reads
--bin-output: Output file for binned reads (no barcode)
--stats-output: Output statistics file (CSV)
-fp, --forward-primer: Forward primer sequence (default: AAGCAGTGGTATCAACGCAGAGT)
--indexes: List of index/barcode sequences
-e, --error: Number of allowed mismatches (default: 3)
--chunk-size: Number of reads per chunk (default: 1000)
--num_workers: Number of parallel workers (default: 4)
-v, --verbose: Enable detailed logging

Note: The -e argument can be given either as a float between 0 and 1, indicating the fraction of bases that may mismatch (rounded down), or as an integer giving an absolute mismatch threshold. For example, "3" allows a maximum of 3 mismatches, while "0.3" on a 25 bp pattern gives 0.3 * 25 = 7.5, so 7 mismatches are allowed.
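The rule in the note above can be written out as a short Python sketch. The function name `allowed_mismatches` is hypothetical and the code only mirrors the description given here. It is not taken from the split_fastqcats source, and how the tool treats boundary values such as 0 or 1 is not covered by the note.

```python
import math


def allowed_mismatches(error: str, pattern_length: int) -> int:
    """Hypothetical helper: turn an -e value into an absolute mismatch count.

    Values strictly between 0 and 1 are read as a fraction of the pattern
    length and rounded down; anything else is read as an absolute count.
    """
    value = float(error)
    if 0 < value < 1:
        return math.floor(value * pattern_length)
    return int(value)


print(allowed_mismatches("0.3", 25))  # 7
print(allowed_mismatches("3", 25))    # 3
```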

Outputs

Primer Pair Split Outputs

processed.fastq.gz: Contains reads that:
- Have valid primer pairs
- Meet quality thresholds
- Have correct index sequences (if specified)
lowqual.fastq.gz: Contains reads that:
- Have only one valid primer
- Have mismatched indexes
- Meet basic quality requirements but fail stricter criteria (e.g., length, polyA)
bin.fastq.gz: Contains reads that:
- Lack valid primers
- Have too many primer matches (>10)
- Fail basic quality requirements
stats.csv: Contains processing statistics:
- Total sequences processed
- Number of processed reads
- Number of low-quality reads
- Number of binned reads
- Full-length vs low-quality segment breakdown
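As a quick sanity check, the read counts reported in stats.csv can be compared against the number of records in each output file. The sketch below assumes standard four-line FASTQ records. `count_fastq_records` is a hypothetical helper, not part of split_fastqcats.

```python
import gzip


def count_fastq_records(path: str) -> int:
    """Count records in a gzipped FASTQ file, assuming four lines per record."""
    with gzip.open(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    return n_lines // 4


# Example usage with the output names from the command above:
# for name in ("processed.fastq.gz", "lowqual.fastq.gz", "bin.fastq.gz"):
#     print(name, count_fastq_records(name))
```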

Barcode/Index Split Outputs

processed_index_<barcode>.fastq.gz: One output file per index containing processed reads for that barcode.
lowqual.fastq.gz: Contains low-quality reads that fail length filters.
bin.fastq.gz: Contains reads with no barcode hits.
stats.csv: Contains processing statistics:
- Total sequences processed
- Number of processed reads
- Number of binned reads (no barcode)
- Total segments identified by de-concatenation
- Breakdown of processed vs low-quality segments failing on length filter
- Barcode Count Table: A separate table giving a count of segments for each barcode in the list.
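The per-index files can be gathered for downstream steps with a small Python sketch. It relies only on the processed_index_<barcode>.fastq.gz naming pattern listed above. The directory layout and the helper name `barcode_files` are assumptions made for illustration.

```python
import re
from pathlib import Path

# Matches the naming pattern described above; the captured part is whatever
# label appears in place of <barcode>.
INDEX_FILE = re.compile(r"^processed_index_(?P<barcode>.+)\.fastq\.gz$")


def barcode_files(results_dir: str) -> dict:
    """Hypothetical helper: map each barcode label to its output file."""
    found = {}
    for path in sorted(Path(results_dir).iterdir()):
        match = INDEX_FILE.match(path.name)
        if match:
            found[match.group("barcode")] = path
    return found


# Example usage (directory name is an assumption):
# for barcode, path in barcode_files("results").items():
#     print(barcode, path)
```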

The tool supports CGAT-style pipelines for batch processing. Place the raw fastq.gz files in a directory called data.dir inside the working directory.

# To de-concatenate and de-multiplex reads by index/barcode on a cluster (e.g. Slurm):
split-fastqcats split_by_index config         # Generate pipeline.yml
split-fastqcats split_by_index make full -v5  # Run the pipeline

# To identify full-length reads on a cluster (e.g. Slurm):
split-fastqcats fl_rna config         # Generate pipeline.yml
split-fastqcats fl_rna make full -v5  # Run the pipeline

# To run the pipelines locally without a cluster:
split-fastqcats split_by_index make full -v5 --local
split-fastqcats fl_rna make full -v5 --local

Outputs for each sample are written to merged_results.dir.

Python API

# To de-concatenate and identify full length reads
from split_fastqcats import PrimerSplitter

splitter = PrimerSplitter(
    forward_primer="AAGCAGTGGTATCAACGCAGAGTGAAT",
    reverse_primer="GTACTCTGCGTTGATACCACTGCTT",
    error=0.3
)

splitter.parallel_split_reads(
    input_file="input.fastq.gz",
    processed_output="processed.fastq.gz",
    lowqual_output="lowqual.fastq.gz",
    bin_output="bin.fastq.gz",
    stats_output="stats.csv",
    num_workers=4,
    chunk_size=1000
)

# To de-concatenate and de-multiplex by barcodes
from split_fastqcats import IndexSplitter

index_dict = {
    '1': 'AAATTTGGGCCC',
    '2': 'TTTCCCAAAGGG'
}

splitter = IndexSplitter(
    forward_primer="AAGCAGTGGTATCAACGCAGAGT",
    index_dict=index_dict,
    error=3
)

splitter.parallel_split_reads(
    input_file="input.fastq.gz",
    processed_output="processed.fastq.gz",
    lowqual_output="lowqual.fastq.gz",
    bin_output="bin.fastq.gz",
    stats_output="stats.csv",
    num_workers=4,
    chunk_size=1000
)

Output Files

processed.fastq.gz: Contains reads that:
- Have valid primer pairs
- Meet quality thresholds
- Have correct index sequences (if specified)
lowqual.fastq.gz: Contains reads that:
- Have only one valid primer
- Have mismatched indexes
- Meet basic quality requirements but fail stricter criteria
bin.fastq.gz: Contains reads that:
- Lack valid primers
- Have too many primer matches (>10)
- Fail basic quality requirements
stats.csv: Contains processing statistics:
- Total sequences processed
- Number of processed reads
- Number of low-quality reads
- Number of binned reads

Quality Control Parameters

The primer-pair de-concatenation tool, used to identify full-length mRNA, implements several QC measures:

Smith-Waterman alignment for primer detection
Minimum 70% match score required for primer identification (using default parameters)
Minimum sequence length 300 bp, maximum length 50,000 bp
Checking for polyA tails at the ends of the sequences
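The first two checks can be illustrated with a pure-Python Smith-Waterman sketch. The tool itself depends on parasail (see the installation notes), so this is not its implementation. The scoring scheme (match +1, mismatch -1, gap -1), the reading of "match score" as alignment score divided by primer length, and the function names are all assumptions.

```python
def local_alignment_score(pattern, text, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment with a linear gap penalty.

    Returns (best_score, end), where end is the 1-based position in `text`
    at which the best-scoring local alignment finishes.
    """
    prev = [0] * (len(text) + 1)
    best, best_end = 0, 0
    for i in range(1, len(pattern) + 1):
        curr = [0] * (len(text) + 1)
        for j in range(1, len(text) + 1):
            step = match if pattern[i - 1] == text[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment
                prev[j - 1] + step, # match or mismatch
                prev[j] + gap,      # gap in the text
                curr[j - 1] + gap,  # gap in the pattern
            )
            if curr[j] > best:
                best, best_end = curr[j], j
        prev = curr
    return best, best_end


def has_primer(read, primer, min_fraction=0.7):
    """Hypothetical check: accept a primer hit if score >= 70% of its length."""
    score, end = local_alignment_score(primer, read)
    return score >= min_fraction * len(primer), end


primer = "AAGCAGTGGTATCAACGCAGAGTGAAT"
# One substitution inside the primer still passes the 70% threshold.
noisy = primer[:13] + "T" + primer[14:]
print(has_primer("CCCC" + noisy + "CCCC", primer)[0])  # True
```

A production implementation would use a vectorised aligner rather than this quadratic Python loop, which is only meant to make the thresholding step concrete.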

The barcode-matching de-concatenation tool uses:

Index validation with 80% minimum match score (using default parameters)
Position validation for index sequences, i.e. they should flank the primers

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use this tool in your research, please cite:

@software{split_fastqcats2024,
  author = {Cribbs, Adam},
  title = {split_fastqcats: A tool for processing concatenated FastQ reads},
  year = {2024},
  publisher = {GitHub},
  url = {https://github.com/cribbslab/split_fastqcats}
}
