SpliceScape is a bioinformatics pipeline designed for the large-scale identification and characterization of splicing events from RNA-Seq data. Built using the Nextflow workflow orchestrator, it provides an efficient, reproducible, and scalable solution for generating comprehensive splicing landscapes of an organism.
The pipeline automates all critical steps of splicing analysis, from raw data processing to the final characterization of events. It integrates state-of-the-art tools to ensure high accuracy and performance, including:
- Data Cleaning: BBDuK
- Splicing-Aware Alignment: STAR
- Splicing Event Identification and Quantification: MAJIQ and SGSeq
This approach allows SpliceScape to handle large public datasets, making it a powerful tool for comparative transcriptomics and for exploring the dynamics of RNA processing.
📊 The diagram below provides an overview of the SpliceScape workflow, from the initial input files to the final results.
- Pre-processing: Metadata acquisition and filtering.
- Core Pipeline: Downloading reads, quality control (BBDuK), genome indexing (STAR), mapping (STAR), and splicing analysis (MAJIQ & SGSeq).
- Post-processing: Parsing and integrating results into a unified database.
SpliceScape has several dependencies. The main ones, listed below, must be installed before running the pipeline. We recommend installing them in a conda environment.
Core Software & Tools
- Nextflow (v24.10.5 or later)
- STAR
- BBMap (provides BBDuK v35.85)
- Samtools
- MAJIQ
- ffq (from pip)
- wget & md5sum (Standard Linux/macOS utilities)
Programming Environments
- Python (v3.7 or higher)
- R (v4.4)
Required Python Libraries
- requests (v2.28.1)
- beautifulsoup4 (v0.0.1)
- biopython (v1.79)
- pandas (v1.4.0)
Required R Libraries
- optparse (v1.7.5)
- SGSeq (v1.38.0)
- GenomicFeatures (v1.56.0)
SpliceScape requires three main inputs to start an analysis:
- Reference Genome: An uncompressed reference genome file in .fasta format, preferably downloaded from Phytozome.
- Genome Annotation: An uncompressed genome annotation file in .gff3 format, also from Phytozome.
- Sample List: A plain text (.txt) file containing the SRA accessions for your target samples, with one identifier per line.
For detailed instructions on how to obtain and format these files, please refer to the SpliceScape Wiki.
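Before launching a long run, it can help to sanity-check the sample list. The snippet below is a minimal sketch, not part of SpliceScape: it assumes the list should contain run-level accessions (SRR, ERR or DRR followed by digits), and the function name `check_sample_list` is hypothetical.

```python
import re

# Assumption: the pipeline expects run-level accessions (SRR/ERR/DRR + digits).
RUN_ACCESSION = re.compile(r"^[SED]RR\d+$")


def check_sample_list(lines):
    """Split an iterable of lines into (valid, invalid) accessions.

    Blank lines are skipped; surrounding whitespace is ignored.
    """
    valid, invalid = [], []
    for raw in lines:
        accession = raw.strip()
        if not accession:
            continue
        if RUN_ACCESSION.match(accession):
            valid.append(accession)
        else:
            invalid.append(accession)
    return valid, invalid
```

To use it, pass an open file handle, for example `check_sample_list(open("SRRs.txt"))`, and inspect the second returned list for entries that need fixing.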
git clone https://github.com/labbces/SpliceScape.git
cd SpliceScape
SpliceScape is designed to work with the standard file formats and structures provided by the Phytozome database. For the pipeline to locate the genome and annotation files correctly, you must follow a specific directory structure.
The three main files required for each species are:
- A reference genome in FASTA format (.fa), from Phytozome.
- A genome annotation file in GFF3 format (.gff3), from Phytozome.
- A plain text file (.txt) listing the target SRA accessions, with one per line.
A critical requirement is that both genome files must be uncompressed before running the pipeline. SpliceScape does not handle gzipped files (e.g., .fa.gz) for the genome and annotation inputs.
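A file can still be gzip-compressed even if its extension was renamed, so checking the content is more reliable than checking the name. The following is a small illustrative sketch (the helper names are hypothetical, not SpliceScape functions) that looks for the two-byte gzip signature at the start of each input file.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip signature."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def compressed_inputs(paths):
    """Return the paths that still look gzip-compressed and need decompressing."""
    return [str(p) for p in paths if is_gzipped(p)]
```

Any path returned by `compressed_inputs` should be decompressed (for example with `gunzip`) before it is referenced in the configuration file.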
Recommended Directory Structure
We recommend creating a main directory for your project's input data and then creating a subdirectory for each species downloaded from Phytozome. Inside each species directory, you should have two subdirectories: assembly and annotation.
/path/to/your/project/
└── data/
    └── Phytozome/
        └── Athaliana_447_Araport11/    <-- Main species directory
            ├── assembly/
            │   └── Athaliana_447_TAIR10.fa    <-- Place the uncompressed FASTA file here
            └── annotation/
                └── Athaliana_447_Araport11.gene_exons.gff3    <-- Place the uncompressed GFF3 file here
- For more details, please see our 0.3 Input Files: Preparing Genome and Transcriptome Data page
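The layout above can be checked programmatically before a run. This is a hedged sketch rather than SpliceScape code: `find_species_inputs` is a hypothetical helper, and it assumes each species directory holds exactly one `.fa` file in `assembly/` and one `.gff3` file in `annotation/`.

```python
from pathlib import Path


def _single_match(directory, pattern):
    """Return the only file in `directory` matching `pattern`, or raise ValueError."""
    matches = sorted(Path(directory).glob(pattern))
    if len(matches) != 1:
        raise ValueError(
            f"expected exactly one {pattern} file in {directory}, found {len(matches)}"
        )
    return matches[0]


def find_species_inputs(species_dir):
    """Locate the FASTA and GFF3 files under a Phytozome-style species directory."""
    species_dir = Path(species_dir)
    fasta = _single_match(species_dir / "assembly", "*.fa")
    gff = _single_match(species_dir / "annotation", "*.gff3")
    return fasta, gff
```

The two returned paths correspond to the values you would enter for genomeFASTA and genomeGFF in the configuration file.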
The pipeline is primarily configured using the splicescape_paired.config file. You must edit this file to provide the paths to your input files and executables. Below are the most critical parameters to set:
| Section | Name/Variable | Description | Example |
|---|---|---|---|
| | workDir | Where files generated by Splicescape will be saved. | "/home/Splicescape/results" |
| params | outdir | Where Splicescape results will be saved (inside workDir). | "${workDir}/output" |
| params | species | Species name - Must at least start with the Phytozome naming convention. | "Athaliana_447" |
| params | reads_file | File containing SRR identifiers of interest (one per line). | "/home/Splicescape/data/SRRs.txt" |
| params | genomeFASTA | Path to genome sequence of interest. | "/home/Splicescape/data/Phytozome/Athaliana_447_Araport11/assembly/Athaliana_447_TAIR10.fa" |
| params | genomeGFF | Path to genome annotation. | "/home/Splicescape/data/Phytozome/Athaliana_447_Araport11/annotation/Athaliana_447_Araport11.gene_exons.gff3" |
| params | genome_path | Path to Phytozome assembly directory. | "/home/Splicescape/data/Phytozome/Athaliana_447_Araport11/assembly" |
| params | genome | Path containing both assembly and annotation. | "/home/Splicescape/data/Phytozome/Athaliana_447_Araport11" |
| params | threads | Cores for genomic index generation with STAR. | 12 |
| params | bbduk | Path to BBDuK executable. | "/home/Splicescape/progs/bbmap_35.85/bbduk2.sh" |
| params | minlength | Minimum read length for BBDuK cleaning. | 60 |
| params | trimq | Quality threshold for base trimming in sequencing reads (BBDuK). | 20 |
| params | k | k-mer size used by BBDuK to find matches and remove specific sequences. | 27 |
| params | rref | Path to file with sequences to be removed from reads (BBDuK). | "/home/Splicescape/progs/BBMap/resources/adapters.fa" |
| params | maxmem | Maximum memory allocated for BBDuK. | "20g" |
| params | r_libs | Path where R libraries are installed. | "/home/Splicescape/R/library" |
| params | sgseq_cores | Maximum cores used by SGSeq. | 4 |
| params | majiq_path | Path to MAJIQ bin directory. | "/home/Splicescape/majiq/bin" |
| params | majiq_cores | Maximum cores used by MAJIQ. | 8 |
| executor | queueSize | Maximum number of files that can be simultaneously submitted to a cluster. | 50 |
| Activity Report (Report, timeline, trace) | enabled | Enables report generation. | TRUE |
| Activity Report (Report, timeline, trace) | file | Report filenames. | "report.html", "timeline.html", "trace.txt" |
| Activity Report (Report, timeline, trace) | overwrite | Allows overwriting files with same name (Caution when using -resume). | TRUE |
| SGE Cluster Profile | executor | Instructs Nextflow to submit each task as a "job" to SGE scheduler. | sge |
| SGE Cluster Profile | queue | Specifies default SGE queue for job submission. | splicing.q@cluster |
| SGE Cluster Profile | clusterOptions | Default options passed directly to SGE's qsub command for each job. | -S /bin/bash -V -pe smp 2 |
| SGE Cluster Profile | scratch | Enables use of fast "scratch" directory on cluster for intermediate files. | FALSE |
| SGE Cluster Profile | maxForks | Limits execution to maximum concurrent processes of each type. | 5 |
| SGE Cluster Profile | withName | Overrides default settings for specific processes (resource allocation). | "DOWNLOAD_READ_FTP { clusterOptions = '-S /bin/bash -pe smp 2 -l h_vmem=2G -V', maxForks = 10 }" |
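Many of the parameters above are filesystem paths, and a typo in any of them typically only surfaces once the corresponding process starts. The sketch below is one possible pre-flight check, not something shipped with SpliceScape. It does not parse the Nextflow config; it assumes you copy the relevant values by hand into a Python dictionary, and `missing_paths` is a hypothetical name.

```python
import os

# Parameter names from the table above whose values are filesystem paths.
PATH_PARAMS = [
    "reads_file", "genomeFASTA", "genomeGFF", "genome_path", "genome",
    "bbduk", "rref", "r_libs", "majiq_path",
]


def missing_paths(params):
    """Return {name: value} for path parameters that are unset or not on disk."""
    problems = {}
    for name in PATH_PARAMS:
        value = params.get(name)
        if not value or not os.path.exists(value):
            problems[name] = value
    return problems
```

An empty dictionary means every listed path exists; otherwise the keys name the parameters to correct in splicescape_paired.config.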
Execute the pipeline using the nextflow run command. If you are using a cluster with a scheduler like SGE, you can use a profile.
nextflow run splicescape_paired.nf -c splicescape_paired.config -profile sge -resume
These files are located in reads_processing.
In cases where direct downloads from the NCBI SRA are not feasible or have failed, SpliceScape provides an alternative workflow to download reads from a private, password-protected server where the data has been pre-staged.
To use this method, you must:
- Run the splicescape_paired_cloud.nf pipeline script instead of the default one.
- Add the following parameters for your private server to your .config file:
| Parameter | Description |
|---|---|
| params.url | The base URL of the directory on the cloud server. |
| params.user | The username required for authentication. |
| params.password | The password required for authentication. |
Your run command will then specify the alternative workflow script:
nextflow run splicescape_paired_cloud.nf -c your_config_file.config -profile sge -resume
This will use the WGET_DOWNLOADER process to securely download the files before proceeding with the standard cleaning and analysis steps.
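To illustrate how the three parameters fit together, here is a rough sketch of an authenticated download followed by a checksum comparison. It is not the actual WGET_DOWNLOADER implementation, which is not shown here. The helper names, the example filename, and the assumption that an expected MD5 value is available for each file are all hypothetical; only the `wget` options `--user`, `--password` and `--directory-prefix` are standard.

```python
import hashlib


def build_wget_command(url, user, password, filename, dest_dir="."):
    """Build the wget argument list for one file under the base URL."""
    target = url.rstrip("/") + "/" + filename
    return [
        "wget",
        f"--user={user}",
        f"--password={password}",
        f"--directory-prefix={dest_dir}",
        target,
    ]


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The list returned by `build_wget_command` can be passed to `subprocess.run(..., check=True)`, and the result of `md5_of_file` compared against a known checksum. Note that a password passed on the command line is visible to other users in the process list, so a restricted-permission `.wgetrc` or `.netrc` file may be preferable on shared clusters.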
For a detailed explanation of each step, advanced configuration, and tutorials, please visit our SpliceScape Wiki.
