SpliceScape

SpliceScape is a bioinformatics pipeline designed for the large-scale identification and characterization of splicing events from RNA-Seq data. Built using the Nextflow workflow orchestrator, it provides an efficient, reproducible, and scalable solution for generating comprehensive splicing landscapes of an organism.

The pipeline automates all critical steps of splicing analysis, from raw data processing to the final characterization of events. It integrates state-of-the-art tools to ensure high accuracy and performance, including:

Data Cleaning: BBDuK
Splicing-Aware Alignment: STAR
Splicing Event Identification and Quantification: MAJIQ and SGSeq

This approach allows SpliceScape to handle large public datasets, making it a powerful tool for comparative transcriptomics and for exploring the dynamics of RNA processing.

📊 The diagram below provides an overview of the SpliceScape workflow, from the initial input files to the final results.

✅ The pipeline is divided into distinct stages:

Pre-processing: Metadata acquisition and filtering.
Core Pipeline: Downloading reads, quality control (BBDuK), genome indexing (STAR), mapping (STAR), and splicing analysis (MAJIQ & SGSeq).
Post-processing: Parsing and integrating results into a unified database.

Quick Start

Dependencies

SpliceScape has several dependencies. The main ones, listed below, must be installed before using the pipeline. We recommend installing all of them in a conda environment.

Core Software & Tools

Nextflow (v24.10.5 or later)
STAR
BBMap (provides BBDuK v35.85)
Samtools
MAJIQ
ffq (from pip)
wget & md5sum (Standard Linux/macOS utilities)

Programming Environments

Python (v3.7 or higher)
R (v4.4)

Required Python Libraries

requests (v2.28.1)
beautifulsoup4 (v0.0.1)
biopython (v1.79)
pandas (v1.4.0)

Required R Libraries

optparse (v1.7.5)
SGSeq (v1.38.0)
GenomicFeatures (v1.56.0)

Input Files

SpliceScape requires three main inputs to start an analysis:

Reference Genome: An uncompressed reference genome file in .fasta format, preferably downloaded from Phytozome.
Genome Annotation: An uncompressed genome annotation file in .gff3 format, also from Phytozome.
Sample List: A plain text (.txt) file containing the SRA accessions for your target samples, with one identifier per line.

For detailed instructions on how to obtain and format these files, please refer to the SpliceScape Wiki.
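As a quick sanity check before launching a run, the sample list can be validated with a few lines of Python. This is only a sketch: the function name `check_sample_list` is hypothetical and not part of SpliceScape, and it assumes that every identifier is a run accession of the form SRR, ERR, or DRR followed by digits.

```python
import re
from pathlib import Path

# Assumption: run accessions are SRR/ERR/DRR followed by digits only.
RUN_ACCESSION = re.compile(r"[SED]RR\d+")


def check_sample_list(path):
    """Split a one-per-line accession file into (valid, invalid) lists."""
    valid, invalid = [], []
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if RUN_ACCESSION.fullmatch(line):
            valid.append(line)
        else:
            invalid.append(line)
    return valid, invalid
```

Any entry reported in the second list should be corrected or removed before the file is passed to the pipeline.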

Running

1. Clone the repository:

git clone https://github.com/labbces/SpliceScape.git
cd SpliceScape

2. Prepare input files:

SpliceScape is designed to work with the standard file formats and structures provided by the Phytozome database. For the pipeline to locate the genome and annotation files correctly, you must follow a specific directory structure.

The three main files required for each species are:

A reference genome in FASTA format (.fa), from Phytozome.
A genome annotation file in GFF3 format (.gff3), from Phytozome.
A plain text file (.txt) listing the target SRA accessions, with one per line.

A critical requirement is that both genome files must be uncompressed before running the pipeline. SpliceScape does not handle gzipped files (e.g., .fa.gz) for the genome and annotation inputs.
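If the files were downloaded in gzipped form, they can be decompressed with the Python standard library. The helper below is a minimal sketch; `ensure_uncompressed` is a hypothetical name, not a SpliceScape function. It leaves the original .gz file in place and writes the uncompressed copy next to it.

```python
import gzip
import shutil
from pathlib import Path


def ensure_uncompressed(path):
    """Return the path of an uncompressed file, decompressing a .gz input if needed."""
    src = Path(path)
    if src.suffix != ".gz":
        return src  # nothing to do, the file is already uncompressed
    dest = src.with_suffix("")  # removes only the trailing ".gz"
    if not dest.exists():
        # Stream the data so large genomes are not loaded into memory at once.
        with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    return dest
```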

Recommended Directory Structure

We recommend creating a main directory for your project's input data and then creating a subdirectory for each species downloaded from Phytozome. Inside each species directory, you should have two subdirectories: assembly and annotation.

/path/to/your/project/
└── data/
    └── Phytozome/
        └── Athaliana_447_Araport11/      <-- Main species directory
            ├── assembly/
            │   └── Athaliana_447_TAIR10.fa   <-- Place the uncompressed FASTA file here
            └── annotation/
                └── Athaliana_447_Araport11.gene_exons.gff3  <-- Place the uncompressed GFF3 file here
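
The layout above can be checked programmatically before configuring the pipeline. The sketch below assumes exactly one .fa file under assembly and one .gff3 file under annotation. That is an assumption made for illustration, and `find_genome_files` is a hypothetical helper rather than part of SpliceScape.

```python
from pathlib import Path


def find_genome_files(species_dir):
    """Locate the FASTA and GFF3 files inside a Phytozome-style species directory."""
    root = Path(species_dir)
    fasta = sorted((root / "assembly").glob("*.fa"))
    gff3 = sorted((root / "annotation").glob("*.gff3"))
    # Assumption: one genome and one annotation file per species directory.
    if len(fasta) != 1 or len(gff3) != 1:
        raise ValueError(
            f"expected one .fa and one .gff3 under {root}, "
            f"found {len(fasta)} and {len(gff3)}"
        )
    return fasta[0], gff3[0]
```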

For more details, please see our 0.3 Input Files: Preparing Genome and Transcriptome Data page.

3. Configure the pipeline:

The pipeline is primarily configured using the splicescape_paired.config file. You must edit this file to provide the paths to your input files and executables. Below are the most critical parameters to set:

| Section | Name/Variable | Description | Example |
| --- | --- | --- | --- |
| | workDir | Where files generated by SpliceScape will be saved. | `"/home/Splicescape/results"` |
| params | outdir | Where SpliceScape results will be saved (inside workDir). | `"${workDir}/output"` |
| | species | Species name. Must at least start with the Phytozome naming convention. | `"Athaliana_447"` |
| | reads_file | File containing the SRR identifiers of interest (one per line). | `"/home/Splicescape/data/SRRs.txt"` |
| | genomeFASTA | Path to the genome sequence of interest. | `"/home/Splicescape/data/Phytozome/Athaliana_447_Araport11/assembly/Athaliana_447_TAIR10.fa"` |
| | genomeGFF | Path to the genome annotation. | `"/home/Splicescape/data/Phytozome/Athaliana_447_Araport11/annotation/Athaliana_447_Araport11.gene_exons.gff3"` |
| | genome_path | Path to the Phytozome assembly directory. | `"/home/Splicescape/data/Phytozome/Athaliana_447_Araport11/assembly"` |
| | genome | Path to the directory containing both assembly and annotation. | `"/home/Splicescape/data/Phytozome/Athaliana_447_Araport11"` |
| | threads | Cores for genome index generation with STAR. | `12` |
| | bbduk | Path to the BBDuK executable. | `"/home/Splicescape/progs/bbmap_35.85/bbduk2.sh"` |
| | minlength | Minimum read length for BBDuK cleaning. | `60` |
| | trimq | Quality threshold for base trimming in sequencing reads (BBDuK). | `20` |
| | k | k-mer size used by BBDuK to find matches and remove specific sequences. | `27` |
| | rref | Path to the file with sequences to be removed from reads (BBDuK). | `"/home/Splicescape/progs/BBMap/resources/adapters.fa"` |
| | maxmem | Maximum memory allocated for BBDuK. | `"20g"` |
| | r_libs | Path where R libraries are installed. | `"/home/Splicescape/R/library"` |
| | sgseq_cores | Maximum cores used by SGSeq. | `4` |
| | majiq_path | Path to the MAJIQ bin directory. | `"/home/Splicescape/majiq/bin"` |
| | majiq_cores | Maximum cores used by MAJIQ. | `8` |
| executor | queueSize | Maximum number of jobs that can be queued on the cluster at the same time. | `50` |
| Activity Report (Report, timeline, trace) | enabled | Enables report generation. | `true` |
| | file | Report filenames. | `"report.html"`, `"timeline.html"`, `"trace.txt"` |
| | overwrite | Allows overwriting files with the same name (use caution with -resume). | `true` |
| SGE Cluster Profile | executor | Instructs Nextflow to submit each task as a job to the SGE scheduler. | `sge` |
| | queue | Specifies the default SGE queue for job submission. | `splicing.q@cluster` |
| | clusterOptions | Default options passed directly to SGE's qsub command for each job. | `-S /bin/bash -V -pe smp 2` |
| | scratch | Enables use of a fast "scratch" directory on the cluster for intermediate files. | `false` |
| | maxForks | Limits each process type to a maximum number of concurrent instances. | `5` |
| | withName | Overrides default settings for specific processes (resource allocation). | `"DOWNLOAD_READ_FTP { clusterOptions = '-S /bin/bash -pe smp 2 -l h_vmem=2G -V', maxForks = 10 }"` |
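
Several of the path parameters above are related: in the examples, genome_path is the assembly directory inside genome, genomeFASTA lives in genome_path, genomeGFF lives in the annotation directory, and the species directory name starts with species. The sketch below checks those relationships before a run. The relationships are inferred from the example values only, the pipeline itself may not enforce them, and `check_path_params` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def check_path_params(params):
    """Return a list of inconsistencies among the path parameters (empty if none)."""
    problems = []
    genome = PurePosixPath(params["genome"])
    genome_path = PurePosixPath(params["genome_path"])
    fasta = PurePosixPath(params["genomeFASTA"])
    gff = PurePosixPath(params["genomeGFF"])

    # Relationships below are assumptions read off the example configuration.
    if genome_path != genome / "assembly":
        problems.append("genome_path is not the assembly directory inside genome")
    if fasta.parent != genome_path:
        problems.append("genomeFASTA is not located in genome_path")
    if gff.parent != genome / "annotation":
        problems.append("genomeGFF is not located in the annotation directory")
    if not genome.name.startswith(params["species"]):
        problems.append("species directory name does not start with species")
    return problems
```

An empty list means the values follow the same pattern as the examples in the table. Each returned string names one parameter to re-check.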

4. Run SpliceScape:

Execute the pipeline using the nextflow run command. If you are using a cluster with a scheduler like SGE, you can use a profile.

nextflow run splicescape_paired.nf -c splicescape_paired.config -profile sge -resume

The workflow script and its config file are located in the reads_processing directory of the repository.

☁️ Alternative Download from a Private Cloud

In cases where direct downloads from the NCBI SRA are not feasible or have failed, SpliceScape provides an alternative workflow to download reads from a private, password-protected server where the data has been pre-staged.

To use this method, you must:

Run the splicescape_paired_cloud.nf pipeline script instead of the default one.
Add the following parameters for your private server to your .config file:

| Parameter | Description |
| --- | --- |
| params.url | The base URL of the directory on the cloud server. |
| params.user | The username required for authentication. |
| params.password | The password required for authentication. |

Your run command will then specify the alternative workflow script:

nextflow run splicescape_paired_cloud.nf -c your_config_file.config -profile sge -resume

This will use the WGET_DOWNLOADER process to securely download the files before proceeding with the standard cleaning and analysis steps.
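
To illustrate how the three parameters fit together, the sketch below assembles a wget command line from them. It is not the actual WGET_DOWNLOADER implementation: the function name, the file naming on the server (`SRR1234567_1.fastq.gz`), and the output directory are all assumptions. Only the wget options `--user`, `--password`, and `--directory-prefix` are standard.

```python
def build_wget_command(url, user, password, filename, outdir="."):
    """Build an argument list that downloads one file with HTTP authentication."""
    # Join the base URL and file name with exactly one slash between them.
    target = url.rstrip("/") + "/" + filename
    return [
        "wget",
        f"--user={user}",
        f"--password={password}",
        f"--directory-prefix={outdir}",
        target,
    ]


if __name__ == "__main__":
    # Hypothetical values standing in for params.url, params.user, params.password.
    cmd = build_wget_command(
        "https://example.org/reads/", "alice", "secret", "SRR1234567_1.fastq.gz"
    )
    print(" ".join(cmd))
```

Note that a password passed on the command line is visible to other users in the process list, so shared machines may call for a different mechanism.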

Full Documentation

For a detailed explanation of each step, advanced configuration, and tutorials, please visit our SpliceScape Wiki.
