A fast tool to find and visualize rearrangements in DNA sequences.
To install Smash++ on various operating systems, follow the instructions below. Building from source requires CMake (>= 3.9) and a C++14-compliant compiler. Note that a precompiled executable is available for 64-bit operating systems in the experiment/bin directory.
Install Miniconda, then run the following:
conda install -y -c bioconda smashpp
- Install Git and CMake:
sudo apt update
sudo apt install git cmake
- Clone Smash++ and install it:
git clone https://github.com/smortezah/smashpp.git
cd smashpp
./install.sh
- Install Homebrew, Git and CMake:
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
brew install git cmake
- Clone Smash++ and install it:
git clone https://github.com/smortezah/smashpp.git
cd smashpp
./install.sh
Install WSL (Windows Subsystem for Linux), then clone Smash++ and install it, as on Linux/macOS:
git clone https://github.com/smortezah/smashpp.git
cd smashpp
./install.sh
Run Smash++ as follows:
./smashpp [OPTIONS] -r <REF-FILE> -t <TAR-FILE>
For example,
./smashpp -r ref -t tar
It is recommended to choose short names for reference and target sequences.
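When Smash++ is called from a script, the command line can be assembled and checked before launching the tool. The following is a minimal sketch, not part of Smash++: the function name `build_command` and the `RANGES` table are hypothetical, and the ranges are copied from the usage guide printed by `./smashpp` for only four of the options.

```python
# Ranges copied from the Smash++ usage guide; only a few options are covered.
RANGES = {
    "-l": (0, 6),            # level of compression
    "-m": (1, 4294967295),   # min segment size
    "-n": (1, 255),          # number of threads
    "-th": (0.0, 20.0),      # threshold
}


def build_command(ref, tar, exe="./smashpp", **options):
    """Return the argument list for one Smash++ run.

    Options are passed without the leading dash, e.g. l=0, m=1000.
    This helper is an illustration and is not shipped with Smash++.
    """
    argv = [exe]
    for name, value in options.items():
        flag = "-" + name
        if flag not in RANGES:
            raise ValueError(f"option not covered by this sketch: {flag}")
        low, high = RANGES[flag]
        if not low <= value <= high:
            raise ValueError(f"{flag}={value} is outside [{low}, {high}]")
        argv += [flag, str(value)]
    # Follow the synopsis: options first, then reference and target.
    argv += ["-r", ref, "-t", tar]
    return argv
```

For example, `build_command("ref", "tar", l=0, m=1000)` returns a list that can be passed directly to `subprocess.run`.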
To see the possible options for Smash++, type:
./smashpp
which provides the following:
SYNOPSIS
./smashpp [OPTIONS] -r <REF-FILE> -t <TAR-FILE>
OPTIONS
Required:
-r <FILE> = reference file (Seq/FASTA/FASTQ)
-t <FILE> = target file (Seq/FASTA/FASTQ)
Optional:
-l <INT> = level of compression: [0, 6]. Default -> 3
-m <INT> = min segment size: [1, 4294967295] -> 50
-e <FLOAT> = entropy of 'N's: [0.0, 100.0] -> 2.0
-n <INT> = number of threads: [1, 255] -> 4
-f <INT> = filter size: [1, 4294967295] -> 100
-ft <INT/STRING> = filter type (windowing function): -> hann
{0/rectangular, 1/hamming, 2/hann,
3/blackman, 4/triangular, 5/welch,
6/sine, 7/nuttall}
-fs [S][M][L] = filter scale:
{S/small, M/medium, L/large}
-d <INT> = sampling steps -> 1
-th <FLOAT> = threshold: [0.0, 20.0] -> 1.5
-rb <INT> = ref beginning guard: [-32768, 32767] -> 0
-re <INT> = ref ending guard: [-32768, 32767] -> 0
-tb <INT> = tar beginning guard: [-32768, 32767] -> 0
-te <INT> = tar ending guard: [-32768, 32767] -> 0
-ar = consider asymmetric regions -> no
-nr = do NOT compute self complexity -> no
-sb = save sequence (input: FASTA/FASTQ) -> no
-sp = save profile (*.prf) -> no
-sf = save filtered file (*.fil) -> no
-ss = save segmented files (*.s[i]) -> no
-sa = save profile, filtered and -> no
segmented files
-rm k,[w,d,]ir,a,g/t,ir,a,g:...
-tm k,[w,d,]ir,a,g/t,ir,a,g:...
= parameters of models
<INT> k: context size
<INT> w: width of sketch in log2 form,
e.g., set 10 for w=2^10=1024
<INT> d: depth of sketch
<INT> ir: inverted repeat: {0, 1, 2}
0: regular (not inverted)
1: inverted, solely
2: both regular and inverted
<FLOAT> a: estimator
<FLOAT> g: forgetting factor: [0.0, 1.0)
<INT> t: threshold (no. substitutions)
-ll = list of compression levels
-h = usage guide
-v = more information
--version = show version
AUTHOR
Morteza Hosseini seyedmorteza@ua.pt
SAMPLE
./smashpp -r ref -t tar -l 0 -m 1000
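The model string accepted by -rm and -tm is compact and easy to mistype. The sketch below assembles it from named fields, reading the synopsis `k,[w,d,]ir,a,g/t,ir,a,g:...` as follows: `w,d` are optional and given together, the part after `/` is optional, and models are separated by `:`. That reading, the `Model` class and the `encode_models` function are assumptions made for illustration, and the parameter values in the usage note are arbitrary rather than recommended settings.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Model:
    k: int                    # context size
    ir: int                   # inverted repeat: 0, 1 or 2
    a: float                  # estimator
    g: float                  # forgetting factor, in [0.0, 1.0)
    w: Optional[int] = None   # sketch width in log2 form (10 means 2^10)
    d: Optional[int] = None   # sketch depth
    # Optional part after "/": (t, ir, a, g), t = threshold (no. substitutions)
    tolerant: Optional[Tuple[int, int, float, float]] = None

    def encode(self) -> str:
        if self.ir not in (0, 1, 2):
            raise ValueError("ir must be 0, 1 or 2")
        if not 0.0 <= self.g < 1.0:
            raise ValueError("g must be in [0.0, 1.0)")
        if (self.w is None) != (self.d is None):
            raise ValueError("w and d must be given together")
        fields = [self.k]
        if self.w is not None:
            fields += [self.w, self.d]
        fields += [self.ir, self.a, self.g]
        text = ",".join(str(f) for f in fields)
        if self.tolerant is not None:
            text += "/" + ",".join(str(f) for f in self.tolerant)
        return text


def encode_models(models: Sequence[Model]) -> str:
    """Join several models with ':' as in the -rm / -tm synopsis."""
    return ":".join(m.encode() for m in models)
```

With `Model(k=20, ir=0, a=0.001, g=0.95, tolerant=(5, 0, 0.001, 0.95))` and `Model(k=18, w=10, d=2, ir=1, a=0.01, g=0.9)`, `encode_models` yields `20,0,0.001,0.95/5,0,0.001,0.95:18,10,2,1,0.01,0.9`, where `w=10` stands for a sketch width of 2^10 = 1024.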
To see the options for Smash++ Visualizer, type:
./smashpp -viz
which provides the following:
SYNOPSIS
./smashpp -viz [OPTIONS] -o <SVG-FILE> <POS-FILE>
OPTIONS
Required:
<POS-FILE> = position file, generated by
Smash++ tool (*.pos)
Optional:
-o <SVG-FILE> = output image name (*.svg). Default -> map.svg
-rn <STRING> = reference name shown on output. If it
has spaces, use double quotes, e.g.
"Seq label". Default: name in header
of position file
-tn <STRING> = target name shown on output
-l <INT> = type of the link between maps: [1, 6] -> 1
-c <INT> = color mode: [0, 1] -> 0
-p <FLOAT> = opacity: [0.0, 1.0] -> 0.9
-w <INT> = width of the sequence: [8, 100] -> 10
-s <INT> = space between sequences: [5, 200] -> 40
-tc <INT> = total number of colors: [1, 255]
-rt <INT> = reference tick: [1, 4294967295]
-tt <INT> = target tick: [1, 4294967295]
-th [0][1] = tick human readable: 0=false, 1=true -> 1
-m <INT> = minimum block size: [1, 4294967295] -> 1
-vv = vertical view -> no
-nrr = do NOT show relative redundancy -> no
(relative complexity)
-nr = do NOT show redundancy -> no
-ni = do NOT show inverse maps -> no
-ng = do NOT show regular maps -> no
-n = show 'N' bases -> no
-stat = save stats (*.csv) -> stat.csv
-h = usage guide
-v = more information
--version = show version
AUTHOR
Morteza Hosseini seyedmorteza@ua.pt
SAMPLE
./smashpp -viz -vv -o simil.svg ref.tar.pos
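The visualizer consumes the position file written by a preceding Smash++ run, so the two steps are naturally chained. Here is a minimal sketch of that chain in Python. It assumes that the position file is named `<ref>.<tar>.pos` in the working directory, a pattern inferred from the sample commands (`ref` and `tar` give `ref.tar.pos`); the function names are hypothetical.

```python
import subprocess


def pos_file_name(ref, tar):
    # Assumption: inferred from the samples, where inputs "ref" and "tar"
    # produce "ref.tar.pos". Plain file names in the working directory only.
    return f"{ref}.{tar}.pos"


def pipeline_commands(ref, tar, svg="map.svg", exe="./smashpp"):
    """Return the two argument lists: find rearrangements, then draw them."""
    find = [exe, "-r", ref, "-t", tar]
    draw = [exe, "-viz", "-o", svg, pos_file_name(ref, tar)]
    return [find, draw]


def run_pipeline(ref, tar, svg="map.svg", exe="./smashpp"):
    """Run both steps in order, stopping at the first one that fails."""
    for argv in pipeline_commands(ref, tar, svg, exe):
        subprocess.run(argv, check=True)
```

Calling `run_pipeline("ref", "tar", svg="simil.svg")` would run Smash++ and then the visualizer, provided the `smashpp` executable is in the current directory.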
After installing Smash++, copy its executable file into the example directory and go to that directory:
cp smashpp example/
cd example/
This directory contains two 1000-base sequences: the reference sequence, named ref, and the target sequence, named tar. Now, run Smash++ and the visualizer:
./smashpp -r ref -t tar
./smashpp -viz -o example.svg ref.tar.pos
To reproduce the results in the paper, we provide the Python script xp.py in the experiment/ directory, which runs Smash++ on synthetic and real genomic data. With this script, you can automatically generate the synthetic datasets or download the real ones, run Smash++ on them using predefined parameters, and benchmark the method.
To use xp.py, switch False to True for the desired dataset(s) at the beginning of the file. The script then runs Smash++ on the selected dataset(s) and saves the results in the result/ directory, including:
- a *.pos file, which contains the positions of similar regions, plus self- and relative-redundancy values. Its header also includes the parameters used to run Smash++ and the sizes of the reference and target files.
- a *.svg file with the similar regions visualized. This file is the output of the Smash++ visualizer.
- the bench.csv file, which provides the time and memory usage of Smash++. When comparing with Smash (the first version), this file also provides the time and memory usage of the Smash method.
- in some cases, a *.csv file with the number of regular and inverted regions among the detected rearrangements. This file is generated when the -stat flag is enabled for the Smash++ visualizer.
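After a run of xp.py, the output files can be sorted by type for a quick overview. The sketch below groups the files of a directory by extension. It assumes only that the results sit directly in a directory such as result/, as described above; the function name `group_results` is hypothetical, and nothing is assumed about the contents of the files.

```python
from collections import defaultdict
from pathlib import Path


def group_results(result_dir="result"):
    """Map each file extension (e.g. '.pos') to the sorted file names found."""
    groups = defaultdict(list)
    for path in sorted(Path(result_dir).iterdir()):
        if path.is_file():
            groups[path.suffix].append(path.name)
    return dict(groups)
```

Note that bench.csv and any stats file from the visualizer share the `.csv` extension, so both appear under the same key.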
Note that xp.py requires conda to download the real datasets using the Entrez Direct (EDirect) utility. If EDirect is not already installed, the script installs it automatically via conda.
If you use Smash++, please cite the following:
- M. Hosseini, D. Pratas, B. Morgenstern, A.J. Pinho, "Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements," GigaScience, vol. 9, no. 5, 2020.
Please let us know if there are any issues.
Copyright © 2018-2020 Morteza Hosseini -- IEETA, University of Aveiro, Portugal.
Smash++ is licensed under GNU GPL v3.