#Motif Displacement Calculator

This package provides algorithms to scan for significant TF-binding motif sites within regulatory DNA, i.e. enhancers and promoters. To measure co-occurrence between motifs and regulatory DNA, it implements the so-called motif displacement (MD) score: the proportion of motif sites falling within some radius (-h) of a regulatory DNA center, relative to those falling within a larger local background window (-H).
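
As a rough illustration of the MD score definition, the sketch below computes the score from a list of signed motif displacements (motif position minus regulatory element center, in bp). The function name `md_score`, the inclusive boundaries, and the value `h=150` are assumptions made for this example, not the package's actual implementation; `H=1500` mirrors the DB module default.

```python
def md_score(displacements, h=150, H=1500):
    """Proportion of motif hits within h bp of the center among hits within H bp.

    displacements: signed distances (bp) from a regulatory element center
    to a significant motif site. h=150 is a hypothetical radius; H=1500
    mirrors the DB module default.
    """
    if not 0 < h <= H:
        raise ValueError("expected 0 < h <= H")
    # Keep only hits that fall inside the large background window.
    in_background = [d for d in displacements if abs(d) <= H]
    if not in_background:
        return float("nan")  # no hits in the window, so the score is undefined
    # Count the subset of those hits that also fall inside the small radius.
    in_center = sum(1 for d in in_background if abs(d) <= h)
    return in_center / len(in_background)
```

For example, `md_score([-20, 5, 140, -600, 1200, 3000])` returns `0.6`: five hits lie within 1500 bp and three of those lie within 150 bp.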
This package consists of two modules (DB and EVAL), invoked as follows:

    mpirun -np <.> SE DB <parameter flags and values>
    mpirun -np <.> SE EVAL <parameter flags and values>

As the run command suggests, this C++ package has three dependencies:

1. A compiler supporting C++11
2. OpenMP (`#include <omp.h>`)
3. MPI (`#include <mpi.h>`)

To build:

    $ cd /CPP_src/
    $ make clean
    $ make

The makefile requires mpic++ to be on your PATH, so Open MPI must be installed and configured first. Installing and configuring your gcc compilers will likely cause a headache. I found these sites useful

#Modules
##DB

A fair warning: running this module will likely take upwards of a week on a single-node machine, so in practice it requires access to a large compute cluster. Precomputed DB files for the EVAL module, for both human and mouse, are located in PSSM_DB/. Fortunately, this file type is independent of the genome build and so does not need to be regenerated even if a new fasta file is used with the EVAL module. However, if you would like to re-estimate the GC distribution at your regulatory elements of interest, then go for it!

| Flag | Type | Description |
|---|---|---|
| -ID | some string | An identifier, all output files will begin with this prefix |
| -bed | /path/to/.bed | A bed file over which GC content will be averaged (e.g. Tfit or MACS output) |
| -DB | /path/to/file/from/<PSSM_DB> | A specific file from PSSM_DB/; gives the motif models |
| -o | /path/to/ | Will output <-ID>.db; this will be required for the EVAL module |
| -fasta | /path/to/genome.fasta | fasta file of the same genome build as <-bed> |
| -log_out | /path/to | Where temporary and final log files will be generated (<-ID>.log) |
| -H | numerical | Distance around each region center over which sequence will be collected (default = 1500bp) |
| -pv | numerical | p-value threshold under which a motif site will be considered significant |
| -sim_N | numerical | Number of random sequences to generate (default = 10,000,000) |
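
One way to picture the random sequence generation controlled by -sim_N is the sketch below, which draws a single sequence whose GC probability varies by position. It is only an illustration under assumptions: the function name `simulate_sequence`, the per-position `gc_profile` input, and the even split between G and C (and between A and T) are all hypothetical, and the DB module may generate its sequences differently.

```python
import random

def simulate_sequence(gc_profile, rng=None):
    """Draw one DNA sequence with a position-specific GC probability.

    gc_profile: iterable of probabilities in [0, 1], one per position
    (hypothetical input, e.g. GC content averaged over bed regions).
    """
    if rng is None:
        rng = random.Random()
    bases = []
    for gc in gc_profile:
        # With probability gc emit G or C, otherwise A or T (assumed even split).
        if rng.random() < gc:
            bases.append(rng.choice("GC"))
        else:
            bases.append(rng.choice("AT"))
    return "".join(bases)
```

Repeating such draws many times and scanning each sequence for motif sites would yield a background distribution of displacements that reflects the GC profile rather than a uniform model.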
###Output file type (-o)
Above: a screenshot of a small portion of the db file that the DB module outputs. The file is divided into blocks, one per PSSM model (641 in human). Each block (delimited by the ~ symbol) contains the probability distribution matrix of that PSSM model. The final lines give the empirical distribution of motif displacement estimated from the non-stationary GC content surrounding the input bed file.
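
To illustrate how a probability matrix like those stored in each block can be used to locate motif sites, here is a minimal log-odds scan. The names `scan_pssm`, `background`, and `threshold` are hypothetical, the matrix is assumed to contain no zero probabilities, and a raw score cutoff stands in for the p-value threshold (-pv) that the package actually uses.

```python
from math import log

def scan_pssm(sequence, pssm, background=0.25, threshold=0.0):
    """Return (start, score) for every window scoring at or above threshold.

    pssm: list of dicts, one per motif position, mapping 'A', 'C', 'G', 'T'
    to strictly positive probabilities (an assumed in-memory layout).
    """
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing N or other ambiguous symbols
        # Log-odds of the window under the motif model versus the background.
        score = sum(log(pssm[j][base] / background) for j, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits
```

The start positions returned here, once converted to distances from each region center, are the displacements from which an MD score is computed.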
##EVAL

The EVAL module computes the so-called motif displacement (MD) score. It follows this general pipeline:

- Takes in as input a bed file corresponding to the regulatory DNA
- Extracts the underlying nucleotide sequence from the provided fasta file (of the same build as the bed file)
- Scans for significant motif sites (using motif models from a .db file, either from the PSSM_DB/ directory or generated by the DB module)
- Assesses the significance of the MD score under a binomial model (stationary background) and under a non-stationary background model estimated from the GC bias by the DB module
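
The stationary-background test in the last step can be sketched as follows. This is a minimal illustration that assumes each motif hit within H falls inside radius h with probability h/H under a uniform background; the function name `binomial_pvalue`, the one-sided upper-tail form, and the value `h=150` are assumptions, not necessarily what the EVAL module does.

```python
from math import exp, lgamma, log

def binomial_pvalue(k, n, h=150, H=1500):
    """Upper-tail P(X >= k) for X ~ Binomial(n, p) with p = h / H.

    k: number of motif hits within h of a center; n: hits within H.
    """
    if not 0 < h < H:
        raise ValueError("expected 0 < h < H")
    if not 0 <= k <= n:
        raise ValueError("expected 0 <= k <= n")
    if k == 0:
        return 1.0
    log_p, log_q = log(h / H), log(1 - h / H)

    def log_pmf(i):
        # log of C(n, i) * p^i * (1 - p)^(n - i), computed in log space
        # so that large n does not overflow.
        log_choose = lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
        return log_choose + i * log_p + (n - i) * log_q

    return min(1.0, sum(exp(log_pmf(i)) for i in range(k, n + 1)))
```

A small p-value indicates that more motif sites sit near the region centers than a uniform placement within the background window would explain.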
| Flag | Type | Description |
|---|---|---|
| -ID | some string | An identifier, all output files will begin with this prefix |
| -bed | /path/to/.bed | A bed file over which motif displacement will be calculated |
| -DB | /path/to/<PSSM_DB> | A specific file from PSSM_DB/; gives the motif models |
| -o | /path/to/ | Will output <-ID>.tsv; this provides information on motif displacements and scores |
| -fasta | /path/to/genome.fasta | fasta file of the same genome build as <-bed> |
| -log_out | /path/to | Where temporary and final log files will be generated (<-ID>.log) |
| -bsn | numerical | Number of random draws from the empirical distribution estimated by the DB module (default = 10,000) |
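
The role of -bsn can be pictured as a resampling test against the non-stationary background. The sketch below is an assumption about how such a test might look, not the EVAL module's code: the name `bootstrap_pvalue`, the `weights` layout (one weight per displacement from -H to H), the add-one p-value estimator, and `h=150` are all hypothetical.

```python
import random

def bootstrap_pvalue(observed_md, n_hits, weights, h=150, bsn=10_000, seed=0):
    """Fraction of resampled null MD scores at least as large as the observed one.

    weights[i] is the background weight of displacement i - H, so
    len(weights) == 2 * H + 1 (an assumed layout for the empirical distribution).
    """
    H = (len(weights) - 1) // 2
    positions = range(-H, H + 1)
    rng = random.Random(seed)
    exceed = 0
    for _ in range(bsn):
        # Draw n_hits displacements from the empirical background distribution.
        draws = rng.choices(positions, weights=weights, k=n_hits)
        null_md = sum(1 for d in draws if abs(d) <= h) / n_hits
        if null_md >= observed_md:
            exceed += 1
    # Add-one estimator so the p-value is never exactly zero.
    return (exceed + 1) / (bsn + 1)
```

Each iteration costs time proportional to `n_hits`, so larger -bsn values give a finer p-value resolution at a proportionally higher runtime.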
#Advanced HPC usage

Those using a compute cluster may find the qsub script below useful!

    #PBS -S /bin/bash
    #PBS -N gTFIv2
    #PBS -l walltime=72:00:00
    #PBS -l nodes=10:ppn=64
    #PBS -l mem=10gb

    # Comma-separated list of the unique hosts allocated to this job
    hostlist=$( cat $PBS_NODEFILE | sort | uniq | tr '\n' ',' | sed -e 's/,$//' )

    # -- OpenMP environment variables --
    # One thread per core requested above (ppn=64)
    OMP_NUM_THREADS=64
    export OMP_NUM_THREADS

    module load gcc_4.9.2
    module load mpich_3.1.4

    # Launch one MPI process per node
    cmd="mpirun -np $PBS_NUM_NODES -hosts ${hostlist}"
    src=/path/to/gTFIv2/CPP_src/repo/SE
    $cmd $src <module> <parameter flags and values>

#Questions/Comments

Please email me (Joey) at joseph[.]azofeifa[@]colorado[.]edu with any questions on usage or bug reports, or open an Issue.
