GitHub - azofeifa/MDS: Algorithms required to compute Motif Displacement Scores

#Motif Displacement Calculator

This package provides the algorithms needed to scan for significant TF-binding motif sites at regulatory DNA, i.e. enhancers and promoters. To measure co-occurrence between motifs and regulatory DNA, it implements the so-called motif displacement (MD) score: the proportion of motif sites falling within some radius (-h) of the regulatory DNA centers, relative to those falling within a larger local background window (-H).
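As a rough illustration of the definition above, the MD score can be sketched in a few lines of Python. This is not the package's C++ implementation. The function name `md_score`, the input format (signed distances in bp from each motif site to the nearest regulatory center), and the value `h=150` are assumptions made for the example; only the 1500 bp default for -H comes from the flag table in this README.

```python
def md_score(displacements, h=150, H=1500):
    """Proportion of motif sites within h bp of a center among those within H bp.

    displacements: signed distances (bp) from motif sites to region centers.
    h=150 is an illustrative value, not a documented default.
    """
    within_H = [d for d in displacements if abs(d) <= H]
    if not within_H:
        return float("nan")  # no motif sites in the background window
    within_h = sum(1 for d in within_H if abs(d) <= h)
    return within_h / len(within_H)


# Hypothetical motif-site displacements around a set of region centers.
hits = [-1200, -90, -10, 0, 35, 140, 600, 1400, 2000]
print(md_score(hits))  # 5 of the 8 sites within H lie within h -> 0.625
```

A score well above what the ratio of window sizes would predict suggests that the motif is concentrated at the centers of the regulatory regions.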

This package consists of two modules (DB/EVAL) and is invoked as below.

mpirun -np <.> SE  DB <parameter flags and values>

mpirun -np <.> SE  EVAL <parameter flags and values>

As the run command suggests, this C++ package has three dependencies:

1. c++11
2. openmp (include <omp.h>)
3. mpi (include <mpi.h>)

To build

$ cd /CPP_src/
$ make clean
$ make

The Makefile requires mpic++ to be on your PATH, so install and configure OpenMPI first. Installing and configuring your gcc compilers will likely cause a headache. I found these sites useful

#Modules

##DB

A fair warning: running this module will likely take upwards of a week on a single-node machine. In short, this module requires access to a large compute cluster. DB files that can be used for the EVAL module, for both human and mouse, are located within PSSM_DB/. Fortunately, this file type is genome-build-free and so does not need to be regenerated even if a new fasta file is used with the EVAL module. However, if you would like to re-estimate the GC distribution at your regulatory elements of interest, then go for it!

| Flag | Type | Description |
|------|------|-------------|
| -ID | some string | An identifier; all output files will begin with this prefix |
| -bed | /path/to/.bed | A bed file over which GC content will be averaged (Tfit, MACs) |
| -DB | /path/to/file/from/<PSSM_DB> | A specific file from PSSM_DB/; gives the motif models |
| -o | /path/to/ | Will output <-ID>.db; this will be required for the EVAL module |
| -fasta | /path/to/genome.fasta | fasta file of the same genome build as <-bed> |
| -log_out | /path/to | Where temporary and final log files will be generated (<-ID>.log) |
| -H | numerical | Distance around which sequence will be collected (default = 1500bp) |
| -pv | numerical | p-value threshold under which a motif will be considered significant |
| -sim_N | numerical | Number of random sequence generations (default = 10,000,000) |
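To make the idea of GC content averaged over a bed file more concrete, here is a minimal Python sketch of a position-specific GC profile. It assumes the sequence windows around each region center have already been extracted from the fasta file and all have the same length. The function name `positional_gc` and the input format are hypothetical, and the DB module's actual estimation procedure may differ.

```python
def positional_gc(windows):
    """Fraction of G/C bases at each offset across equal-length sequence windows."""
    if not windows:
        return []
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    counts = [0] * length
    for w in windows:
        for i, base in enumerate(w.upper()):
            if base in "GC":
                counts[i] += 1
    return [c / len(windows) for c in counts]


# Three toy windows of length 4 (real windows would span 2 * H + 1 bp).
profile = positional_gc(["ACGT", "GCGT", "AAGA"])
print(profile)
```

A profile that varies with distance from the center is what makes the background non-stationary: random sequences drawn from it carry the same positional GC bias as the real regions.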

###Output file type (-o)

Above: A screenshot of a small portion of the db file that the DB module outputs. The file is broken up into blocks, one per PSSM model (641 in human). Each block (delimited by the ~ symbol) contains the probability distribution matrix of that PSSM model. The final lines are the empirical distribution of motif displacement estimated from the non-stationary GC content surrounding the regions in the input bed file.

##EVAL

The EVAL module computes the so-called motif displacement (MD) score. The module follows this generic pipeline:

1. Takes as input a bed file corresponding to the regulatory DNA
2. Extracts the underlying nucleotide sequence from the provided fasta file (of the same build as the bed file)
3. Scans for significant motif sites (using the motif models in the .db file, taken either from the PSSM_DB/ directory or generated by the DB module)
4. Assesses the significance of the MD score under a binomial model (stationary background model) and under a non-stationary background model estimated from the GC bias by the DB module
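The binomial (stationary background) test in the last step can be sketched as follows. This is a minimal Python illustration, not the package's code. It assumes that under a stationary background a motif site within H of a center falls within h with probability h / H. The function names and `h=150` are hypothetical.

```python
import math


def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def md_pvalue_stationary(n_within_h, n_within_H, h=150, H=1500):
    """One-sided p-value for an elevated MD score under a uniform background."""
    p0 = h / H  # assumed null probability that a site lands within h
    return binom_sf(n_within_h, n_within_H, p0)


# 50 of 100 motif sites within h, where about 10 would be expected by chance.
print(md_pvalue_stationary(50, 100))
```

The non-stationary model replaces the fixed probability h / H with one derived from the GC-aware distribution stored in the .db file.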

| Flag | Type | Description |
|------|------|-------------|
| -ID | some string | An identifier; all output files will begin with this prefix |
| -bed | /path/to/.bed | A bed file over which motif displacement will be calculated |
| -DB | /path/to/<PSSM_DB> | A specific file from PSSM_DB/; gives the motif models |
| -o | /path/to/ | Will output <-ID>.tsv; this provides information on motif displacements and scores |
| -fasta | /path/to/genome.fasta | fasta file of the same genome build as <-bed> |
| -log_out | /path/to | Where temporary and final log files will be generated (<-ID>.log) |
| -bsn | numerical | Number of random draws from the empirical distribution estimated by the DB module (default = 10,000) |
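The role of -bsn can be pictured as a bootstrap: draw displacements from the empirical distribution many times and ask how often the simulated MD score reaches the observed one. The Python sketch below assumes that distribution is available as a list of offsets with matching probabilities. The function names, the input format, and `h=150` are hypothetical, and the EVAL module's actual sampling scheme may differ.

```python
import random


def simulate_md(offsets, probabilities, n_hits, h, rng):
    """MD score of n_hits displacements drawn from the empirical distribution."""
    draws = rng.choices(offsets, weights=probabilities, k=n_hits)
    return sum(1 for d in draws if abs(d) <= h) / n_hits


def empirical_pvalue(observed_md, offsets, probabilities, n_hits,
                     h=150, bsn=10000, seed=0):
    """Share of simulated MD scores at least as large as the observed score."""
    rng = random.Random(seed)
    exceed = sum(
        1 for _ in range(bsn)
        if simulate_md(offsets, probabilities, n_hits, h, rng) >= observed_md
    )
    return (exceed + 1) / (bsn + 1)  # add-one correction avoids a p-value of 0


# Toy background: displacements uniform over -1500..1500 bp.
offsets = list(range(-1500, 1501))
weights = [1.0] * len(offsets)
print(empirical_pvalue(0.5, offsets, weights, n_hits=200, bsn=200))
```

With a larger bsn the smallest attainable p-value shrinks, at the cost of proportionally more sampling time.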

#Advanced HPC usage

Those using a compute cluster may find the qsub script below useful!

#PBS -S /bin/bash

#PBS -N gTFIv2


#PBS -l walltime=72:00:00
#PBS -l nodes=10:ppn=64
#PBS -l mem=10gb
hostlist=$( cat $PBS_NODEFILE | sort | uniq | tr '\n' ',' | sed -e 's/,$//' )

# -- OpenMP environment variables --
OMP_NUM_THREADS=64
export OMP_NUM_THREADS
module load gcc_4.9.2
module load mpich_3.1.4

cmd="mpirun -np $PBS_NUM_NODES -hosts ${hostlist}"
src=/path/to/gTFIv2/CPP_src/repo/SE


$cmd $src <module> <parameter flags and values>

#Questions/Comments

Please email me (Joey) at joseph[.]azofeifa[@]colorado[.]edu if you have any questions on usage or bug reports. Or open an Issue.
