ShapeME - Shape-based Motif Elicitation

If you simply want to use our web interface to ShapeME, which is recommended for the vast majority of ShapeME users, go here.

ShapeME is a tool for finding informative motifs in DNA structure.

In contrast to conventional motif finders which search for motifs in sequence space, ShapeME instead identifies motifs in local DNA structure space using sequence-based predictions of DNA structure parameters including minor groove width, electrostatic potential, helical twist, propeller twist, and roll. DNA structural parameters are predicted from input sequences using DNAshapeR (Chiu et al. 2016).

ShapeME excels at discovering DNA structural motifs that explain binary data such as bound and unbound regions on a chromosome for a given factor, however, ShapeME is also more flexible than a typical motif finder and can be used to identify motifs predictive of data with more than two categories including RNA-seq expression levels, SELEX-seq enrichment scores, experiments with more than one condition being tested etc. Any score that can be binned into categories on a per-sequence basis can be used for motif inference using ShapeME. This powerful mutual information-based approach is inspired by the FIRE algorithm developed for sequence motif inference (Elemento et al. 2007).

For users who would like to run ShapeME on their own hardware, we recommend they download our Apptainer container using apptainer pull to run ShapeME on a linux operating system. This will require the user to install Apptainer on their system and add the Sylabs remote enpoint to their list of repositories to search for containers (brief instructions are below, but a user should visit read the official apptainer documentation).

The main script most users will run is ShapeME.py. See below for documentation on the required inputs to ShapeME.py.

Preparing input data

The input files required by ShapeME are:

scores file
- A tab-delimited file with one header line and two columns
- Column 1: "name" - the name of each sequence found in the fasta file, in the same order as sequences occur in fasta file.
- Column 2: "score" - the score associated with each sequence in the fasta files. For identification of motifs that inform peaks vs non-peaks, the score column should contain 0 or 1, where 0 would indicate a non-peak sequence, and 1 would indicate a peak seqeunce. The scores can also be categorical or continuous. If using continuous data, we recommend the user convert their scores to robust z-scores using a tool such as bgtools.
- For example scores files, see the txt files in the examples/binary_example, examples/categorical_example, and examples/continuous_example, directories.
sequence fasta file
- For example sequence fasta files, see the *.fa files in the examples/binary_example, examples/categorical_example, and examples/continuous_example, directories.

The sequence names in the fasta files and in the score file must be in the same order.

Calculating local shapes from sequences

Local shape files will be calculated for each sequence in your fasta file when the ShapeME.py script is run.

ShapeME.py will split your input data into the desired number of folds for k-fold cross validation (the number of folds being defined at the command line). Five shape files will be created for each set of training and testing data generated for each fold. The files will have the following names, where "*" will be replaced with a random sequence of characters generated by the tempfile python module, followed by "_main" or "_fold_k", where k is the zero-indexed fold number for crossfold validation.

*.fa.EP - minor groove electrostatic potential
*.fa.HelT - helical twist
*.fa.MGW - minor groove width
*.fa.ProT - propeller twist
*.fa.Roll - roll

Using continuous scores

We recommend that if the user is using continuous data that they first convert their scores to robust z-scores using a tool such as bgtools, then manually create the required input file with paired sequence names and scores.

Then, when running ShapeME.py, set --continuous <n> at the command line, where <n> must be replaced with the number of equally populated bins to quantize the continuous scores into.

Running ShapeME

ShapeME can be run to detect only shape motifs, only sequence motifs (in this case ShapeME is basically a wrapper for STREME with extra motif pruning steps to avoid reporting motifs with overlapping information), or to incorporate shape and sequence motifs into a single model.

To run ShapeME, we recommend using our web interface. However, if you prefer to use the ShapeME CLI, you can pull an Apptainer container using these brief instructions.

We distribute ShapeME as an Apptainer container, which can be run on any computer with a Linux environment that has Apptainer installed.

For more information on how to set up apptainer for your system please follow the apptainer documentation here.

Once you have set up apptainer and added the SylabsCloud repository to your remote endpoints by running apptainer remote add SylabsCloud cloud.sycloud.io, you will need to direct apptainer to use the SylabsCloud repository by running apptainer remote use SylabsCloud. You should now be able to pull the latest version of our container by running apptainer pull shapeme.sif library://schroedj/appliances/shapeme:latest. Verify the container by running apptainer verify shapeme.sif.

In all instructions below, you should substitute /path/to with the actual path to the location with the ShapeME Apptainer container.

Inference on provided example data

For the following examples, we recommend the user download the example data provided with this repository.

Binary input values

Only shape motifs

From within the examples/binary_example directory, run the following, updating the number of processors (nprocs) to something that is suitable to the system on which you are running ShapeME:

nprocs=8
data_dir=$(pwd)

apptainer exec -B ${data_dir} \
    /path/to/shapeme.sif \
    python /src/python3/ShapeME.py infer \
        --data_dir ${data_dir} \
        --seq_fasta seqs.fa \
        --score_file seqs.txt \
        --crossval_folds 5 \
        --nprocs ${nprocs} \
        > ${data_dir}/shapeme.log \
        2> ${data_dir}/shapeme.err

Only sequence motifs

From within the examples/binary_example directory, run the code below, updating the value of the number of processors (nprocs) to something that is suitable to the system on which you are running ShapeME.

In the below code, --seq_motif_positive_cats 1 is set. The --seq_motif_positive_cats argument sets which categories in the --score_file will be assigned to the "positive set" for STREME. We've set it to 1 here because this example is simple binary data, with 0 as the negative set and 1 as the positive set. For categorical data, we would set the positive set with a comma-separated list of the categories to be considered as the "positive" set by STREME during sequence motif finding. For instance, if you are using scores with 10 bins (values of 0 through 9) and if want to identify motifs in bins 8 and 9, set --seq_motif_positive_cats 8,9 at the command line.

nprocs=8
data_dir=$(pwd)

apptainer exec -B ${data_dir} \
    /path/to/shapeme.sif \
    python /src/python3/ShapeME.py infer \
        --find_seq_motifs \
        --no_shape_motifs \
        --data_dir ${data_dir} \
        --seq_fasta seqs.fa \
        --seq_motif_positive_cats 1 \
        --score_file seqs.txt \
        --crossval_folds 5 \
        --nprocs ${nprocs} \
        > ${data_dir}/shapeme.log \
        2> ${data_dir}/shapeme.err

Infer both shape and sequence motifs

From within the examples/binary_example directory, run the following, noting that the command below is identical to that in the above section for detecting only shape motifs, except that this command does not apply the --no_shape_motifs flag.

nprocs=8
data_dir=$(pwd)

apptainer exec -B ${data_dir} \
    /path/to/shapeme.sif \
    python /src/python3/ShapeME.py infer \
        --find_seq_motifs \
        --data_dir ${data_dir} \
        --seq_fasta seqs.fa \
        --seq_motif_positive_cats 1 \
        --score_file seqs.txt \
        --crossval_folds 5 \
        --nprocs ${nprocs} \
        > ${data_dir}/shapeme.log \
        2> ${data_dir}/shapeme.err

Categorical input values

Only shape motifs

From within the examples/categorical_example directory, run the following, updating the value of the number of processors (nprocs) to something that is suitable to the system on which you are running ShapeME:

nprocs=8
data_dir=$(pwd)

apptainer exec -B ${data_dir} \
    /path/to/shapeme.sif \
    python /src/python3/ShapeME.py infer \
        --data_dir ${data_dir} \
        --seq_fasta seqs.fa \
        --score_file seqs.txt \
        --crossval_folds 5 \
        --nprocs ${nprocs} \
        > ${data_dir}/shapeme.log \
        2> ${data_dir}/shapeme.err

Only sequence motifs

From within the examples/categorical_example directory, run the following, updating the value of the number of processors (nprocs) to something that is suitable to the system on which you are running ShapeME:

nprocs=8
data_dir=$(pwd)

apptainer exec -B ${data_dir} \
    /path/to/shapeme.sif \
    python /src/python3/ShapeME.py infer \
        --find_seq_motifs \
        --no_shape_motifs \
        --data_dir ${data_dir} \
        --seq_fasta seqs.fa \
        --seq_motif_positive_cats 4 \
        --score_file seqs.txt \
        --crossval_folds 5 \
        --nprocs ${nprocs} \
        > ${data_dir}/shapeme.log \
        2> ${data_dir}/shapeme.err

Infer both shape and sequence motifs

From within the examples/categorical_example directory, run the following, noting that the command below is identical to that in the above section for detecting only shape motifs, except that this command does not apply the --no_shape_motifs flag.

nprocs=8
data_dir=$(pwd)

apptainer exec -B ${data_dir} \
    /path/to/shapeme.sif \
    python /src/python3/ShapeME.py infer \
        --find_seq_motifs \
        --data_dir ${data_dir} \
        --seq_fasta seqs.fa \
        --seq_motif_positive_cats 4 \
        --score_file seqs.txt \
        --crossval_folds 5 \
        --nprocs ${nprocs} \
        > ${data_dir}/shapeme.log \
        2> ${data_dir}/shapeme.err

Continuous input values

Only shape motifs

From within the examples/continuous_example directory, run the following, updating the value of the number of processors (nprocs) to something that is suitable to the system on which you are running ShapeME:

nprocs=8
data_dir=$(pwd)

apptainer exec -B ${data_dir} \
    /path/to/shapeme.sif \
    python /src/python3/ShapeME.py infer \
        --data_dir ${data_dir} \
        --seq_fasta seqs.fa \
        --score_file seqs.txt \
        --continuous 10 \
        --crossval_folds 5 \
        --nprocs ${nprocs} \
        > ${data_dir}/shapeme.log \
        2> ${data_dir}/shapeme.err

Only sequence motifs

From within the examples/continuous_example directory, run the following, updating the value of the number of processors (nprocs) to something that is suitable to the system on which you are running ShapeME:

nprocs=8
data_dir=$(pwd)

apptainer exec -B ${data_dir} \
    /path/to/shapeme.sif \
    python /src/python3/ShapeME.py infer \
        --find_seq_motifs \
        --no_shape_motifs \
        --data_dir ${data_dir} \
        --seq_fasta seqs.fa \
        --seq_motif_positive_cats 8,9 \
        --score_file seqs.txt \
        --continuous 10 \
        --crossval_folds 5 \
        --nprocs ${nprocs} \
        > ${data_dir}/shapeme.log \
        2> ${data_dir}/shapeme.err

Infer both shape and sequence motifs

From within the examples/continuous_example directory, run the following, noting that the command below is identical to that in the above section for detecting only shape motifs, except that this command does not apply the --no_shape_motifs flag.

nprocs=8
data_dir=$(pwd)

apptainer exec -B ${data_dir} \
    /path/to/shapeme.sif \
    python /src/python3/ShapeME.py infer \
        --find_seq_motifs \
        --data_dir ${data_dir} \
        --seq_fasta seqs.fa \
        --seq_motif_positive_cats 8,9 \
        --score_file seqs.txt \
        --continuous 10 \
        --crossval_folds 5 \
        --nprocs ${nprocs} \
        > ${data_dir}/shapeme.log \
        2> ${data_dir}/shapeme.err

Using your own data

Enter the directory containing your sequence files, shape files, and input score files. The following required arguments must be present, but read below further for a comprehensive list of arguments that can be accepted by ShapeME:

<data_dir>
- The location containing your input fasta file and scores file.
<seq_fasta>
- The name of the file containing the fasta records. Use only the basename. The file must be in --data_dir.
<score_file>
- The file containing input scores, which could be binary scores, categorical scores, or continuous values. If they are continuous, you MUST set the --continuous flag at the command line to set the number of bins into which to discretize the input scores. For instance, --continuous 10 would create 10 approximately evenly populated bins into which to allocate the input data. Use only the basename. The file must be in --data_dir.
<crossval_folds>
- We usually set this to 5, indicating that in addition to a round of motif inference on the entire dataset, 5-fold crossvalidation will be performed to assess the reproducibility of motif inference on your data.

In addition to the required arguments above, one additional argument to be aware of is --nprocs:

<nprocs>
- Sets the number of parallel processes to use for shape motif inference.
- The value you use will depend on the available resources of your system.
- We have found that this algorithm scales well to up to 32 cores, but beyond that, returns diminish quickly.

ShapeME.py arguments

In addition to the arguments demonstrated in the examples, there are many other arguments that can be set to adjust ShapeME behavior. Below is the output of running python ShapeME.py infer --help, which give the exhaustive list of arguments that can be passed to ShapeME.py infer, along with a brief description of each argument.

usage: ShapeME.py infer [-h] --data_dir DATA_DIR [--skip_inference]
                        [--skip_evaluation] [--out_prefix OUT_PREFIX]
                        [--force] --crossval_folds CROSSVAL_FOLDS --score_file
                        SCORE_FILE [--kmer KMER] [--max_count MAX_COUNT]
                        [--continuous CONTINUOUS]
                        [--threshold_sd THRESHOLD_SD]
                        [--init_threshold_seed_num INIT_THRESHOLD_SEED_NUM]
                        [--init_threshold_recs_per_seed INIT_THRESHOLD_RECS_PER_SEED]
                        [--init_threshold_windows_per_record INIT_THRESHOLD_WINDOWS_PER_RECORD]
                        [--max_batch_no_new_seed MAX_BATCH_NO_NEW_SEED]
                        [--nprocs NPROCS]
                        [--threshold_constraints THRESHOLD_CONSTRAINTS THRESHOLD_CONSTRAINTS]
                        [--shape_constraints SHAPE_CONSTRAINTS SHAPE_CONSTRAINTS]
                        [--weights_constraints WEIGHTS_CONSTRAINTS WEIGHTS_CONSTRAINTS]
                        [--temperature TEMPERATURE] [--t_adj T_ADJ]
                        [--stepsize STEPSIZE] [--opt_niter OPT_NITER]
                        [--alpha ALPHA] [--batch_size BATCH_SIZE]
                        [--find_seq_motifs] [--no_shape_motifs]
                        [--seq_fasta SEQ_FASTA]
                        [--seq_motif_positive_cats SEQ_MOTIF_POSITIVE_CATS]
                        [--streme_thresh STREME_THRESH]
                        [--seq_meme_file SEQ_MEME_FILE] [--write_all_files]
                        [--exhaustive] [--max_n MAX_N] [--log_level LOG_LEVEL]

options:
  -h, --help            show this help message and exit
  --data_dir DATA_DIR   Directory from which input files will be read.
  --skip_inference      Include this flag at the command line to skip motif
                        inference. This is useful if you've already run
                        inference on all folds.
  --skip_evaluation     Include this flag at the command line to skip
                        evaluation of motifs.
  --out_prefix OUT_PREFIX
                        Optional. Sets the prefix to place on output
                        directories. If None (the default), infers the
                        appropriate prefix based on whether you set either or
                        both of --find_seq_motifs or --no_shape_motifs
  --force               Forces each fold to run, clobbering any extant output
                        directories.
  --crossval_folds CROSSVAL_FOLDS
                        Number of folds into which to split data for k-fold
                        cross-validation
  --score_file SCORE_FILE
                        input text file with names and scores for training
                        data
  --kmer KMER           kmer size to search for shape motifs. Default=15
  --max_count MAX_COUNT
                        Maximum number of times a motif can match each of the
                        forward and reverse strands in a reference. Default: 1
  --continuous CONTINUOUS
                        number of bins to discretize continuous input data
                        with
  --threshold_sd THRESHOLD_SD
                        std deviations below mean for seed finding. Only
                        matters for greedy search. Default=2.000000
  --init_threshold_seed_num INIT_THRESHOLD_SEED_NUM
                        Number of randomly selected seeds to compare to
                        records in the database during initial threshold
                        setting. Default=500
  --init_threshold_recs_per_seed INIT_THRESHOLD_RECS_PER_SEED
                        Number of randomly selected records to compare to each
                        seed during initial threshold setting. Default=20
  --init_threshold_windows_per_record INIT_THRESHOLD_WINDOWS_PER_RECORD
                        Number of randomly selected windows within a given
                        record to compare to each seed during initial
                        threshold setting. Default=2
  --max_batch_no_new_seed MAX_BATCH_NO_NEW_SEED
                        Sets the number of batches of seed evaluation with no
                        new motifs added to the set of motifs to be optimized
                        prior to truncating the initial search for motifs.
  --nprocs NPROCS       number of processors. Default: 1
  --threshold_constraints THRESHOLD_CONSTRAINTS THRESHOLD_CONSTRAINTS
                        Sets the upper and lower limits on the match threshold
                        during optimization. Defaults to 0 for the lower limit
                        and 10 for the upper limit.
  --shape_constraints SHAPE_CONSTRAINTS SHAPE_CONSTRAINTS
                        Sets the upper and lower limits on the shapes'
                        z-scores during optimization. Defaults to -4 for the
                        lower limit and 4 for the upper limit.
  --weights_constraints WEIGHTS_CONSTRAINTS WEIGHTS_CONSTRAINTS
                        Sets the upper and lower limits on the pre-
                        transformed, pre-normalized weights during
                        optimization. Defaults to -4 for the lower limit and 4
                        for the upper limit.
  --temperature TEMPERATURE
                        Sets the temperature argument for simulated annealing.
                        Default: 0.400000
  --t_adj T_ADJ         Fraction by which temperature decreases each iteration
                        of simulated annealing. Default: 0.001000
  --stepsize STEPSIZE   Sets the stepsize argument simulated annealing. This
                        defines how far a given value can be modified for
                        iteration i from its value at iteration i-1. A higher
                        value will allow farther hops. Default: 0.250000
  --opt_niter OPT_NITER
                        Sets the number of simulated annealing iterations to
                        undergo during optimization. Default: 10000.
  --alpha ALPHA         Lower limit on transformed weight values prior to
                        normalization to sum to 1. Default: 0.000000
  --batch_size BATCH_SIZE
                        Number of records to process seeds from at a time. Set
                        lower to avoid out-of-memory errors. Default: 2000
  --find_seq_motifs     Add this flag to call sequence motifs using streme in
                        addition to calling shape motifs.
  --no_shape_motifs     Add this flag to turn off shape motif inference. This
                        is useful if you basically want to use this script as
                        a wrapper for streme to just find sequence motifs.
  --seq_fasta SEQ_FASTA
                        Name of fasta file (located within data_dir, do not
                        include the directory, just the file name) containing
                        sequences in which to search for motifs
  --seq_motif_positive_cats SEQ_MOTIF_POSITIVE_CATS
                        Denotes which categories in `--infile` (or after
                        quantization for a continous signal in the number of
                        bins denoted by the `--continuous` argument) to use as
                        the positive set for sequence motif calling using
                        streme. Example: "4" would use category 4 as the
                        positive set, whereas "3,4" would use categories 3 and
                        4 as the positive set.
  --streme_thresh STREME_THRESH
                        Threshold for including motifs identified by streme.
                        Default: 0.050000
  --seq_meme_file SEQ_MEME_FILE
                        Name of meme-formatted file (file must be located in
                        data_dir) to be used for searching for known sequence
                        motifs of interest in seq_fasta
  --write_all_files     Add this flag to write all motif meme files,
                        regardless of whether the model with shape motifs,
                        sequence motifs, or both types of motifs was most
                        performant.
  --exhaustive          Add this flag to perform and exhaustive initial search
                        for seeds. This can take a very long time for datasets
                        with more than a few-thousand binding sites. Setting
                        this option will override the --max_rounds_no_new_seed
                        option.
  --max_n MAX_N         Sets the maximum number of fasta records to use for
                        motif inference. This is useful when runs are taking
                        prohibitively long.
  --log_level LOG_LEVEL
                        Sets log level for logging module. Valid values are
                        DEBUG, INFO, WARNING, ERROR, CRITICAL.

Expected output

Several files are generated during a given ShapeME run.

The first file of interest to most users will be {data_dir}/shape_main_output/report.html, where {data_dir} is the directory indicated by the --data_dir argument used in the ShapeME.py infer call. Opening the report in a web browser will provide three summaries of the ShapeME run:

A summary of the motifs identified, including each motif's logo, name, adjusted mutual information, z-score, robustness, and e-value. In the logo representation, opacity of a shape symbol denotes the weight for that shape at that position in the motif. A transparent symbol denotes the weight is essentially zero, and that the shape's value is unimportant at that position.
A heatmap with each motif's log2(fold-enrichment) in each score category is present in the report. For shape motifs, a row is present for each informative "hit" category for the given motif. For example, if --max_count 2 was set, the possible hit categories for a motif are [0,1], [0,2], [1,1], [1,2], and [2,2]. If each hit category was informative, each will have a row in the heatmap to indicate for that motif, how enriched that hit category was in each score category (score categories are columns in the heatmap). For sequence motifs a single row will be present in the heatmap, as the hits categories for sequence motifs can only be 0 or 1.

If an error was encountered during the ShapeME run, then the report will simply be a web page containing the text of the error message.

The second file of interest is {data_dir}/shape_main_output/final_motifs.dsm, which contains the numeric values for each motif, including each shape's z-score at each position in the motif, and the weight applied to each shape at each position in the motif. For users familiar with the meme file format, the dsm file will look familiar, as its format was inspired by the meme format.

Building the Apptainer container locally

Note that if apptainer is set up on you system, you should not need to build your own container. It is our intent that users will submit jobs via the web interface to ShapeME, and should a user desire to use ShapeME at the command line, that they use the instructions above to pull the container directly from the Sylabs remote endpoint.

If you should find you need to build your own Apptainer container locally rather than use our pre-built container hosted at cloud.sylabs.io, take the following steps, substituting <src_direc> and <path/to/container.sif> with the location in which you would like the ShapeME source code and the absolute path to the container after it is built, respectively:

# enter the directory you wish to contain the ShapeME source code
cd <src_direc>
git clone https://github.com/freddolino-lab/ShapeME.git
cd ShapeME/singularity
apptainer build <path/to/container.sif> shapeme.def

Note that you must have Apptainer installed on your system to build or run the container.

Name		Name	Last commit message	Last commit date
Latest commit History 529 Commits
docs		docs
examples		examples
singularity		singularity
src		src
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ShapeME - Shape-based Motif Elicitation

Preparing input data

Calculating local shapes from sequences

Using continuous scores

Running ShapeME

Inference on provided example data

Binary input values

Only shape motifs

Only sequence motifs

Infer both shape and sequence motifs

Categorical input values

Only shape motifs

Only sequence motifs

Infer both shape and sequence motifs

Continuous input values

Only shape motifs

Only sequence motifs

Infer both shape and sequence motifs

Using your own data

ShapeME.py arguments

Expected output

Building the Apptainer container locally

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 3

Uh oh!

Languages

License

freddolino-lab/ShapeME

Folders and files

Latest commit

History

Repository files navigation

ShapeME - Shape-based Motif Elicitation

Preparing input data

Calculating local shapes from sequences

Using continuous scores

Running ShapeME

Inference on provided example data

Binary input values

Only shape motifs

Only sequence motifs

Infer both shape and sequence motifs

Categorical input values

Only shape motifs

Only sequence motifs

Infer both shape and sequence motifs

Continuous input values

Only shape motifs

Only sequence motifs

Infer both shape and sequence motifs

Using your own data

ShapeME.py arguments

Expected output

Building the Apptainer container locally

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 3

Uh oh!

Languages

Packages