This source tree contains code suitable for running read processing and scoring of protein occupancy in IPOD data sets using version 2 of the processing pipeline (version 1 is an outgrowth of the methods described in this paper).
This repository contains code for performing IPOD data analysis. The code is based on the methods described in this paper, but has been significantly modified since prior distributions of the IPOD analysis code were prepared.
The analysis pipeline provided here is reliant on several excellent pieces of open source software,
and in some cases requires specific versions in order to function properly.
To simplify the process of establishing a compatible environment,
we provide a conda environment definition in the accompanying file conda_environment.yml
that will provide nearly all tools necessary for running IPOD-HR;
this can be used as a checklist (and guide to specific required versions) even if not working in a conda environment.
This pipeline requires that your data be configured in specific locations. Additionally, certain file naming conventions must be followed, and a configuration file must be present in order to run the pipeline. For instructions on setting up the data locations and file names, see the File preparation section below.
To aid in the distribution and use of our pipeline, we provide an Apptainer container. For instructions on use of the Apptainer container, see the Apptainer Use section below.
The IPOD processing pipeline requires a strict file hierarchy in order to properly find and process the data files.
The hierarchy should match the following scheme for an expreiment with two replicates for each condition (rep1 and rep2):
top-level-directory
+-- main.conf
+-- conditions.txt
+-- control
| +-- control.conf
| +-- chip
| +-- raw
| +-- control_chip_rep1_R1.fq
| +-- control_chip_rep1_R2.fq
| +-- control_chip_rep2_R1.fq
| +-- control_chip_rep2_R2.fq
| +-- ipod
| +-- raw
| +-- control_ipod_rep1_R1.fq
| +-- control_ipod_rep1_R2.fq
| +-- control_ipod_rep2_R1.fq
| +-- control_ipod_rep2_R2.fq
| +-- inp
| +-- raw
| +-- control_input_rep1_R1.fq
| +-- control_input_rep2_R2.fq
| +-- control_input_rep2_R1.fq
| +-- control_input_rep2_R2.fq
+-- case
| +-- case.conf
| +-- chip
| +-- raw
| +-- case_chip_rep1_R1.fq
| +-- case_chip_rep1_R2.fq
| +-- case_chip_rep2_R1.fq
| +-- case_chip_rep2_R2.fq
| +-- ipod
| +-- raw
| +-- case_ipod_rep1_R1.fq
| +-- case_ipod_rep1_R2.fq
| +-- case_ipod_rep2_R1.fq
| +-- case_ipod_rep2_R2.fq
| +-- inp
| +-- raw
| +-- case_input_rep1_R1.fq
| +-- case_input_rep1_R2.fq
| +-- case_input_rep2_R1.fq
| +-- case_input_rep2_R2.fq
The hierarchy above and the listed files are further explained in the sub-sections below.
At the top level directory,
there must be a configuration file to set options for each step in the pipeline.
In the example tree above, the file is calles main.conf. It is a
TOML document. The user should familiarize themselves with
each of the configurable options in the top-level configuration file.
These options are documented in more detail in the
main_config.md file.
Also in the top level directory,
there must be a text file containing directory/configuration pairs.
In the tree above, this file is named conditions.txt, and is formatted as
<directory1> <directory1>.conf
<directory2> <directory2>.conf
etc...
In the example of the tree above, "control" would be substituted for "<directory1>", and "case" would be substituted for "<directory2>".
All of the specified directories must exist in the top-level directory, and must contain a configuration file corresponding to the config name given. We refer to each of the specified subdirectories as condition directories in the documentation below; typically each one will represent all of the data present on a single biological condition.
Each condition directory must contain a condition-level configuration file
(as noted above as conditions.txt), as well as three subdirectories named
ipod, inp, and chip.
We collectively refer to those subdirectories as the data directories;
they correspond to the IPOD-HR, input, and RNA polymerase ChIP-seq samples
(ipod, inp, and chip directories, respectively, in the tree above),
respectively, under the biological condition being considered.
The condition-level configuration file is a TOML document; formatting details are given in the accompanying condition_config.md file.
Each data directory must contain:
- A directory called
raw. - Within the
rawdirectory, there must exist either the read files referenced in the condition-level configuration file for this sample type, or symlinks referenced in the condition-level configuration file which point to the actual read files which may reside elsewhere.
Assuming that the installattion prerequisites described above have been met,
that your current working directory is the top level directory of your project
(i.e., the directory containing the main configuration file),
and that this source code distribution is present in a directory called {SRC_LOC},
the first stage of the pipeline (from raw read processing through
quantification of IPOD signals and RNAP ChIP-seq signals)
can be run using the python program in {SRC_LOC}/drivers/run_all_driver.py,
specifying as a single command line argument the path to the main.conf
configuration file. For example, assuming the current directory is
top-level-directory from the example tree above, running
python {SRC_LOC}/drivers/run_all_driver.py main.confwould run the steps of the pipeline from processing and aligning reads through calculating protein occupancy and RNAP ChIP-seq enrichment.
Likewise, running the second stage of the pipeline as below
python {SRC_LOC}/drivers/do_peak_and_epod_calls.py main.confwill call peaks in RNAP ChIP-seq and IPOD data, and will call EPODs.
Note that for each of run_all_driver.py and do_peak_and_epod_calls.py, discrete
steps of the pipeline in each script may be skipped. The help documentation
for each script contains a brief description of which steps may be skipped,
and how to skip them. The help documentation can be accessed by invoking either
script followed by -h. For instance:
python {SRC_LOC}/drivers/run_all_driver.py -hWill display the command line options that can be passed to run_all_driver.py and
a brief description of each option.
The IPOD-HR analysis pipeline will produce several intermediate files as well as
a final set of results.
Intermediate files are typically the results of individual pipeline steps
(e.g., running bowtie2).
The "quantification" step writes numerous bedgraph files, each
containing some quantity at each position in the reference genome for
the chosen resolution of your analysis. The final results will be written
to the directory specified in your main configuration file
at bootstrap -> output_path. In the example file names below {OUTPREFIX}
will be substituted by the value you provide in each condition-level
configuration file at general -> out_prefix.
Typical files of interest include:
{OUTPREFIX}_IPOD_vs_inp_lograt_mean.bedgraph-- bedgraph file containing the estimate of the mean log2-transformed IPOD signal enrichment over input DNA.{OUTPREFIX}_IPOD_vs_inp_rzlograt_mean.bedgraph-- bedgraph file containing the estimate of the mean robust z-score for IPOD enrichment over input DNA.{OUTPREFIX}_IPOD_vs_inp_rzlogratlog10p_mean.bedgraph-- same information as OUTPREFIX_IPOD_vs_inp_rzlograt_mean.bedgraph, but the values have been transformed to represent -log10(p-values).- Confidence limits for the above estimates, calculated by jacknife sampling,
at the confidence level specified in
quant -> alphain your main configuration file. - If your experimental design included paired sample types, i.e., your chip, ipod, and input data came from the same sample of a culture, then for each of the above files, a file will be generated for each replicate with the same information.
- All the same files as above, but for your RNAP ChIP-seq data. "IPOD" in the file names will be substituted with CHIP for the RNAP ChIP-seq files.
- If you are subtracting the contribution of RNAP from your IPOD signal, similar files as above, but containing the ChIP-subtracted IPOD signal will be generated. These files follow the same naming convention used for the above files, but will contain "chipsub" in their names.
We typically call peaks at several thresholds. In the case that you have
paired data, peaks are called in each replicate and peaks from individual
replicates are pooled
using the irreproducible discovery rate (IDR). Files are written to the
location specified by peaks -> output_path in the main configuration file.
In the file names below <val> is substituted with the threshold above
which peaks were called for that file. Note that if you did not perform
RNAP signal subtraction, "rzchipsub" will be substituted in the file names
below with "chipsub". Additionally, if you called peaks
in your RNAP ChIP-seq data, "IPOD" will be substututed with "CHIP".
Useful files output by the peak caller are:
best_threshold/{OUTPREFIX}_CHIP_vs_inp_rzlograt_cutoff_<val>_idr_passed.narrowpeak-- The peaks passing the idr threshold set inidr -> threshold, at the peak calling threshold decided to be the "best threshold", of all the peak calling thresholds used. The best threshold is decided based on the Kullback–Leibler (KL) divergence between the scores found within peak regions and within non-peak regions. This is usually the file that should be used as the final set of peaks.best_threshold/kl_div_cutoff_{OUTPREFIX}_CHIP_vs_inp_rzlograt_cutoff_<val>_idr_passed.narrowpeak.png-- This plot shows the KL divergence between signals of peak vs. non-peak regions at each threshold at which peaks were called. This plot is useful for diagnosing whether automated detection of the "best threshold" is preforming as expected. Often, the threshold with the hightest KL divergence is selected as the best threhold, but in the case where there are thresholds for which the upper 95% CL for KL divergence is higher than the maximum observed KL divergence, the highest threshold for which the upper 95% CL contains the max observed KL divergence will be selected as the "best threshold".{OUTPREFIX}_CHIP_rep1_cutoff_<val>_peaks.narrowpeak-- The peaks identified in replicate 1. A file like this will be generated for each replicate.{OUTPREFIX}_CHIP_cutoff_<val>_idr_passed.narrowpeak-- This file contains only the peaks which had an IDR below the threshold denoted byidr -> thresholdin the main configuration file.{OUTPREFIX}_CHIP_rep2_cutoff_<val>_peaks_vs_{OUTPREFIX}_CHIP_rep1_cutoff_<val>_peaks_idr.narrowpeak-- This file is generated by the IDR pipeline, and it contains the merged peaks for the indicated replicates. A file like this will be generated for each pair-wise comparison of replicates. Information in the file includes the IDR calculated for each peak. The information in this file is used to determine which peaks pass the IDR threshold in each pair-wise comparison of replicates, and that information is subsequently used to inform which peaks to write to the*_idr_passed.narrowpeakfile above.{OUTPREFIX}_CHIP_rep2_cutoff_<val>_peaks_vs_{OUTPREFIX}_CHIP_rep1_cutoff_<val>_peaks_idr.narrowpeak.png-- This image contains four plots that can be useful in diagnosing potential issues that arise in IDR calculation, and are helpful in setting an appropriate IDR threshold.{OUTPREFIX}_CHIP_rep1_cutoff_<val>_peaks.narrowpeak-- The peaks identified in replicate 1. A file like this will be generated for each replicate.{OUTPREFIX}_CHIP_cutoff_<val>_idr_passed.narrowpeak-- This file contains only the peaks which had an IDR below the threshold denoted byidr -> thresholdin the main configuration file.{OUTPREFIX}_IPOD_rzchipsub_rep2_cutoff_<val>_peaks_vs_{OUTPREFIX}_IPOD_rzchipsub_rep1_cutoff_<val>_peaks_idr.narrowpeak-- This file is generated by the IDR pipeline, and it contains the merged peaks for the indicated replicates. A file like this will be generated for each pair-wise comparison of replicates. Information in the file includes the IDR calculated for each peak. The information in this file is used to determine which peaks pass the IDR threshold in each pair-wise comparison of replicates, and that information is subsequently used to inform which peaks to write to the*_idr_passed.narrowpeakfile above.{OUTPREFIX}_IPOD_rzchipsub_rep2_cutoff_<val>_peaks_vs_{OUTPREFIX}_IPOD_rzchipsub_rep1_cutoff_<val>_peaks_idr.narrowpeak.png-- This image contains four plots that can be useful in diagnosing potential issues that arise in IDR calculation, and are helpful in setting an appropriate IDR threshold.
Files saved during EPOD calling are written to the location specified by
epods -> output_path in the main configuration file. Note that if you
did not subtract the RNAP contribution to your IPOD signal, "rzchipsub"
will be substituted with some other value. Also, for each file with
"rep1" in its name you should expect to see one additional file for each
replicate's EPODS. The files include:
- Paired data
-
{OUTPREFIX}_IPOD_rzchipsub_rep1_epods_loose.narrowpeak-- Each line represents an EPOD called under the "loose" conditions specified in your main configuration file. -
{OUTPREFIX}_IPOD_rzchipsub_rep1_epods_strict.narrowpeak-- Each line represents an EPOD called under the "strict" conditions specified in your main configuration file. -
{OUTPREFIX}_IPOD_rzchipsub_rep1_median256.bedgraph-- Contains values from convolving a 256-bp-wide rolling median over the signal of interest. These values are used to determine an appropriate threshold above which to call EPODs in the data in the next file below. -
{OUTPREFIX}_IPOD_rzchipsub_rep1_median512.bedgraph-- Values arrived at by convolving a 512-bp-wide rolling median over the signal of interest. -
{OUTPREFIX}_IPOD_rzchipsub_loose_merged_epods.narrowpeak-- This will often be a major file of interest in EPOD calling, as its contents will be used to filter EPODs based on how well-reproduced they are across replicates. This file contains the results of merging EPODS from all replicates.- Column 7 is the weighted mean signal for each merged EPOD. Here, the weighting is proportional to each replicate's EPODs' contributions to the length of the merged EPOD. Note that in a genome browser like IGB, column 7 will be called the "signal" for each EPOD.
- Column 8 is a simple fraction of replicates in which at least some portion of each merged EPOD was represented. Note that in most genome browsers, certainly in IGB, column 8 is called the "p-value". However, the numbers in column 8 are NOT p-values.
- Column 9 -- NOTE: This is usually the column of interest for filtering EPODs based on their reproducibility across replicates. Column 9 is a measure of how well-represented this EPOD was across all replicates. It is essentially the EPOD's mean representation. Consider a case of having two replicates: if an EPOD was called in only a single replicate, column 9 would be 0.5. In other words, for all the genome positions covered by this EPOD, half of the replicates supported the EPODs existence. Now consider a genomic locus in which an EPOD was called in replicate 1 AND 2. However, for replicate 2, although the EPOD was contained enirely within replicate 1's EPOD call, replicate 2's EPOD call was exactly half the the length of the EPOD from replicate 1. In this case, column 9 would be 0.75, because on average across replicates, 75% of the positions in this merged EPOD were called an epod in all replicates. Note that in most genome browsers, certainly in IGB, column 9 is called the "q-value". However, column 9 in this file in NOT a q-value, and MUST NOT be interpreted as a q-value.
-
{OUTPREFIX}_IPOD_rzchipsub_strict_merged_epods.narrowpeak-- This will often be a major file of interest in EPOD calling, as its contents will be used to filter EPODs based on how well-reproduced they are across replicates. See the description for the file above.
-
- Both paired and unpaired data
{OUTPREFIX}_IPOD_rzchipsub_mean_epods_loose.narrowpeak-- The result of EPOD calling on the estimate of the mean signal at the "loose" threshold.{OUTPREFIX}_IPOD_rzchipsub_mean_epods_strict.narrowpeak-- The result of EPOD calling on the estimate of the mean signal at the "strict" threshold.
To provide a test case that illustrates the full set of inputs needed to apply
the IPOD-HR analysis tools in a nontrivial case,
we distribute a bundle referred to as IPODHR_MINIMAL.tbz which contains
the complete directory structure and raw reads needed to process an IPOD-HR experiment;
in this case, all data are taken from WT MG1655 growing in log phase in either
rich or minimal media. The full test data set can be downloaded here.
Users will find examples of all required configuration/input files,
and can also run the complete analysis of the test data set by entering the example
data directory and calling.
python {SRC_PATH}/drivers/run_all_driver.py main.conf
Where {SRC_PATH} indicates the location of the analysis code distributed here.
NOTE: This section requires updates to reflect incorporation of peak and epod calling, and some major updates regarding bootstrapping and spike-in normalization
Included in our sample data distribution are gold standard files for the final
results generated by running the pipeline, obtained using our development environment.
The results obtained from a test run by running make diff,
which will display the magnitude of the differences between the files generated
in your run, and those present in the gold standard files.
Both the RMSD and MAX differences should be zero if the software environment
has been appropriately reproduced.
Note that while the data sets used in this test case are relatively small, they still are designed to provide a non-trivial working example, and will likely take several hours to run on a decently powerful workstation.
We also include in this source code distribution the python programs needed for key postprocessing tasks from the accompamying manuscripts, namely those used for calling IPOD peaks and calling extended protein occupancy domains (EPODs). Documentation for these tools is included in the postprocessing.md file.
Follow the instructions at the Apptainer site to install Apptainer on your Linux system. Once Apptainer is installed, you will also need to add the Sylabs remote endpoint as a repository to Apptainer by running the following:
apptainer remote add SylabsCloud cloud.sycloud.io
apptainer remote use SylabsCloudYou should now be able to pull the IPOD analysis container from
the Sylabs repository as follows, replacing </container/path>
with the location on your system where you would like the container
to reside:
cd </container/path>
apptainer pull ipod.sif library://schroedj/appliances/ipodNow verify the container using apptainer verify </container/path>/ipod.sif.
The container file includes all the necessary components to run our IPOD pipeline. We recommend you also download the example data to test the container.
To test the downloaded sif file on our provided dataset, enter the top-level directory for the example data. Now run the container by running the following at the command line, substituting the absolute path to the directory containing your bowtie2 index for <path/to/ref/direc>:
cd <top-level-directory>
apptainer run \
-B $(pwd):/ipod_data \
-B <path/to/ref/direc>:/ref:ro \
-B /run/shm \
-B <path/to/raw/data/direc>:<path/to/raw/data/direc>:ro \
<container/path/ipod.sif>In the above lines of code, substitute your top-level directory for
<top-level-directory> to move into the top-level directory for your
experiment. $(pwd) represents your current working directory,
so you must first enter your top-level-directory for this to run as expected.
If your host operating system is an older version of Ubuntu, for python's
multiprocessing module to work properly within the container,
-B /run/shm must also be included.
On many systems -B /run/shm can likely be omitted.
-B <path/to/raw/data/direc>:<path/to/raw/data/direc>:ro
is only necessary if your raw directories
within each sample directory are going to be symbolic links to the actual
directory containing your raw data. If you have your actual raw data
in the raw directories, as drawn in the directory tree above, this line
should not be included. Note that when you are symlinking to
<path/to/raw/data/direc>, the pipeline will set up the symlinks for you,
provided you have appropriately set the location to which the symlinks
point in the main configuration file's rawdir option (see here
for the documentation for this option).
At this point, if your code ran properly, your command prompt
should look something like (ipod) [ipod_<version>]$, your top-level-directory
will be located at /ipod_data, and the source code tree is located
at /src_for_distrib. To run the pipeline, run:
python /src_for_distrib/drivers/run_all_driver.py /ipod_data/main.conf > /ipod_data/<date>_run.log 2> /ipod_data/<date>_run.errAfter this first major step of the pipeline has completed running, you may want to call peaks and epods (see here for more information regarding peak and epod calling), which can be done within the Apptainer conatiner if the following manner:
python /src_for_distrib/drivers/do_peak_and_epod_calls.py /ipod_data/main.conf > /ipod_data/<date>_epods.log 2> /ipod_data/<date>_epods.errOnce this step is complete, the results should be checked using the procedure described above.