Analysis and code of HiC data for 16p samples
Table of Contents
- File Structure
- Methods
- Reproducing Results
File structure of all input data, reference files and results
All filepaths below are relative to the project root on our server: /data/talkowski/Samples/16p_HiC/
Files for reproducibility
$ tree -L 2 reference.files/ distiller-nf/
├── distiller-nf/
└── reference.files/
├── conda.envs
│ ├── cooltools.env.yml
│ ├── distiller.env.yml
│ ├── HiCRep.env.yml
│ ├── multiqc.env.yml
│ ├── qc3C.env.yml
│ └── TADLib.env.yml
├── packages
│ ├── functionsdchic_1.0.tar.gz
│ └── hashmap_0.2.2.tar.gz
└── README.md
Notice that every file name contains a SampleID specifying which library the results belong to.
Some files contain a pair of SampleIDs since they compare 2 samples, e.g. HiCRep.
All SampleIDs follow the same format Project.CellType.Genotype.BioRepID.TechRepID e.g. 16p.NSC.DEL.A3.TR1
- 16p: project this sample is a part of
- NSC: cell type {NSC, iN}
- DEL: genotype of the sample for the region/gene of interest {WT, DEL, DUP}
- A3: ID string specifying which biological replicate the sample is
- TR1: ID string specifying which technical replicate the sample is
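The fields can be recovered with plain shell string splitting; a minimal sketch (the snippet below is illustrative, not part of the repo scripts):

```shell
# Illustrative only: split a SampleID into its five dot-delimited fields.
sample_id="16p.NSC.DEL.A3.TR1"
IFS='.' read -r project celltype genotype biorep techrep <<< "${sample_id}"
echo "project=${project} celltype=${celltype} genotype=${genotype} biorep=${biorep} techrep=${techrep}"
# -> project=16p celltype=NSC genotype=DEL biorep=A3 techrep=TR1
```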
$ tree /data/talkowski/Samples/16p_HiC/
./
├── reference.files/
│ └── genome.reference
│ ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.chrom.sizes
│ ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
│ ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.amb
│ ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.ann
│ ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.bwt
│ ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai
│ ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.pac
│ └── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.sa
├── HiC.16p.sample.metadata.tsv # metadata for all HiC samples
├── fastq/ # Raw reads for HiC samples
│ ├── 22LCC2LT4_3_2148261314_16pDELA3NSCHiC_S1_L003_R1_001.fastq.gz
│ ├── 22LCC2LT4_3_2148261314_16pDELA3NSCHiC_S1_L003_R2_001.fastq.gz
│ └── *.fastq.gz
└── sample.configs/ # Config files for distiller
├── template.distiller.yml # defining distiller-nf params
├── 16p.NSC.DEL.A3.TR1.distiller.yml # Template with specific fastq files
└── *.distiller.yml
Annotations of genomic features, used for association analyses
$ tree -L 1 /data/talkowski/Samples/16p_HiC/reference.files/
reference.files/
├── gencode.v38.annotation.gtf.gz # Gene position + info annotations
├── hg38_gene_deserts.tsv # locations of gene deserts
├── hg38.segdups.bed.gz # locations of Segmental Duplications
├── hg38.segdups.bed.gz.tbi
├── CTCF.annotations.tsv # Annotated CTCF sites
├── ENCODE.v4.cCRE.anontations.tsv # ENCODE cCREs annotations
├── gene.constraints.CNVR.tsv # Constraint metrics for each gene
└── genome.tracks/ # Several metrics computed bin-wise across the genome
├── ???
└── track.type_genecov/
The files used as input or generated by distiller-nf are listed below
NOTE: ... indicates that these files exist for all samples/pairs, truncated for brevity
$ tree -L 1 /data/talkowski/Samples/16p_HiC/results/ -P "*_library|mapped_*|*qc"
└── results/
├── sample.QC/
│ ├── multiqc.reports/
│ │ ├── fastp.multiqc.html
│ │ ├── fastqc.multiqc.html
│ │ ├── pairtools.multiqc.html
│ │ └── qc3C.multiqc.html
├── coolers_library/
│ ├── 16p.NSC.DEL.A3.TR1/
│ │ ├── 16p.NSC.DEL.A3.TR1.hg38.mapq_30.1000.cool
│ │ ├── 16p.NSC.DEL.A3.TR1.hg38.mapq_30.1000.mcool
│ │ ├── 16p.NSC.DEL.A3.TR1.hg38.no_filter.1000.cool
│ │ └── 16p.NSC.DEL.A3.TR1.hg38.no_filter.1000.mcool
│ └── .../
├── fastqc/
│ ├── 16p.NSC.DEL.A3.TR1/
│ │ ├── 16p.NSC.DEL.A3.TR1.lane1.0.2_fastqc.html
│ │ ├── 16p.NSC.DEL.A3.TR1.lane1.0.2_fastqc.zip
│ │ ├── 16p.NSC.DEL.A3.TR1.lane1.0.1_fastqc.html
│ │ └── 16p.NSC.DEL.A3.TR1.lane1.0.1_fastqc.zip
│ └── .../
├── mapped_parsed_sorted_chunks
│ ├── 16p.NSC.DEL.A3.TR1/
│ │ ├── 16p.NSC.DEL.A3.TR1.lane1.hg38.0.fastp.json
│ │ ├── 16p.NSC.DEL.A3.TR1.lane1.hg38.0.fastp.html
│ │ ├── 16p.NSC.DEL.A3.TR1.lane1.hg38.0.pairsam.gz
│ │ └── 16p.NSC.DEL.A3.TR1.lane1.hg38.0.bam
│ └── .../
└── pairs_library
├── 16p.NSC.DEL.A3.TR1/
│ ├── 16p.NSC.DEL.A3.TR1.hg38.dedup.stats
│ ├── 16p.NSC.DEL.A3.TR1.hg38.dups.bam
│ ├── 16p.NSC.DEL.A3.TR1.hg38.dups.pairs.gz
│ ├── 16p.NSC.DEL.A3.TR1.hg38.nodups.bam
│ ├── 16p.NSC.DEL.A3.TR1.hg38.nodups.pairs.gz
│ ├── 16p.NSC.DEL.A3.TR1.hg38.unmapped.bam
│ ├── 16p.NSC.DEL.A3.TR1.hg38.unmapped.pairs.gz
│ └── 16p.NSC.DEL.A3.TR1.hg38.nodups.pairs.gz.px2
└── .../
List of collated results across annotation types + analyses
$ tree -L 1 /data/talkowski/Samples/16p_HiC/results/ -I "*_library|mapped_*|*qc"
results/
├── RNASeq
│ ├── all.DESeq2.results.tsv
│ ├── all.expression.data.tsv
│ ├── all.TRADE.results.tsv
│ ├── DESeq2
│ └── expression
├── sample.QC
│ ├── coverage
│ ├── expected.coverage
│ ├── minimum.viable.resolutions.tsv
│ ├── multiqc.reports
│ └── resolution.coverage.summaries.tsv
├── hicrep
│ ├── all.hicrep.cmds.txt
│ ├── all.hicrep.scores.tsv
│ └── results
├── weiner.replication
│ └── plots
├── TADs
│ ├── results_TADs
│ ├── all.ConsensusTAD.TADs.tsv
│ ├── all.hiTAD.TADs.tsv
│ ├── all.all.TADs.tsv
│ ├── all.all.TAD.MoCs.tsv
│ ├── all.TADCompare.results.tsv
│ ├── all.TADCompare.n.results.tsv
│ └── results_TADCompare
├── loops
│ ├── all.cooltools.loops.tsv
│ ├── filtered.cooltools.loops.tsv
│ ├── all.cooltools.IDR2D.results.tsv
│ ├── filtered.cooltools.IDR2D.results.tsv
│ ├── results_IDR2D
│ └── results_loops
├── multiHiCCompare
│ ├── all.multiHiCCompare.n.results.tsv
│ ├── all.multiHiCCompare.results.tsv
│ └── results
├── compartments
│ ├── all.cooltools.compartments.tsv
│ └── results_compartments
└── gghic.plots
└── plots
List all rmarkdown notebooks with sets of results
$ tree -L 1 /data/talkowski/Samples/16p_HiC/notebooks/ -P '*.html'
/data/talkowski/Samples/16p_HiC/notebooks/
├── matrix.QC.html
├── matrix.coverage.html
├── hicrep.html
├── weiner.replication.html
├── TADs.html
├── TADCompare.html
├── loops.html
├── loop.reproducibility.html
├── loop.integration.html
├── multiHiCCompare.html
└── HiC.heatmaps.html
# List generated figure PDFs
$ find /data/talkowski/Samples/16p_HiC/results/weiner.replication/plots/ -type f -name '*.pdf'
$ find /data/talkowski/Samples/16p_HiC/results/gghic.plots/plots/ -type f -name '*.pdf'
Each sample has a .yml file (in ./sample.configs) specifying the params distiller-nf runs the sample with. We separate each sample into its own file so we can run them in parallel, but the only difference between files is the input fastq files; all pipeline parameters are the same.
We use 64 cores total, with 128 maxCPUs set in the nextflow config.
This fully processes a sample (.fastq -> .mcool) with ~400M reads in ~9h.
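The per-sample config generation can be pictured as simple template substitution; a sketch under the assumption of a `__SAMPLE_ID__` placeholder (the real template variables are defined in locations.R / make.distiller.configs.R):

```shell
# Sketch only: stamp out one distiller config per sample from a shared
# template. The __SAMPLE_ID__ token is a hypothetical placeholder.
workdir=$(mktemp -d)
template='input:
  library: __SAMPLE_ID__'
for sid in 16p.NSC.DEL.A3.TR1 16p.NSC.WT.A1.TR1; do
  printf '%s\n' "${template}" | sed "s/__SAMPLE_ID__/${sid}/g" \
    > "${workdir}/${sid}.distiller.yml"
done
ls "${workdir}"
```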
We use the tool qc3C (see its GitHub repo) in bam mode to profile quality metrics for our HiC samples.
distiller-nf outputs multiple QC reports/files per sample that can each be aggregated into a single multiqc report (see the multiqc docs). In the two previous steps we have also generated stats files which can be aggregated via multiqc into their own reports. Ultimately we generate 4 multiqc reports:
- fastqc: quality statistics for reads
- fastp: trimming + MAPQ statistics for filtered reads that were aligned
- qc3C: quality statistics from sub-sampled aligned reads
- pairtools stats: summary statistics of processed HiC pairs
Three things we want to check to QC matrix samples:
- pair frequency by distance
- cis/trans pair frequency
- minimum resolution as defined in Rao et al. 2014

Metrics 1 and 2 are calculated by distiller-nf and found in the *.dedup.stats files.
For metric 3 and subsequent plots we need to calculate the per-bin coverage using cooltools coverage.
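Rao et al. 2014 define the matrix resolution as the smallest bin size at which at least 80% of bins have at least 1,000 contacts; on a per-bin coverage list that check is a one-liner (toy counts below):

```shell
# Sketch of the Rao et al. 2014 criterion on toy per-bin contact counts:
# a resolution is viable if >= 80% of bins have >= 1000 contacts.
printf '%s\n' 1500 2000 900 1200 3000 | awk '
  { n++; if ($1 >= 1000) ok++ }
  END { frac = ok / n; verdict = (frac >= 0.8) ? "viable" : "not.viable"; print verdict, frac }'
# -> viable 0.8
```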
For some subsequent steps we will analyze matrices formed by merging all biological/technical replicates for a given Project+Genotype+CellType (e.g. 16p+WT+NSC). Merging here means summing the total number of contacts over all samples for each bin-bin pair (matrix entry) and is handled by cooler merge. We create merged matrices from the MAPQ-filtered coolers for each condition, which produces the following files
results
└── coolers_library
├── 16p.NSC.WT.Merged.Merged/
│ ├── 16p.NSC.WT.Merged.Merged.hg38.mapq_30.1000.cool
│ └── 16p.NSC.WT.Merged.Merged.hg38.mapq_30.1000.mcool
├── 16p.NSC.DEL.Merged.Merged/
│ └── ...
└── ...
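The grouping logic can be sketched as follows: derive each condition from the first three SampleID fields, then emit one `cooler merge` command per condition (commands are only echoed here, and the filenames are illustrative):

```shell
# Sketch: group per-sample coolers by Project.CellType.Genotype and print
# one `cooler merge OUT IN1 IN2 ...` command per condition (not executed).
samples='16p.NSC.WT.A1.TR1 16p.NSC.WT.A2.TR1 16p.NSC.DEL.A3.TR1'
for cond in $(for s in ${samples}; do echo "${s}" | cut -d'.' -f1-3; done | sort -u); do
  ins=$(for s in ${samples}; do
          case "${s}" in ${cond}.*) printf ' %s.hg38.mapq_30.1000.cool' "${s}";; esac
        done)
  echo "cooler merge ${cond}.Merged.Merged.hg38.mapq_30.1000.cool${ins}"
done
```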
We use HiCRep to calculate the "reproducibility score" for all pairs of sample matrices, under several parameter combinations. The command below runs HiCRep and produces 1 file per sample pair + parameter combination; each file contains scores for each chromosome separately (chr{1..22,X,Y}).
After this, a .Rmd notebook collates these files into a single neat dataframe that is used for plotting.
To generate the TAD annotations we use 3 different programs:
For HiTAD we only generate single-level TADs (not hierarchical) using default parameters. This produces a set of bin-pairs, each pair marking the start and end of the predicted TAD.
Both of these tools also output the Diamond Insulation (DI) score calculated per genomic bin. This can be used to compare the relative insulation of regions.
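As a toy illustration of the idea (not the exact cooltools/TADLib implementation): for bin i and window w, average the contacts linking the w bins upstream of i to the w bins downstream; a boundary shows up as a low score.

```shell
# Toy diamond insulation: for each interior bin i, average M[a,b] for
# a in (i-w .. i-1) and b in (i+1 .. i+w). Idea only, not cooltools code.
printf '5 4 1 0 0\n4 5 2 1 0\n1 2 5 4 1\n0 1 4 5 4\n0 0 1 4 5\n' | awk -v w=2 '
  { for (j = 1; j <= NF; j++) M[NR, j] = $j; n = NR }
  END {
    for (i = 1 + w; i <= n - w; i++) {
      s = 0; c = 0
      for (a = i - w; a < i; a++)
        for (b = i + 1; b <= i + w; b++) { s += M[a, b]; c++ }
      printf "bin %d insulation %.2f\n", i, s / c
    }
  }'
# -> bin 3 insulation 0.25
```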
For 2 different TAD annotation sets (start/end pairs) of the same region (e.g. chr16) we can calculate the similarity of the 2 sets by computing the Measure of Concordance as defined in this paper.
We also compare TADs called from merged matrices using TADCompare; since each merged matrix represents a biological condition (e.g. 16p.iN.WT), this is an explicit differential analysis.
For 2 different TAD boundary annotation sets (just is/is not a boundary) of the same region we can calculate similarity as follows:
- Calculate the distance between all pairs of boundaries between the sets
- For each boundary in set 1, pick the boundary in set 2 with the smallest distance
- If any boundary in set 2 is the nearest to 2 different set 1 boundaries, assign it to the closer set 1 boundary and assign the 2nd closest set 2 boundary to the remaining set 1 boundary
- Using all the boundary-pair distances, summarize in at least 1 of the following ways
- Test if the distribution is > 0 (KS-test vs Gaussian + observed variance) -> need to correct p-values since there are many tests (1 per pair of TAD annotation sets)
- Calculate the mean distance and just compare that between pairs
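Steps 1-2 above can be sketched as a greedy nearest-neighbour search (the tie-reassignment step is omitted for brevity; the coordinates are toy values):

```shell
# Sketch of steps 1-2: for each set-1 boundary, report the nearest set-2
# boundary and its distance. Tie reassignment (step 3) is not handled.
set1='100 500 900'
set2='120 480 1500'
for b1 in ${set1}; do
  bestd=''; best=''
  for b2 in ${set2}; do
    d=$(( b1 > b2 ? b1 - b2 : b2 - b1 ))
    if [ -z "${bestd}" ] || [ "${d}" -lt "${bestd}" ]; then bestd=${d}; best=${b2}; fi
  done
  echo "${b1} -> ${best} (d=${bestd})"
done
```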
We call loops with cooltools dots --smooth --aggregate-smoothed using the ICE-balanced contact matrices, with all default parameters, for all merged matrices @ 25,10,5Kb resolutions.
For all subsequent analyses we only use loops with a q-value < 0.1 under the donut kernel.
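The filter itself is a simple column threshold; note that the header name `la_exp.donut.qval` below is an assumption about the cooltools dots output, so check the real header before reusing this:

```shell
# Sketch: keep loops with donut-kernel q-value < 0.1 from a toy table.
# The column name la_exp.donut.qval is assumed, not verified.
printf 'chrom1\tstart1\tla_exp.donut.qval\nchr16\t100000\t0.02\nchr16\t250000\t0.30\n' |
awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "la_exp.donut.qval") q = i; print; next }
  $q < 0.1'
```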
We evaluate whether a loop is replicated between two conditions using the IDR2D R package.
We rank loops by their -log10(q.value), resolve ambiguous overlaps by picking the most significant (ambiguity_resolution_method = 'value'), and allow +/- 5 bins offset when matching loops to compare (max_gap = 5 * resolution).
We use the multiHiCCompare R package to look for individual differences in bin-pair contact frequency between conditions.
We use the fastlo() implementation of the cyclic LOESS normalization and the function hic_exactTest(), since we are only ever comparing a single binary condition (e.g. DEL vs WT).
We calculate results separately for every resolution (@ 100,50,25,10,5Kb) + comparison + chromosome, then pool results genome-wide (i.e. by resolution by comparison) to apply BH adjustment to the raw p-values.
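The pooled BH adjustment follows the usual recipe: sort the m pooled p-values ascending and take q_i = min over j >= i of p_j * m / j; a toy sketch:

```shell
# Toy Benjamini-Hochberg adjustment over pooled p-values.
printf '%s\n' 0.001 0.01 0.02 0.04 0.5 | sort -g | awk '
  { p[NR] = $1 }
  END {
    m = NR; minv = 1
    for (i = m; i >= 1; i--) { v = p[i] * m / i; if (v < minv) minv = v; q[i] = minv }
    for (i = 1; i <= m; i++) printf "%g %g\n", p[i], q[i]
  }'
```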
We call compartments using cooltools eigs-cis, and phase the calls using the genecov track computed by cooltools genome genecov, using all default params @ 100,50,25,10,5Kb.
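Phasing here means orienting the eigenvector sign so that positive E1 marks the gene-dense A compartment; a toy sketch of the idea (not the cooltools implementation) that flips the sign when E1 covaries negatively with the gene-coverage track:

```shell
# Toy eigenvector phasing: flip E1's sign if its covariance with gene
# coverage is negative, so positive values mark the gene-dense A compartment.
paste <(printf '%s\n' -0.8 -0.5 0.4 0.7) <(printf '%s\n' 9 8 2 1) | awk '
  { e[NR] = $1; g[NR] = $2; se += $1; sg += $2; n = NR }
  END {
    for (i = 1; i <= n; i++) cov += (e[i] - se/n) * (g[i] - sg/n)
    sign = (cov < 0) ? -1 : 1
    for (i = 1; i <= n; i++) print sign * e[i]
  }'
```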
There are several dependencies, including shell tools, python packages and R libraries.
Shell tools from conda environments
# cooler + cooltools
conda env create -f cooltools.env.yml
# distiller-nf + dependencies
conda env create -f distiller.env.yml
# python HiCRep
conda env create -f HiCRep.env.yml
# multiQC
conda env create -f multiqc.env.yml
# hiTAD
conda env create -f TADLib.env.yml
R dependencies
# Basic data handling dependencies + plotting stuff
install.packages(
c(
'ComplexUpset',
'cowplot',
'GGally',
'ggh4x',
'ggnewscale',
'ggplot2',
'ggpubr',
'patchwork',
'scales',
'viridis',
'dplyr',
'glue',
'magrittr',
'purrr',
'stringi',
'tibble',
'tidyverse',
'furrr',
'future',
'here',
'optparse',
'tictoc',
'knitr',
'kableExtra',
'gtable'
)
)
# Bioconductor based dependencies + biologically specific tools
install.packages(
c(
'plyranges',
'idr2d',
'hictkR'
)
)
devtools::install_github("jasonwong-lab/gghic")
devtools::install_github("ajaynadig/TRADEtools")
BiocManager::install(
c(
'AnnotationHub',
'BiocParallel',
'GenomicRanges',
'InteractionSet',
'HiContacts',
'TADCompare',
'multiHiCcompare'
)
)
Generate .mcool files from .fastq files using the distiller-nf pipeline
# Generate individual yml files each sample for the distiller-nf pipeline using a template
# Variables in the template are defined in locations.R or in make.distiller.configs.R
Rscript ./scripts/distiller/make.distiller.configs.R ./sample.configs/template.distiller.yml
# Run the distiller-nf pipeline by submitting each sample config as an individual SLURM job
module load wget; module unload java; module load singularity/3.7.0; conda activate dist2
./scripts/distiller/run.distiller.sh -w "./work_${USER}" -a ~/miniforge3 ./sample.configs/16p.*.yml
Generate various QC results from distiller-nf output files
# Generate qc3C report for each bam file
./scripts/matrix.utils.sh qc3C ./results/sample.QC/qc3C/ DpnII HinfI results/mapped_parsed_sorted_chunks/**/*.bam
# Generate multiQC reports from fastqc and pairtools results
./scripts/matrix.utils.sh multiqcs ./results/sample.QC/multiqc.reports/ ./results/
# Generated merged matrices for each condition
./scripts/matrix.utils.sh merge_16p ./results/coolers_library
# Calculate total bin-wise coverage
./scripts/coverage/run.cooltools.coverage.sh -o ./results/sample.QC/coverage/ ./results/coolers_library/**/*.mapq_30.1000.mcool
Generate HiCRep results
# Generate list of commands to run for all pairs of matrices + all hyper-param combos
conda activate hicrep
Rscript ./scripts/hicrep/run.hicrep.R && parallel -j $(nproc) --bar --eta :::: ./results/hicrep/all.hicrep.cmds.txt
Generate Weiner et al. 2022 replication analysis figures
Rscript ./scripts/weiner.replication/make.replication.figures.R
Generate TAD and insulation annotations
# Generate TAD annotations with hiTAD
./scripts/TADs/run.TAD.Callers.sh -e inplace hiTAD ./results/coolers_library/**/*.Merged.Merged*.mapq_30.1000.mcool
# Generate TAD boundary annotations with cooltools
# ./scripts/TADs/run.TAD.Callers.sh -e inplace cooltools ./results/coolers_library/**/*.Merged.Merged*.mapq_30.1000.mcool
# Generate Consensus TAD results from set of individual matrices with spectralTAD
Rscript ./scripts/TADs/run.ConsensusTADs.R
Generate TADCompare results
# Run TADCompare to generate differential TAD results
# requires 120GB of memory for the largest matrix comparison (i.e. chr1 @5Kb)
Rscript ./scripts/TADs/run.TADCompare.R
Generate Loop results
# generate loop annotations with cooltools + all default params
./scripts/loops/run.loops.cooltools.sh ./results/coolers_library/**/*.Merged.Merged*.mapq_30.1000.mcool
# Use IDR2D to define which loops are reproducible between conditions
Rscript scripts/loops/run.IDR2D.loops.R
# Map loops + reproducibility to gene coordinates
# Rscript scripts/loops/map.loops.to.genes.R
Generate multiHiCCompare results
Rscript ./scripts/DifferentialContacts/run.multiHiCCompare.R
Generate Compartment annotations
./scripts/compartments/run.compartments.cooltools.sh -m mk_phase
./scripts/compartments/run.compartments.cooltools.sh -m compartments ./results/coolers_library/**/*.Merged.Merged*.mapq_30.1000.mcool
Make annotated HiC heatmaps w/gghic
Rscript ./scripts/gghic.plots/plot.annotated.contact.heatmaps.R
Here is a convenient function to compile rmarkdown notebooks in bash
Put this in your .bashrc to use it
knit() {
input_rmd="$(readlink -e "${1}")"
if [[ -z "${2}" ]]; then
self_contained="FALSE"
else
self_contained="TRUE"
fi
echo "self_contained: ${self_contained}"
output_html="${input_rmd%.Rmd}.html"
R -q -e "
library(rmarkdown)
rmarkdown::render(
input='${input_rmd}',
output_file='${output_html}',
output_dir=NULL,
output_format=
rmdformats::html_clean(
code_folding='hide',
df_print='paged',
self_contained=${self_contained},
lightbox=TRUE,
gallery=TRUE,
toc_depth=5,
thumbnails=FALSE
)
)"
}
Before that, we want to collate the individually generated results files into a single large .tsv for each analysis, so first run
Rscript scripts/coallate.results.R
We also want to separately generate the figures that will be inserted into the notebooks later
Rscript ./scripts/plot.gghic/plot.annotated.HiC.heatmaps.R
Rscript ./scripts/weiner.replication/make.replication.figures.R
Now we can render each notebook
knit Matrix.QC.Rmd
knit Matrix.Coverage.Rmd
knit HiCRep.Rmd
knit Weiner.Replication.Rmd
knit TADs.Rmd
knit TADCompare.Rmd
knit Loops.Rmd
knit Loop.Reproducibility.Rmd
knit Loop.Integration.Rmd
knit multiHiCCompare.Rmd
knit HiC.Heatmaps.Rmd
# List all matrices
find results/coolers_library/ -type f -name '*mapq_30.1000.mcool' | rev | cut -d'/' -f1 | rev | cut -d'.' -f-3,7 | sort | uniq -c | column -s'.' -t
# List the marginal coverage results for all samples
find results/sample.QC/coverage/ -type f | sed -e 's/-coverage.tsv//' | cut -d'/' -f4-5 | sort | uniq -c | column -s'/' -t
# List all HiCRep results
find results/hicrep/ -type f -name '*.txt' | sed -e 's/-hicrep.txt//' | cut -d'/' -f3-6 | sort | uniq -c | column -s'/' -t
# List all cooltools insulations results
# find results/TADs/results_TADs/method_cooltools/ -type f -name '*-TAD.tsv' | cut -d'/' -f4-7 | sed -e 's/-TAD.tsv//' | sort | uniq -c | column -s'/' -t
# List all hiTAD results
find results/TADs/results_TADs/method_hiTAD/ -type f -name '*-TAD.tsv' | cut -d'/' -f4-6 | sed -e 's/-TAD.tsv//' | sort | uniq -c | column -s'/' -t
# List all ConsensusTAD results
find results/TADs/results_TADs/method_ConsensusTAD/ -type f | cut -d'/' -f4-8,10- | sed -e 's/-ConsensusTADs.tsv//' | sort | uniq -c | column -s'/' -t
# List all TADCompare results
find results/TADs/results_TADCompare/ -type f -name '*-TADCompare.tsv' | sed -e 's/-TADCompare.tsv//' | sed -e 's/_vs_/\//' | cut -d'/' -f4-7,9- | sort | uniq -c | column -s'/' -t
# List all loop results
find results/loops/results_loops/method_cooltools -type f -name '*-dots.tsv' | cut -d'/' -f4-6 | sort | uniq -c | column -s'/' -t
# IDR2D results
find results/loops/results_IDR2D -type f -name '*.tsv' | cut -d'/' -f4-6 | sort | uniq -c | column -s'/' -t
# List all multiHiCCompare results
find results/multiHiCCompare/results/ -type f -name '*-multiHiCCompare.tsv' | sed -e 's/-multiHiCCompare.tsv//' | cut -d'/' -f4-6,8 | sort | uniq -c | column -s'/' -t
# List compartment results
find results/compartments/results_compartments/ -type f -name '*-cis.vecs.tsv' | sed -e 's/-cis.vecs.tsv//' | cut -d'/' -f4- | sort | uniq -c | column -s'/' -t