Skip to content

talkowski-lab/16p_HiC

Repository files navigation

16p HiC Analysis

Analysis and code of HiC data for 16p samples

Table of Contents

File Structure

File structure of all input data, reference files and results

All filepaths below are relative to the project root on our server: /data/talkowski/Samples/16p_HiC/

Code Reproducibility

Files for reproducibility

$ tree -L 2 reference.files/ distiller-nf/
├── distiller-nf/
└── reference.files/
    ├── conda.envs
    │   ├── cooltools.env.yml
    │   ├── distiller.env.yml
    │   ├── HiCRep.env.yml
    │   ├── multiqc.env.yml
    │   ├── qc3C.env.yml
    │   └── TADLib.env.yml
    ├── packages
    │   ├── functionsdchic_1.0.tar.gz
    │   └── hashmap_0.2.2.tar.gz
    └── README.md

Input Data

Notice that all files contain a SampleID specifying which library the results are. Some files contain a pair of SampleIDs since they are comparing 2 samples e.g. HiCRep. All SampleIDs are follow the same format format Project.CellType.Genotype.BioRepID.TechRepID e.g. 16p.NSC.DEL.A3.TR1

  • 16p: project this sample is a part of
  • NSC: Celltype {NSC, iN}
  • DEL: Genotype of the sample for the region/gene of interest {WT,DEL,DUP}
  • A3: ID string specifiying which biological replicate the sample is
  • TR1: ID string specifiyng which technical replicate the sample is
$ tree /data/talkowski/Samples/16p_HiC/
./
├── reference.files/
│   └── genome.reference
│       ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.chrom.sizes
│       ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
│       ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.amb
│       ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.ann
│       ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.bwt
│       ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai
│       ├── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.pac
│       └── GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.sa
├── HiC.16p.sample.metadata.tsv                                       # metadata for all HiC samples
├── fastq/                                                            # Raw reads for HiC samples
│   ├── 22LCC2LT4_3_2148261314_16pDELA3NSCHiC_S1_L003_R1_001.fastq.gz
│   ├── 22LCC2LT4_3_2148261314_16pDELA3NSCHiC_S1_L003_R2_001.fastq.gz
│   └── *.fastq.gz
└── sample.configs/                                                   # Config files for distiller
    ├── template.distiller.yml                                        # defining distiler-nf params
    ├── 16p.NSC.DEL.A3.TR1.distiller.yml                              # Template with specific fastq files
    └── *.distiller.yml

Genome Annotation Files

Annotations of genomic features, used for association analyses

$ tree -L 1 /data/talkowski/Samples/16p_HiC/reference.files/
reference.files/
├── gencode.v38.annotation.gtf.gz    # Gene position + info annotations
├── hg38_gene_deserts.tsv            # locations of gene deserts
├── hg38.segdups.bed.gz              # locations of Segmental Duplications
├── hg38.segdups.bed.gz.tbi
├── CTCF.annotations.tsv             # Annotated CTCF sites
├── ENCODE.v4.cCRE.anontations.tsv   # ENCODE cCREs annotations
├── gene.constraints.CNVR.tsv        # Constraint metrics for each gene
└── genome.tracks/                   # Several set metrics, computed bin-wise across the genome
    ├── ???
    └── track.type_genecov/

Results Files

The files used as input or that were generated by the distiller-nf are listed below NOTE: ... indicates that these files exist for all samples/pairs, truncated for brevity

$ tree -L 1 /data/talkowski/Samples/16p_HiC/results/ -P "*_library|mapped_*|*qc"
└── results/
    ├── sample.QC/
    │   ├── multiqc.reports/
    │   │   ├── fastp.multiqc.html
    │   │   ├── fastqc.multiqc.html
    │   │   ├── pairtools.multiqc.html
    │   │   └── qc3C.multiqc.html
    ├── coolers_library/
    │   ├── 16p.NSC.DEL.A3.TR1/
    │   │   ├── 16p.NSC.DEL.A3.TR1.hg38.mapq_30.1000.cool
    │   │   ├── 16p.NSC.DEL.A3.TR1.hg38.mapq_30.1000.mcool
    │   │   ├── 16p.NSC.DEL.A3.TR1.hg38.no_filter.1000.cool
    │   │   └── 16p.NSC.DEL.A3.TR1.hg38.no_filter.1000.mcool
    │   └── .../
    ├── fastqc/
    │   ├── 16p.NSC.DEL.A3.TR1/
    │   │   ├── 16p.NSC.DEL.A3.TR1.lane1.0.2_fastqc.html
    │   │   ├── 16p.NSC.DEL.A3.TR1.lane1.0.2_fastqc.zip
    │   │   ├── 16p.NSC.DEL.A3.TR1.lane1.0.1_fastqc.html
    │   │   └── 16p.NSC.DEL.A3.TR1.lane1.0.1_fastqc.zip
    │   └── .../
    ├── mapped_parsed_sorted_chunks
    │   ├── 16p.NSC.DEL.A3.TR1/
    │   │   ├── 16p.NSC.DEL.A3.TR1.lane1.hg38.0.fastp.json
    │   │   ├── 16p.NSC.DEL.A3.TR1.lane1.hg38.0.fastp.html
    │   │   ├── 16p.NSC.DEL.A3.TR1.lane1.hg38.0.pairsam.gz
    │   │   └── 16p.NSC.DEL.A3.TR1.lane1.hg38.0.bam
    │   └── .../
    └── pairs_library
        ├── 16p.NSC.DEL.A3.TR1/
        │   ├── 16p.NSC.DEL.A3.TR1.hg38.dedup.stats
        │   ├── 16p.NSC.DEL.A3.TR1.hg38.dups.bam
        │   ├── 16p.NSC.DEL.A3.TR1.hg38.dups.pairs.gz
        │   ├── 16p.NSC.DEL.A3.TR1.hg38.nodups.bam
        │   ├── 16p.NSC.DEL.A3.TR1.hg38.nodups.pairs.gz
        │   ├── 16p.NSC.DEL.A3.TR1.hg38.unmapped.bam
        │   ├── 16p.NSC.DEL.A3.TR1.hg38.unmapped.pairs.gz
        │   └── 16p.NSC.DEL.A3.TR1.hg38.nodups.pairs.gz.px2
        └── .../

List of coallated results across annotation types + analyses

$ tree -L 1 /data/talkowski/Samples/16p_HiC/results/ -I "*_library|mapped_*|*qc"
results/
├── RNASeq
│   ├── all.DESeq2.results.tsv
│   ├── all.expression.data.tsv
│   ├── all.TRADE.results.tsv
│   ├── DESeq2
│   └── expression
├── sample.QC
│   ├── coverage
│   ├── expected.coverage
│   ├── minimum.viable.resolutions.tsv
│   ├── multiqc.reports
│   └── resolution.coverage.summaries.tsv
├── hicrep
│   ├── all.hicrep.cmds.txt
│   ├── all.hicrep.scores.tsv
│   └── results
├── weiner.replication
│   └── plots
├── TADs
│   ├── results_TADs
│   ├── all.ConsensusTAD.TADs.tsv
│   ├── all.hiTAD.TADs.tsv
│   ├── all.all.TADs.tsv
│   ├── all.all.TAD.MoCs.tsv
│   ├── all.TADCompare.results.tsv
│   ├── all.TADCompare.n.results.tsv
│   └── results_TADCompare
├── loops
│   ├── all.cooltools.loops.tsv
│   ├── filtered.cooltools.loops.tsv
│   ├── all.cooltools.IDR2D.results.tsv
│   ├── filtered.cooltools.IDR2D.results.tsv
│   ├── results_IDR2D
│   └── results_loops
├── multiHiCCompare
│   ├── all.multiHiCCompare.n.results.tsv
│   ├── all.multiHiCCompare.results.tsv
│   └── results
├── compartments
│   ├── all.cooltools.compartments.tsv
│   └── results_compartments
└── gghic.plots 
    └── plots

Notebooks

List all rmarkdown notebooks withsets of results

$ tree -L 1 /data/talkowski/Samples/16p_HiC/notebooks/ -P '*.html'
/data/talkowski/Samples/16p_HiC/notebooks/
├── matrix.QC.html
├── matrix.coverage.html
├── hicrep.html
├── weiner.replication.html
│   └── find /data/talkowski/Samples/16p_HiC/results/gghic.plots/plots/ -type f -name '*.pdf'
├── TADs.html
├── TADCompare.html
├── loops.html
├── loop.reproducibility.html
├── loop.integration.html
├── multiHiCCompare.html
└── HiC.heatmaps.html
    └── find /data/talkowski/Samples/16p_HiC/results/weiner.replication/plots/ -type f -name '*.pdf'

Methods

Running distiller pipeline

Each sample as a .yml file (in ./sample.configs) specifiying the params for distiller-nf to run the sample with. We separte each sample into its own file so we can run them in parallel, but the only difference between files are the input fastq files, all pipeline parameters are the same. We use 64 cores total, with 128 maxCPUs set in the nextflow config. This fully processes a sample (.fastq -> .mcool) with ~400M reads in ~9h.

Running qc3C profiling

Use the tool qc3C github in bam mode to profile quality metrics for our HiC samples.

Generate MultiQC reports

distiller-nf outputs multiple QC reports/files per sample than can each be aggregated into a single multiqc report docs. In the two previous steps we have also generated stats files which can be aggretaed via multiqc into their own reports. Ultimately we can generate 3 multiqc reports

  • fastqc: Quality statistics for reads
  • fastp: Trimming+MAPQ statistics for filtered reads that were aligned
  • qc3C: Quality statistics from sub-sampled aligned reads
  • pairtools stats: Summary statistics of processed HiC pairs

Calculate QC Metrics

Two things we want to check to QC matrix samples

  1. pair frequency by distance
  2. Cis/Trans pair frequency
  3. minimum resolution as defined in Rao et al. 2014.

Metrics 1 and 2 are calcualted by distiller-nf and found in the *.dedup.stats files. For metric 3 and subsequent plots we need to calculate the per-bin coverage using cooltools coverage .

Merging Matrices

For some subsequent steps we will analyze matrices formed by merging all biological/technical replicates for a given Edit+Genotype+CellType (e.g. 16p+WT+NSC). Merging her means summing all the total number of contacts over all samples for each bin-bin pair (matrix entry) and is handled by cooler merge. We create merged matrices for MAPQ filtered for each which produces the following files

results 
└── coolers_library
    ├── 16p.NSC.WT.Merged.Merged/
    │   ├── 16p.NSC.WT.Merged.Merged.hg38.mapq_30.1000.cool
    │   └── 16p.NSC.WT.Merged.Merged.hg38.mapq_30.1000.mcool
    ├── 16p.NSC.DEL.Merged.Merged/
    │   └── ...
    └── ...

HiCRep Analysis

We use HiCRep to calculate the "reproducibility score" for all pairs of sample matrices, under several parameter combinations. The command below actually runs the HiCRep and produces 1 file per sample pair + parameter combination, each file contains scores for each chromosome separately (chr{1..22,X,Y}). After this there is a .Rmd notebook that coallates these files into a single neat dataframe that is used for plotting.

TAD Analysis

TAD Annotation

We generate the TAD annotations we use 3 different programs:

  1. HiTAD
  2. TADCompare consensusTAD()

For HiTAD we only generate single-level TADs (not hierarchical) using default parameters. This produces a set of bin-pairs, each pair marking the start and end of the predicted TAD.

Both of these tools also output the Diamond Insulation (DI) score calcualted per genomic bin. This can be used to score regions to compare the relative insulation of regions.

TAD Comparison

For 2 different TAD annotation sets (start/end pairs) of the same region (e.g. chr16) we can calculate the similarity of the 2 sets by computing the Measure of Concordance as defined in this paper.

We also compare TADs called from merged matrices using TADCompare, since each merged matrix represents a biological condition (e.g. 16p.iN.WT) it is an expliciit differential analysis

For 2 different TAD boundary annotation sets (just is/is not a boundary) of the same region we can calculate similarity as follows:

  1. Calculate the distance between all pairs of boundaries between the sets
  2. For each boundary in set 1 pick the boundary in set 2 with the samllest distance
  3. If there are any boundaries are set 2 that are the nearest to 2 different set 1 bounadaries, assign it to the closest set 1 boundary and assign the 2nd closest boundary in set 2 to the remaining set 1 boundary
  4. Using all the boundary-pair distance, summarize in at least 1 of the following ways
    1. Test if the distribution is > 0 (KS-test vs Gaussian + observed variance) -> need to correct p-values since there are many tests (1 per pair of TAD annotation set)
    2. Calculate mean distance and just compare that between pairs

Loop Analysis

Loop Annotation

We call loops with cooltools dots --smooth --aggregate-smoothed using the ICE-balanced contact matrices, with all default parameters for all merged matrices @ 25,10,5Kb resolutions. For all subsequent analyses we only use loops with a qvalue < 0.1 with the donut kernel.

Loop Comparison

We evaluate wheter a loop is replicated or not between two conditions using the IDR2D R package. We rank loops by their -log10(q.value), and we resolve ambiguous overlaps by the picking the most significant ('ambiguity_resolution_method='value') and we allow +/- 5 bins offset when matching loops to compare (max_gap = 5 * resolution).

Differential Contact Analysis

We use the multiHiCCompare R package to look for individual difference in bin-pair contact frequency between conditions. We use the fastlo() implementation of the cyclic LOESS normalizaiton and we use the function hic_exactTest() since we are only every comparing a single binary condition (e.g. DEL vs WT). We calulate results separately for every resolution (@ 100,50,25,10,5Kb) + comparison + chromosome, and pool results genome-wide (i.e. by resolution by comparison) to apply BH adjustment to the raw p-values.

Compartment Analysis

Compartment Annotation

We call compartments using cooltools eigs-cis, and phase the calls using genecov track, compuated by cooltools genome genecov, using all default params @ 100,50,25,10,5Kb.

Compartment Comparison

Different filler text

Reproducing Results

Dependencies

There are several dependencies, including shell tools, python and R libraries Shell Tools from conda environments

# cooler + cooltools
conda env create -f cooltools.env.yml
# distiller-nf + dependencies
conda env create -f distiller.env.yml
# python HiCRep
conda env create -f HiCRep.env.yml
# multiQC
conda env create -f multiqc.env.yml
# hiTAD
conda env create -f TADLib.env.yml

R dependencies

# Basic data handling dependencies + plotting stuff
install.packages(
    c(
        'ComplexUpset',
        'cowplot',
        'GGally',
        'ggh4x',
        'ggnewscale',
        'ggplot2',
        'ggpubr',
        'patchwork',
        'scales',
        'viridis',

        'dplyr',
        'glue',
        'magrittr',
        'purrr',
        'stringi',
        'tibble',
        'tidyverse',

        'furrr',
        'future',

        'here',
        'optparse',
        'tictoc',
        'knitr',
        'kableExtra',
        'gtable',
    )
)

# Bioconductor based dependencies + biologically specific tools
install.packages(
    c(
        'plyranges',
        'idr2d',
        'hictkR'
    )
)
devtools::install_github("jasonwong-lab/gghic")
devtools::install_github("ajaynadig/TRADEtools")
BiocManager::install(
    c(
        'AnnotationHub',
        'BiocParallel',
        'GenomicRanges',
        'InteractionSet',
        'HiContacts',
        'TADCompare',
        'multiHiCcompare'
    )
)

Generate Results

Generate .mcool files from .fastq files using the distiller-nf pipeline

# Generate individual yml files each sample for the distiller-nf pipeline using a template
# Variables in the template are defined in locations.R or in make.distiller.configs.R
Rscript ./scripts/distiller/make.distiller.configs.R ./sample.configs/template.distiller.yml
# Run the distiller-nf pipeline by submitting each sample config as an individual SLURM job
module load wget; module unload java; module load singularity/3.7.0; conda activate dist2
./scripts/distiller/run.distiller.sh -w "./work_${USER}" -a ~/miniforge3 ./sample.configs/16p.*.yml

Generate various QC results from distiller-nf output files

# Generate qc3C report for each bam file
./scripts/matrix.utils.sh qc3C ./results/sample.QC/qc3C/ DpnII HinfI results/mapped_parsed_sorted_chunks/**/*.bam  
# Generate multiQC reprots from fastqc and pairtools resutls
./scripts/matrix.utils.sh multiqcs ./results/sample.QC/multiqc.reports/ ./results/
# Generated merged matrices for each condition
./scripts/matrix.utils.sh merge_16p ./results/coolers_library
# Calculate total bin-wise coverage 
./scripts/coverage/run.cooltools.coverage.sh -o ./results/sample.QC/coverage/ ./results/coolers_library/**/*.mapq_30.1000.mcool

Generate HiCRep results

# Generate list of commands to run for all pairs of matrices + all hyper-param combos
conda activate hicrep
Rscript ./scripts/hicrep/run.hicrep.R && parallel -j $(nproc) --bar --eta :::: ./results/hicrep/all.hicrep.cmds.txt

Generate Weiner et al. 2022 replication analysis figures

Rscript ./scripts/weiner.replication/make.replication.figures.R

Generate TAD and insulation annotations

# Generate TAD annotations with hiTAD 
./scripts/TADs/run.TAD.Callers.sh -e inplace hiTAD ./results/coolers_library/**/*.Merged.Merged*.mapq_30.1000.mcool
# Generate TAD boundary annotations with cool
# ./scripts/TADs/run.TAD.Callers.sh -e inplace cooltools ./results/coolers_library/**/*.Merged.Merged*.mapq_30.1000.mcool
# Generate Consensus TAD results from set of individual matrices with spectralTAD
Rscript ./scripts/TADs/run.ConsensusTADs.R

Generate TADCompare results

# Run TADCompare to generated differential TAD results
# requires 120Gb for the largest matrix comparison (i.e. chr1 @5Kb)
Rscript ./scripts/TADs/run.TADCompare.R

Generate Loop results

# generate loop annotations with cooltools + all default params
./scripts/loops/run.loops.cooltools.sh ./results/coolers_library/**/*.Merged.Merged*.mapq_30.1000.mcool
# Use IDR2D to define which loops are reproducible between conditions
Rscript scripts/loops/run.IDR2D.loops.R
# Map loops + reproducibility to gene coordinates
# Rscript scripts/loops/map.loops.to.genes.R

Generate multiHiCCompare results

Rscript ./scripts/DifferentialContacts/run.multiHiCCompare.R

Generate Compartment annotations

./scripts/compartments/run.compartments.cooltools.sh -m mk_phase
./scripts/compartments/run.compartments.cooltools.sh -m compartments ./results/coolers_library/**/*.Merged.Merged*.mapq_30.1000.mcool

Make annotated HiC heatmaps w/gghic

Rscript ./scripts/gghic.plots/plot.annotated.contact.heatmaps.R

Compile Notebooks

Here is a convenient function to compile rmarkdown notebooks in bash Put this in your .bashrc to use it

knit() {
    input_rmd="$(readlink -e "${1}")"
    if [[ -z "${2}" ]]; then
        self_contained="FALSE"
    else 
        self_contained="TRUE"
    fi
    echo "self_contained: ${self_contained}"
    output_html="${input_rmd%.Rmd}.html"
    R -q -e "
    library(rmarkdown)
    rmarkdown::render(
        input='${input_rmd}',
        output_file='${output_html}',
        output_dir=NULL,
        output_format=
            rmdformats::html_clean(
                code_folding='hide',
                df_print='paged',
                self_contained=${self_contained},
                lightbox=TRUE,
                gallery=TRUE,
                toc_depth=5,
                thumbnails=FALSE
            )
    )"
}

Before that, we want to coallate individually generated results files into a single large .tsv for each analysis, so you can run

Rscript scripts/coallate.results.R

We also want to separately generate figures for notebooks that will be inserted into notebooks later

Rscript ./scripts/plot.gghic/plot.annotated.HiC.heatmaps.R
Rscript ./scripts/weiner.replication/make.replication.figures.R

Now we can compute each notebooks

knit Matrix.QC.Rmd
knit Matrix.Coverage.Rmd
knit HiCRep.Rmd
knit Weiner.Replication.Rmd
knit TADs.Rmd
knit TADCompare.Rmd
knit Loops.Rmd
knit Loop.Reproducibility.Rmd
knit Loop.Integration.Rmd
knit multiHiCCompare.Rmd
knit HiC.Heatmaps.Rmd

File Summary One-liners

# List all matrices
find results/coolers_library/ -type f -name '*mapq_30.1000.mcool' | rev | cut -d'/' -f1 | rev  | cut -d'.' -f-3,7 | sort | uniq -c | column -s'\.' -t
# List the marginal coverage results for all samples
find results/sample.QC/coverage/ -type f  | sed -e 's/-coverage.tsv//' | cut -d'/' -f4-5 | sort | uniq -c | column -s'/' -t
# List all HiCRep results
find results/hicrep/ -type f -name '*.txt' | sed -e 's/-hicrep.txt//' | cut -d'/' -f3-6 | sort | uniq -c | column -s'/' -t
# List all cooltools insulations results
# find results/TADs/results_TADs/method_cooltools/ -type f -name '*-TAD.tsv' | cut -d'/' -f4-7 | sed -e 's/-TAD.tsv//' | sort | uniq -c | column -s'/' -t
# List all hiTAD results
find results/TADs/results_TADs/method_hiTAD/ -type f -name '*-TAD.tsv' | cut -d'/' -f4-6 | sed -e 's/-TAD.tsv//' | sort | uniq -c | column -s'/' -t
# List all ConsensusTAD results
find results/TADs/results_TADs/method_ConsensusTAD/ -type f  | cut -d'/' -f4-8,10- | sed -e 's/-ConsensusTADs.tsv//' | sort | uniq -c | column -s'/' -t
# List all TADComapre results
find results/TADs/results_TADCompare/ -type f -name '*-TADCompare.tsv' | sed -e 's/-TADCompare.tsv//' | sed -e 's/_vs_/\//' | cut -d'/' -f4-7,9- | sort | uniq -c | column -s'/' -t
# List all loop results
find results/loops/results_loops/method_cooltools -type f -name '*-dots.tsv' | cut -d'/' -f4-6 | sort | uniq -c | column -s'/' -t
# IDR2D results
find results/loops/results_IDR2D -type f -name '*.tsv' | cut -d'/' -f4-6 | sort | uniq -c | column -s'/' -t
# List all multiHiCCompare results
find results/multiHiCCompare/results/ -type f -name '*-multiHiCCompare.tsv' | sed -s 's/-multiHiCCompare.tsv//' | cut -d'/' -f4-6,8 | sort | uniq -c | column -s'/' -t
# List compartment results
find results/compartments/results_compartments/ -type f -name '*-cis.vecs.tsv' | sed -s 's/-cis.vecs.tsv//' | cut -d'/' -f4- | sort | uniq -c | column -s'/' -t

About

Analysis and code of HiC data for 16p editied cell lines

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages