Code for the paper "Disease-Linked Regulatory DNA Variants and Homeostatic Transcription Factors in Epidermis" (2025). This code concerns allele-specific binding events (ASBs) in CUT&RUN.
This collection of snakemakes starts with bam/bigwig files and MACS2 peaks files as generated by https://github.com/khavarilab/cutnrun-snakemake, which was run first and starts with the raw fastq files (specifically, the version of that pipeline used for the 2025 Nat Comms paper is the "tf_paper_2025" branch).
You can get some needed reference files with this:
mkdir -p assets
cd assets
wget https://hgdownload.soe.ucsc.edu/goldenpath/hg38/bigZips/hg38.fa.masked.gz
gunzip hg38.fa.masked.gz
samtools faidx hg38.fa.masked
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/gencode.v39.basic.annotation.gtf.gz
gunzip gencode.v39.basic.annotation.gtf.gz
mkdir -p ../data
cd ../data
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_39/GRCh38.primary_assembly.genome.fa.gzFor the example, first edit the example/config.yaml config to point to a bowtie2 human index. You can then make a symlink to this repo's example/ directory in the other snakemake folder so the relative path in the config.yaml will work. Then run that cutnrun-snakemake pipeline passing it the example/config.yaml file:
snakemake --use-conda --rerun-incomplete --configfile example/config.yaml --printshellcmds --reason -j 12After running that pipeline to generate bam files, bigwigs and MACS2 files, this set of three can be run to perform initial analysis. Usage:
threads=1
config=example/config.yaml
snakemake -s snake.2.motifs.py --rerun-incomplete --configfile ${config} --printshellcmds --reason -j ${threads}
snakemake -s snake.3.annoAndPca.py --rerun-incomplete --configfile ${config} --printshellcmds --reason -j ${threads}
snakemake -s snake.4.differential.py --rerun-incomplete --configfile ${config} --printshellcmds --reason -j ${threads} To protect patient privacy, the example fastq files are not from the 2025 Nat Comms study, and therefore not what the files are named. They are actually public SRA files ZFX (named as CJUN) and ZNF24 (named as IgG) from MCF7 cells. These SRA accessions were used specifically:
SRR5931658
SRR5931657
SRR5112266
SRR5112267This performs motif analysis on the MACS2 peaks.
This annotates the peak files and performs PCA.
This compares the progenitor and day4 differentiation conditions.
Applies bcftools and the genomic fasta to the bam files to generate vcf files of variants, then applies the binomial test to those variants.
Just QC. This checks heterozygote ratios in vcfs - if ratios are not averaging at 50:50, then the data is from multiple people.
For the observed SNVs, use the genomic reference to generate tiled reads aroud the SNV in WT and variant forms, then map those to the genome. Check if the observed ratio of mapping is 50:50. If not, then a mapping bias could cause the ASB (not common). Needs bowtie2.
Compares ASB events with known motifs.
Adds information such as GTEX and MAF to the ASBs.
Tests overlaps between ASBs and gene categories.