Please note that a lot of the code here is taken directly from Darren Cusanovich's fly-atac repository: https://github.com/shendurelab/fly-atac, all credits go to the respective author(s). Most of the changes I made are intended for integration with our pipelines and for adaptation to our local cluster environment.
Also please note that while this repository is public, it is mainly intended for internal use by the Furlong lab members.
The starting point of this pipeline are the sequencing FASTQ files for READ1 and READ2. Sequencing is performed with four custom primers, two for READ1 and READ2 and two dedicated primers for reading INDEX1 and INDEX2. Our sequencing facility normally performs the bcl2toFASTQ convertion, and the indexes are added to the name of each read. Before starting, confirm that the two indexes (18 bp each) are correctly reported in the read name and separated by the symbol "+".
This pipeline was written for Python2, and it won't work if you are using Python3. Moreover several libraries are required, therefore I suggest creating a dedicated conda environment. I provide a .yml file from which you should be able to recreate the conda environment I use. Make sure to activate the conda environment before starting.
conda env create --prefix /path/yourenvironmentname/ --file sciATAC.yml
source activate /path/yourenvironmentname/The indexes of each read are checked against the whole space of possible barcode combinations, and they are flagged as exact, edited (if they can be corrected according to the distance rules) or failed matches. Note that no reads are discarded at this stage. Moreover reads are trimmed for the Nextera transposome sequence with Trimmomatic. New FASTQ files are generated for READ1 and READ2, which serve as input for mapping.
python sc_atac_fastq_fix_trim.py -R1 READ1 -R2 READ2 -O OUTDIR -P PREFIXStandard mapping with bowtie2. "GENOME" points to your favorite bowtie genome index. Standard error is redirected to a LOG file of your choosing. The output is a combined BAM file.
bowtie2 -p 8 -X 2000 -3 1 -x GENOME -1 READ1 -2 READ2 2> LOG | samtools view -bS - > OUTBAMAfter mapping, the BAM file is filtered for low quality mapping reads, unwanted chromosomes and reads with failed or ambigous barcodes. The output is a sorted BAM file.
samtools view -h -f3 -F12 -q10 INBAM | grep -v '[0-9]'$'\t'chrM | grep -v '[0-9]'$'\t'chrU | grep -v _CTF_ | grep -v _AMBIG_ | samtools view -Su - | samtools sort -@ 8 - -T temp -o OUTBAM; samtools index OUTBAMsamtools view -h -f3 -F12 -q10 INBAM | grep -v MT | grep -v CHR_ | grep -v KI | grep -v GL | grep -v _CTF_ | grep -v _AMBIG_ | samtools view -Su - | samtools sort -@ 8 - -T temp -o OUTBAM; samtools index OUTBAMDuplicate reads are removed for each cell barcode.
python sc_atac_true_dedup.py INBAM OUTBAM; samtools index OUTBAMIn this step, the barcodes that exist in the experiment are retrieved and used together with a read count based cutoff to call cells.
All possible combinations of Tn5 and PCR indexes used in the experiment are generated. The INPATH must lead to the location of the index files (pcr_P5.txt, pcr_P7.txt, tn5_i5.txt, tn5_i7.txt). The sample and plate lists must be R objects saved in R data format (.rds). The output are two tab-delimited text files listing the barcodes and the sample / plate assignment.
Rscript indextable_generator.R INPATH SAMPLELIST PLATELIST OUTPATH PREFIXUsing the deduplicated BAM generated above, reads are counted for each barcode. The output is a tab-delimited text file which reports, for each barcode, the read count and the condition / sample assignment (barcodes that don't exist in the experiment are labelled as 'bkgd').
python sc_atac_barcode_read_counter.py INBAM INDEXTABLE OUTFILEBarcodes that exist in the experiment and that are above the specified cutoff are determined to be cells. The script requires the same PREFIX used for the output report above. The cutoff parameter can be either a numeric value (ex. 500, it retains cells > 500 reads) or "auto", in which case a cutoff based on mixture modelling is applied (using package mclust). The cutoff is applied equally to any condition / sample present in the experiment. The output is a new index table containing only the barcodes surviving the cutoff and a .pdf file showing the read count distribution per condition / sample.
Rscript sc_atac_cell_cutoff.R PREFIX CUTOFFNew BAM files, retaining only the barcodes identified as cells above, are generated per condition / sample.
python sc_atac_library_deconvoluter.py INBAM INDEXTABLE OUTPREFIX .bamReads are counted by cell and by region specified in the BED file. The ouput is a tab-delimited count matrix of regions (rows) by cells (columns).
python sc_atac_window_counter.py INBAM INDEXTABLE INBED OUTFILE [Include sites with no reads (True/False)]