sci-ATAC-seq processing pipeline

Please note that a lot of the code here is taken directly from Darren Cusanovich's fly-atac repository: https://github.com/shendurelab/fly-atac, all credits go to the respective author(s). Most of the changes I made are intended for integration with our pipelines and for adaptation to our local cluster environment.

Also please note that while this repository is public, it is mainly intended for internal use by the Furlong lab members.

Step 0. Preparations

The starting point of this pipeline are the sequencing FASTQ files for READ1 and READ2. Sequencing is performed with four custom primers, two for READ1 and READ2 and two dedicated primers for reading INDEX1 and INDEX2. Our sequencing facility normally performs the bcl2toFASTQ convertion, and the indexes are added to the name of each read. Before starting, confirm that the two indexes (18 bp each) are correctly reported in the read name and separated by the symbol "+".

This pipeline was written for Python2, and it won't work if you are using Python3. Moreover several libraries are required, therefore I suggest creating a dedicated conda environment. I provide a .yml file from which you should be able to recreate the conda environment I use. Make sure to activate the conda environment before starting.

conda env create --prefix /path/yourenvironmentname/ --file sciATAC.yml

source activate /path/yourenvironmentname/

Step 1. Barcode fixation and trimming

The indexes of each read are checked against the whole space of possible barcode combinations, and they are flagged as exact, edited (if they can be corrected according to the distance rules) or failed matches. Note that no reads are discarded at this stage. Moreover reads are trimmed for the Nextera transposome sequence with Trimmomatic. New FASTQ files are generated for READ1 and READ2, which serve as input for mapping.

python sc_atac_fastq_fix_trim.py -R1 READ1 -R2 READ2 -O OUTDIR -P PREFIX

Step 2. Mapping

Standard mapping with bowtie2. "GENOME" points to your favorite bowtie genome index. Standard error is redirected to a LOG file of your choosing. The output is a combined BAM file.

bowtie2 -p 8 -X 2000 -3 1 -x GENOME -1 READ1 -2 READ2 2> LOG | samtools view -bS - > OUTBAM

Step 3. Quality filtering

After mapping, the BAM file is filtered for low quality mapping reads, unwanted chromosomes and reads with failed or ambigous barcodes. The output is a sorted BAM file.

Drosophila

samtools view -h -f3 -F12 -q10 INBAM | grep -v '[0-9]'$'\t'chrM | grep -v '[0-9]'$'\t'chrU | grep -v _CTF_ | grep -v _AMBIG_ | samtools view -Su - | samtools sort -@ 8 - -T temp -o OUTBAM; samtools index OUTBAM

Human

samtools view -h -f3 -F12 -q10 INBAM | grep -v MT | grep -v CHR_ | grep -v KI | grep -v GL | grep -v _CTF_ | grep -v _AMBIG_ | samtools view -Su - | samtools sort -@ 8 - -T temp -o OUTBAM; samtools index OUTBAM

Step 4. Deduplication

Duplicate reads are removed for each cell barcode.

python sc_atac_true_dedup.py INBAM OUTBAM; samtools index OUTBAM

Step 5. Cell calling

In this step, the barcodes that exist in the experiment are retrieved and used together with a read count based cutoff to call cells.

Generate index table of experiment barcodes

All possible combinations of Tn5 and PCR indexes used in the experiment are generated. The INPATH must lead to the location of the index files (pcr_P5.txt, pcr_P7.txt, tn5_i5.txt, tn5_i7.txt). The sample and plate lists must be R objects saved in R data format (.rds). The output are two tab-delimited text files listing the barcodes and the sample / plate assignment.

Rscript indextable_generator.R INPATH SAMPLELIST PLATELIST OUTPATH PREFIX

Count reads per barcode present in the experiment

Using the deduplicated BAM generated above, reads are counted for each barcode. The output is a tab-delimited text file which reports, for each barcode, the read count and the condition / sample assignment (barcodes that don't exist in the experiment are labelled as 'bkgd').

python sc_atac_barcode_read_counter.py INBAM INDEXTABLE OUTFILE

Read count cutoff

Barcodes that exist in the experiment and that are above the specified cutoff are determined to be cells. The script requires the same PREFIX used for the output report above. The cutoff parameter can be either a numeric value (ex. 500, it retains cells > 500 reads) or "auto", in which case a cutoff based on mixture modelling is applied (using package mclust). The cutoff is applied equally to any condition / sample present in the experiment. The output is a new index table containing only the barcodes surviving the cutoff and a .pdf file showing the read count distribution per condition / sample.

Rscript sc_atac_cell_cutoff.R PREFIX CUTOFF

Step 6. Library deconvolution

New BAM files, retaining only the barcodes identified as cells above, are generated per condition / sample.

python sc_atac_library_deconvoluter.py INBAM INDEXTABLE OUTPREFIX .bam

Step 7. Count matrix generation

Reads are counted by cell and by region specified in the BED file. The ouput is a tab-delimited count matrix of regions (rows) by cells (columns).

python sc_atac_window_counter.py INBAM INDEXTABLE INBED OUTFILE [Include sites with no reads (True/False)]

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

sci-ATAC-seq processing pipeline

Step 0. Preparations

Step 1. Barcode fixation and trimming

Step 2. Mapping

Step 3. Quality filtering

Drosophila

Human

Step 4. Deduplication

Step 5. Cell calling

Generate index table of experiment barcodes

Count reads per barcode present in the experiment

Read count cutoff

Step 6. Library deconvolution

Step 7. Count matrix generation

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 36 Commits
LICENSE		LICENSE
NexteraPE-PE.fa		NexteraPE-PE.fa
README.md		README.md
indextable_generator.R		indextable_generator.R
pcr_P5.txt		pcr_P5.txt
pcr_P7.txt		pcr_P7.txt
sc_atac_10bpbarcode_split.py		sc_atac_10bpbarcode_split.py
sc_atac_barcode_read_counter.py		sc_atac_barcode_read_counter.py
sc_atac_cell_cutoff.R		sc_atac_cell_cutoff.R
sc_atac_fastq_fix_trim.py		sc_atac_fastq_fix_trim.py
sc_atac_library_deconvoluter.py		sc_atac_library_deconvoluter.py
sc_atac_true_dedup.py		sc_atac_true_dedup.py
sc_atac_window_counter.py		sc_atac_window_counter.py
sciATAC.yml		sciATAC.yml
tn5_i5.txt		tn5_i5.txt
tn5_i7.txt		tn5_i7.txt
trimmomatic-0.32.jar		trimmomatic-0.32.jar

License

ssecchia/sci-ATAC_processing

Folders and files

Latest commit

History

Repository files navigation

sci-ATAC-seq processing pipeline

Step 0. Preparations

Step 1. Barcode fixation and trimming

Step 2. Mapping

Step 3. Quality filtering

Drosophila

Human

Step 4. Deduplication

Step 5. Cell calling

Generate index table of experiment barcodes

Count reads per barcode present in the experiment

Read count cutoff

Step 6. Library deconvolution

Step 7. Count matrix generation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages