TERate

TERate is a computational pipeline to measure transcription elongation rates (TERs) with 4sUDRB-Seq.

Features

Measure transcription elongation rates (TERs) with 4sUDRB-Seq.
Different time points should be calculated separately.

BAM format file

The BAM file should come from read mapping with TopHat, Bowtie, Bowtie2 or BWA.

## Prerequisites

### Software / Package

Usage:


To obtain the average read (Hits) distribution from 4sUDRB-Seq data, the BAM file is first converted to bedgraph format. The walkthrough below uses the TopHat (v2.0.9) output of a 4sUDRB-Seq 10-minute sample as the input BAM file. All C++ programs and shell scripts are marked in bold italic.

    1. Add the TERate directory to your $PATH:
export PATH="$HOME/TERate-master:$PATH"

or copy all scripts to your current working directory ('TERate_output/'):

mkdir TERate_output
cd TERate_output/
cp ~/TERate-master/bam2bedgraph ../TERate_output/
cp ~/TERate-master/gene_to_window ../TERate_output/
cp ~/TERate-master/split_bedgraph.sh ../TERate_output/
cp ~/TERate-master/split_refFlat.sh ../TERate_output/
cp ~/TERate-master/bedgraph_to_hits ../TERate_output/
cp ~/TERate-master/TER_calculate ../TERate_output/
    1. Convert the BAM file to bedgraph format with the 'bam2bedgraph' script (from bedtools). Then select the longest isoform of each gene and split the gene annotation file refFlat_hg38.txt (downloaded from the UCSC Genome Browser) into 300 bp bins/windows with the 'gene_to_window' script.
./bam2bedgraph accepted_hits.bam > accepted_hits.bedgraph

perl -alne '$,="\t";print (@F[0,1],$F[5]-$F[4])' refFlat_hg38.txt |sort -k1,1 -k3,3gr |sort -k1,1 -u |cut -f 2 > refFlat_hg38_longiso.eid
perl select_ID.pl refFlat_hg38.txt refFlat_hg38_longiso.eid 2 > refFlat.txt

./gene_to_window refFlat.txt 300 > refFlat_bins.txt
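The longest-isoform selection above keeps, for each gene, the isoform ID with the largest txEnd - txStart. The following is a minimal sketch of the same idea in plain awk, run on a toy annotation file; the gene names, isoform IDs, coordinates and file names are hypothetical, and this is an illustration rather than a replacement for the commands above.

```shell
# Toy refFlat-like file: tab-separated, columns as described in the Note section
# (geneName, isoformName, chrom, strand, txStart, txEnd, ...). All values are made up.
tmp=$(mktemp -d)
printf 'GENE_A\tNM_0001\tchr1\t+\t100\t900\t100\t900\t1\t100,\t900,\n'      >  "$tmp/toy_refFlat.txt"
printf 'GENE_A\tNM_0002\tchr1\t+\t100\t2500\t100\t2500\t1\t100,\t2500,\n'   >> "$tmp/toy_refFlat.txt"
printf 'GENE_B\tNM_0003\tchr2\t-\t5000\t5600\t5000\t5600\t1\t5000,\t5600,\n' >> "$tmp/toy_refFlat.txt"

# For each gene (column 1) remember the isoform (column 2) with the largest
# transcript span (column 6 minus column 5); on ties the first one seen is kept.
awk -F'\t' '
    { len = $6 - $5 }
    !($1 in best) || len > best[$1] { best[$1] = len; id[$1] = $2 }
    END { for (g in id) print id[g] }
' "$tmp/toy_refFlat.txt" | sort > "$tmp/toy_longiso.eid"

cat "$tmp/toy_longiso.eid"
```

For GENE_A the 2400 bp isoform NM_0002 is kept over the 800 bp isoform NM_0001, so the ID list contains NM_0002 and NM_0003.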
    1. To reduce running time, we suggest splitting the bedgraph file (accepted_hits.bedgraph) and the binned refFlat file (refFlat_bins.txt) by chromosome with the 'split_bedgraph.sh' and 'split_refFlat.sh' scripts. Create a working directory 'split/' and run both scripts inside it; they can be started with 'nohup' to run in the background.
mkdir split
cd split/
sh ../split_bedgraph.sh ../accepted_hits.bedgraph
sh ../split_refFlat.sh ../refFlat_bins.txt
    1. After 'split_refFlat.sh' and 'split_bedgraph.sh' have finished, use 'bedgraph_to_hits' to calculate the read number (Hits) for each bin/window (this takes about 3-4 hr). The command below starts one 'nohup' background job per chromosome.
ls |grep "bin" |awk -F"_" '{print "nohup ../bedgraph_to_hits "$1"_bedgraph.txt "$1"_bin.txt > "$1"_hits.txt &"}' |sh
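Because the one-liner above generates its commands and pipes them straight into 'sh', it can help to preview them first. The sketch below assumes the per-chromosome file naming implied by that command (chrN_bin.txt and chrN_bedgraph.txt), creates empty placeholder files in a temporary directory, and prints the generated commands without running them.

```shell
# Placeholder files standing in for the real split output (names inferred
# from the command above; contents are irrelevant for a dry run).
tmp=$(mktemp -d)
cd "$tmp" || exit 1
touch chr1_bedgraph.txt chr1_bin.txt chrX_bedgraph.txt chrX_bin.txt

# Same awk as in the pipeline, but without the final "| sh":
# the commands are only collected and printed.
cmds=$(ls | grep '_bin\.txt$' | awk -F"_" '{print "nohup ../bedgraph_to_hits "$1"_bedgraph.txt "$1"_bin.txt > "$1"_hits.txt &"}')
printf '%s\n' "$cmds"
```

Matching on the '_bin.txt' suffix instead of the bare string "bin" is a small precaution here, so that unrelated files with "bin" in their name do not produce extra jobs.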
    1. When 'bedgraph_to_hits' has finished, return to the 'TERate_output/' directory, combine all hit results and sort them by gene name.
cd ../
cat split/chr1_hits.txt split/chr2_hits.txt split/chr3_hits.txt split/chr4_hits.txt split/chr5_hits.txt split/chr6_hits.txt split/chr7_hits.txt split/chr8_hits.txt split/chr9_hits.txt split/chr10_hits.txt split/chr11_hits.txt split/chr12_hits.txt split/chr13_hits.txt split/chr14_hits.txt split/chr15_hits.txt split/chr16_hits.txt split/chr17_hits.txt split/chr18_hits.txt split/chr19_hits.txt split/chr20_hits.txt split/chr21_hits.txt split/chr22_hits.txt split/chrX_hits.txt split/chrY_hits.txt split/chrM_hits.txt > combine_hits.txt
sort -k4,4 -k1,1 -k2,2n -k3,3nr combine_hits.txt > sorted_hits.txt
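The explicit 'cat' above fails if any of the 25 per-chromosome files is absent (for example chrY in a sample without Y-chromosome reads). As a hedged alternative, the loop below concatenates only the files that exist and reports how many were not found. The toy hit lines and their column layout (gene name in column 4) are assumptions made for illustration only.

```shell
# Toy split directory with two hypothetical hits files.
tmp=$(mktemp -d)
mkdir "$tmp/split"
printf 'chrX\t0\t300\tGENE_B\t2\n' > "$tmp/split/chrX_hits.txt"
printf 'chr1\t0\t300\tGENE_A\t5\n' > "$tmp/split/chr1_hits.txt"

# Concatenate chr1..chr22, chrX, chrY, chrM in order, skipping missing files.
: > "$tmp/combine_hits.txt"
missing=0
for c in $(seq 1 22) X Y M; do
    f="$tmp/split/chr${c}_hits.txt"
    if [ -f "$f" ]; then
        cat "$f" >> "$tmp/combine_hits.txt"
    else
        missing=$((missing + 1))
    fi
done
echo "chromosome files not found: $missing" >&2

# Same sort keys as in the pipeline: gene name first, then the remaining columns.
sort -k4,4 -k1,1 -k2,2n -k3,3nr "$tmp/combine_hits.txt" > "$tmp/sorted_hits.txt"
cat "$tmp/sorted_hits.txt"
```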
    1. Calculate the transcription elongation rate for each gene with the 'calculate_TER' script.
./calculate_TER sorted_hits.txt 10 300 |sort -k1,1 -k4,4nr |awk '{a[$1,++b[$1]]=$0}END{for(i in b)print a[i,1]}' > TERate_output.txt
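The 'sort ... | awk ...' tail of the command above keeps one line per gene: rows are ordered by gene name and then by column 4 in descending numeric order, and awk prints the first row stored for each gene. The sketch below runs that same filter on toy rows in the four-column output format (geneName, isoformName, strand, TER); all names and TER values are made up.

```shell
# Toy rows in the output format described in the Output section; values are hypothetical.
tmp=$(mktemp -d)
printf 'GENE_A\tNM_0001\t+\t1200\n' >  "$tmp/toy_ter.txt"
printf 'GENE_B\tNM_0003\t-\t2100\n' >> "$tmp/toy_ter.txt"
printf 'GENE_A\tNM_0002\t+\t3400\n' >> "$tmp/toy_ter.txt"

# Sort by gene, then by TER (column 4) descending; store rows per gene in
# arrival order and print the first stored row of each gene.
# The trailing "sort" only makes the output order deterministic, because
# awk's "for (i in b)" iteration order is unspecified.
sort -k1,1 -k4,4nr "$tmp/toy_ter.txt" \
    | awk '{a[$1,++b[$1]]=$0}END{for(i in b)print a[i,1]}' \
    | sort > "$tmp/toy_out.txt"

cat "$tmp/toy_out.txt"
```

Here GENE_A keeps the row with TER 3400 and drops the row with TER 1200.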

'TERate_output.txt' is the result of the TERate pipeline.


Note

The gene annotation file refFlat.txt uses the format below (Gene Predictions and RefSeq Genes with Gene Names); see the example file for details.

| Field | Description |
| :--- | :--- |
| geneName | Name of gene |
| isoformName | Name of isoform |
| chrom | Reference sequence |
| strand | + or - for strand |
| txStart | Transcription start position |
| txEnd | Transcription end position |
| cdsStart | Coding region start |
| cdsEnd | Coding region end |
| exonCount | Number of exons |
| exonStarts | Exon start positions |
| exonEnds | Exon end positions |
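A quick sanity check of an annotation file against the fields listed above can be sketched as follows. It assumes a tab-separated file with exactly these 11 columns and flags lines with a wrong column count, an unexpected strand value, or txStart not smaller than txEnd. The toy records are hypothetical, and the second one is deliberately malformed.

```shell
# Toy annotation file: first line valid, second line has txStart > txEnd.
tmp=$(mktemp -d)
printf 'GENE_A\tNM_0001\tchr1\t+\t100\t900\t150\t850\t2\t100,500,\t300,900,\n' >  "$tmp/toy_refFlat.txt"
printf 'GENE_B\tNM_0002\tchr2\t-\t5000\t4000\t5000\t4000\t1\t5000,\t4000,\n'   >> "$tmp/toy_refFlat.txt"

# Report "line number: gene name" for every record that fails a check.
bad=$(awk -F'\t' '
    NF != 11 || ($4 != "+" && $4 != "-") || ($5 + 0) >= ($6 + 0) { print NR": "$1 }
' "$tmp/toy_refFlat.txt")
printf '%s\n' "$bad"
```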

Output

See details in the example file.

| Field | Description |
| :--- | :--- |
| geneName | Name of gene |
| isoformName | Name of isoform |
| strand | + or - for strand |
| TER | Transcription elongation rate (bp/min) |

Requirements

Citation

Zhang Y*, Xue W*, Li X, Zhang J, Chen S, Zhang JL, Yang L# and Chen LL#. The Biogenesis of Nascent Circular RNAs. Cell Rep, 2016.

License

Copyright (C) 2016 YangLab. See the LICENSE file for license rights and limitations (MIT).