Description
Using NA12878.ILLUMINA.SRP012400.Xprize_HiseqSRR636604.bam (ftp://ftp-trace.ncbi.nih.gov/giab/ftp/technical/NA12878_data_other_projects/alignment/NA12878.ILLUMINA.SRP012400.Xprize_HiseqSRR636604.bam) as an example, WGS_Stats_v1.java produces NA12878.ILLUMINA.SRP012400.Xprize_HiseqSRR636604.wholeGenomeCov.fasta with the coverages all displayed in the wrong position.
The first clue is the FASTA sequence lines:
>1 1 249250622
>2 1 243199374
>3 1 198022431
>4 1 191154277
>5 1 180915261
(etc...)
These imply one-based coordinates (since the first coordinate is 1, the convention would usually be closed intervals on both ends), but each end coordinate is one more than the number of bases in the corresponding reference sequence.
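To make the mismatch concrete, here is a small sketch of the arithmetic for the chromosome 1 header. The class and method names are made up for illustration and are not taken from WGS_Stats_v1.java. Read as a one-based closed interval, `1 249250622` spans 249250622 positions. The header only matches a reference length of 249250621 (the header end minus one, per the observation above) if the end coordinate is treated as exclusive.

```java
public class IntervalCheck {
    // Number of positions in a closed interval [start, end].
    static long closedLength(long start, long end) {
        return end - start + 1;
    }

    // Number of positions in a half-open interval [start, end).
    static long halfOpenLength(long start, long end) {
        return end - start;
    }

    public static void main(String[] args) {
        // Values copied from the ">1 1 249250622" header line.
        long start = 1L;
        long end = 249250622L;

        System.out.println(closedLength(start, end));   // prints 249250622
        System.out.println(halfOpenLength(start, end)); // prints 249250621
    }
}
```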
The first non-zero coverage site in this file should be 10245 (I verified this by looking in IGV), but the output of WGS_Stats_v1.java has 102 rows of zeros, followed by 45 zeros, then a 1. That would place the first non-zero coverage at 10246, one too high.
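The position arithmetic can be sketched as follows. This assumes 100 coverage values per row, which is what the numbers above imply but is not something I checked in the code. The names `FirstCoverage`, `VALUES_PER_ROW`, and `firstNonZeroPosition` are hypothetical.

```java
public class FirstCoverage {
    // Assumption: each row of the coverage output holds 100 values.
    static final int VALUES_PER_ROW = 100;

    // One-based position of the first non-zero value, given the number of
    // all-zero rows before it and the number of zeros preceding it in its own row.
    static long firstNonZeroPosition(long fullZeroRows, long zerosInNextRow) {
        return fullZeroRows * VALUES_PER_ROW + zerosInNextRow + 1;
    }

    public static void main(String[] args) {
        // 102 rows of zeros, then 45 zeros, then a 1.
        System.out.println(firstNonZeroPosition(102, 45)); // prints 10246
    }
}
```

With 102 * 100 + 45 = 10245 zeros ahead of it, the 1 sits at one-based position 10246, whereas IGV shows the first covered base at 10245.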
So, it seems to me that an extra 0 is added at the beginning of every genomic coverage file.
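As a toy illustration of that hypothesis (not the actual code path in WGS_Stats_v1.java, which I have not traced), the sketch below shows how a single extra leading 0 shifts every position up by one, and how dropping it restores the expected positions. The arrays and method names are invented for the example.

```java
import java.util.Arrays;

public class LeadingZeroShift {
    // One-based position of the first non-zero value, or -1 if there is none.
    static int firstNonZero(int[] cov) {
        for (int i = 0; i < cov.length; i++) {
            if (cov[i] != 0) {
                return i + 1;
            }
        }
        return -1;
    }

    // Copy of the array without its first element (requires a non-empty array).
    static int[] dropFirst(int[] cov) {
        return Arrays.copyOfRange(cov, 1, cov.length);
    }

    public static void main(String[] args) {
        int[] expected = {0, 0, 1, 2};    // first coverage at position 3
        int[] observed = {0, 0, 0, 1, 2}; // same data with an extra leading 0

        System.out.println(firstNonZero(expected)); // prints 3
        System.out.println(firstNonZero(observed)); // prints 4
        System.out.println(Arrays.equals(dropFirst(observed), expected)); // prints true
    }
}
```

If the diagnosis is right, the same extra element would also explain why the end coordinate in each FASTA header is one larger than the reference length.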