-
Notifications
You must be signed in to change notification settings - Fork 57
Description
I have been generating binary .glf.gz files from lists of .bam files, using -doGlf 1, and then using these as inputs for making .maf.gz files to get SNP lists for downstream analyses.
My reference genome lists the chromosomes as:
Chr1
Chr2
Chr3
Chr4
Chr5
Chr6
Chr7
Chr8
Chr9
Chr10
Chr11
Chr12
Chr13
Chr14
Chr15
Chr16
Chr17
ChrZ
Chrw
And this labelling structure is reflected in the associated .fai and .dict files.
Because I've been using -doGlf 1, I can't view my output.glf.gz files directly, but from looking at the .err reports generated by my script, it looks like they are correctly parsing through all the chromosomes, because they say things like:
-> Printing at chr: Chr1 pos:1667293 chunknumber 100 contains 1859 sites
...
-> Printing at chr: Chr2 pos:26107858 chunknumber 89500 contains 1169 sites
...
-> Printing at chr: Chr3 pos:106138 chunknumber 145200 contains 1048 sites
...
...
-> Printing at chr: Chrw pos:4770405 chunknumber 768400 contains 1415 sites
However, the maf.gz file generated from the next step lists only Chr1, and the .err file associated with that script goes from
-> Printing at chr: Chr1 pos:4951 chunknumber 100 contains 50 sites
to
-> Printing at chr: Chr1 pos:1360109951 chunknumber 27202200 contains 50 sites
I'm not sure what might be happening to cause this error. Here is an example of the code I have been using to generate the .maf.gz files:
angsd -glf $RESDIR/WGS-mothers-glf-1.glf.gz -fai "$REF".fai -anc "$REF" -out $RESDIR/${OUTNAME} \
-nThreads 32 -nInd 9 -doMajorMinor 5 -doMaf 1 -minInd 1 \
-rmTriallelic 0.05 -SNP_pval 1e-6
$REF directs to the .fa reference genome.
This error is occurring across all of the files I've generated, and happens whether I've used -doMajorMinor 5 (with -anc), or -doMajorMinor 1 (no -anc).
Does anyone know what might be happening to cause this?
Thank you for your help.
UPDATE:
I tried running -doGlf 4 on my smallest dataset, just to confirm it wasn't happening at that stage.
This is the code I used to generate the .glf.gz file:
angsd -b $DIR/angsd_bam_list.txt -ref $REF -out $RESDIR/${OUTNAME} \
-nThreads 50 -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 \
-minMapQ 20 -minQ 20 -minInd 1 -setMinDepthInd 1 -setMinDepth 3 -setMaxDepth 156 -doCounts 1 \
-GL 1 -doGlf 4
Which did have all its proper chromosomes listed.
And then the next step (in a separate script):
angsd -glf $RESDIR/ddrad-mothers-glf-4.glf.gz -fai "$REF".fai -anc "$REF" -out $RESDIR/${OUTNAME} \
-nThreads 32 -nInd 6 -doMajorMinor 5 -doMaf 1 -minInd 1 \
-rmTriallelic 0.05 -SNP_pval 1e-6
This did output only sites labelled Chr1, although looking at the file, it's not clear whether it is relabelling them, or just broke and truncated itself? These are the last few lines:
Chr1 7334606 C T C 1.000000 0.000000e+00 1
Chr1 7334608 C A C 1.000000 0.000000e+00 1
Chr1 733461
Second update: I found this error message in the .err file, but I don't see it in other .maf.gz file generating script .err files.
-> Printing at chr: Chr1 pos:7424951 chunknumber 148500 contains 50 sites
[glfReader.cpp:fetch():30] Error reading full chunk: bytesRead:311 expected:480 will exit
For reference, here is the code for the same data set generating a binary glf:
angsd -b $DIR/angsd_bam_list.txt -ref $REF -out $RESDIR/${OUTNAME} \
-nThreads 50 -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 \
-minMapQ 20 -minQ 20 -minInd 1 -setMinDepthInd 1 -setMinDepth 3 -setMaxDepth 156 -doCounts 1 \
-GL 1 -doGlf 1
And the next step:
angsd -glf $RESDIR/ddrad-mothers-glf-1-corr.glf.gz -fai "$REF".fai -anc "$REF" -out $RESDIR/${OUTNAME} \
-nThreads 32 -nInd 6 -doMajorMinor 5 -doMaf 1 -minInd 1 \
-rmTriallelic 0.05 -SNP_pval 1e-6
And the last few lines of the .maf.gz file thus generated:
Chr1 5855786 T C T 0.999998 0.000000e+00 4
Chr1 5855787 G A G 0.999998 0.000000e+00 4
Chr1 5855788 T G T 0.999998 0.000000e+00 4
And the last few lines of that script's .err file:
-> Printing at chr: Chr1 pos:5849951 chunknumber 117000 contains 50 sites
-> Printing at chr: Chr1 pos:5854951 chunknumber 117100 contains 50 sites
-> Done reading data waiting for calculations to finish
-> Waiting for nThreads:5
-> Done waiting for threads
-> Output filenames:
...
-> Arguments and parameters for all analysis are located in .arg file
-> Total number of sites analyzed: 5855789
-> Number of sites retained after filtering: 3895708
(Sorry for the big wall of text; I am just very confused at what might be happening!)