Issue with region size estimation for VCF input #3

@twooldridge

Hello,

I've been testing RAiSD on subregions of my VCF dataset, because the combination of sample size and genome size makes each chromosome VCF quite large. I've run into a strange error message for some of these subregions. Here is an example, including the command I'm running:

 Command line        :	RAiSD-AI -n Chr24_25000001_50000000_large -I vcfs/24.25000001.50000000.vcf -S ./poplists/large.txt -y 2 -M 0 -w 50 -f -R -D -A 0.01 -op RSD-DEF 
 Operation mode      :	mu-statistic scan
 Window width        :	50
 Sample size         :	404 [Total: 1193, Not found: 0, Requested 404]
 Dataset format      :	vcf
 var-exp             :	1.0
 sfs-exp             :	1.0
 ld-exp              :	1.0
 Rscript version     :	
 A pattern structure of 131072 patterns (max. capacity) and approx. 16 MB memory footprint has been created.

 The pattern structure has been resized to 74898 patterns (max. capacity) and approx. 16 MB memory footprint.


ERROR: A VCF entry is found at position 49914576, whereas the region size is set to 49914531 via -B.
       (-B is not required with VCF files)

What confuses me is that the error refers to the -B flag, which I don't set myself; I imagine it is set internally after the VCF is parsed. Here's the output from a run that finished successfully, using the same parameters, the same sample set, and a different region of the same physical size (25 Mb):

 Command line        :  RAiSD-AI -n Chr6_50000001_75000000_large -I vcfs/6.50000001.75000000.vcf -S /private/groups/shapirolab/brock/cows/poplists/large.txt -y 2 -M 0 -w 50 -f -R -D -A 0.01 -o>
 Operation mode      :  mu-statistic scan
 Window width        :  50
 Sample size         :  404 [Total: 1193, Not found: 0, Requested 404]
 Dataset format      :  vcf
 var-exp             :  1.0
 sfs-exp             :  1.0
 ld-exp              :  1.0
 Rscript version     :  
 A pattern structure of 131072 patterns (max. capacity) and approx. 16 MB memory footprint has been created.

 The pattern structure has been resized to 74898 patterns (max. capacity) and approx. 16 MB memory footprint.

 0: Set 6 | Sites 909899 | SNPs 94777 | Region 74999978 - muVar 51217712 1.343e+00 | muSFS 53015643 6.835e+00 | muLD 57112756 5.482e+00 | mu 59933220 1.048e+01 

 Sets (total)         : 1
 Sets (processed)     : 1
 Sets (not processed) : 0

 Total execution time 2285.44852 seconds
 Total memory footprint 92733 kbytes

All VCFs are composed entirely of biallelic SNPs with MAF greater than 0. Also, here's a snapshot of the variants surrounding the region stop site for the first example (INFO and genotypes omitted for space). Nothing looks unusual, and in the full VCF there are thousands of variants following it:

[Image: snapshot of the VCF records around the region stop site]

Any thoughts as to what might be driving this? I'm going to try to keep narrowing down the problem VCF and see if I can identify what's causing the error. Thanks in advance for your help!
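
One way to narrow this down could be to summarize the POS column of the problem VCF and see whether anything stands out around the reported region size (49914531). The following is a minimal sketch under stated assumptions: the function name `summarize_vcf_positions` is hypothetical, the input is an uncompressed, tab-delimited VCF, and the check is independent of RAiSD (it does not reproduce how RAiSD derives the region size internally).

```python
import io


def summarize_vcf_positions(lines, limit=None):
    """Summarize the POS column of a VCF given as an iterable of text lines.

    `limit` is an optional position threshold (for example, the region size
    reported in the error message). This is a standalone check and makes no
    assumption about how RAiSD itself parses the file.
    """
    first = last = max_pos = None
    n_records = n_out_of_order = n_past_limit = 0
    chroms = set()

    for line in lines:
        # Skip blank lines and header/meta lines ("##..." and "#CHROM ...").
        if not line.strip() or line.startswith("#"):
            continue
        # Only CHROM and POS (the first two columns) are needed.
        fields = line.rstrip("\n").split("\t", 2)
        chrom, pos = fields[0], int(fields[1])

        chroms.add(chrom)
        n_records += 1
        if first is None:
            first = pos
        # Count records whose position is smaller than the previous record's.
        if last is not None and pos < last:
            n_out_of_order += 1
        if limit is not None and pos > limit:
            n_past_limit += 1
        max_pos = pos if max_pos is None else max(max_pos, pos)
        last = pos

    return {
        "records": n_records,
        "chroms": sorted(chroms),
        "first_pos": first,
        "last_pos": last,
        "max_pos": max_pos,
        "out_of_order": n_out_of_order,
        "past_limit": n_past_limit,
    }


# Tiny in-memory example (made-up records, not taken from the real dataset).
demo = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
    "24\t100\t.\tA\tG\n"
    "24\t250\t.\tC\tT\n"
    "24\t200\t.\tG\tA\n"
)
print(summarize_vcf_positions(demo, limit=220))
```

For the failing region, this could be called as `summarize_vcf_positions(open("vcfs/24.25000001.50000000.vcf"), limit=49914531)`. A nonzero `out_of_order` count, more than one entry in `chroms`, or a `last_pos` that differs from `max_pos` would each be worth a closer look, though none of them is confirmed to be the cause of the error.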

Best,
Brock
