Dear all,
Hello! I know that this question may have been asked and answered. But I am in trouble, and it confuses me deeply.
I am a student that use SweeD for my research recently. It is really a good software that is very convenient to use.
However, I find that I cannot interpret the results easily.
Here are some of my commond lines:
1st, get chr and grid file
I used a grid value that I thought may divide each chr into ~20kb bins.
- $ head -n 100 input.vcf | grep -f list-chr.txt | sed 's/##contig=<ID=//g; s/,length=/\t/g; s/>//g' | gawk '{printf "%s\t%d\n", $1,$2/20000+1}' > list-chr-gird.txt
- $ head -n 3 list-chr-gird.txt
chr1 1011
chr2 1016
chr3 1024
chr1A 3569
2nd, RUN SweeD
$ while read i; do
chr=$(echo ${i} | gawk '{print $1}')
grid=$(echo ${i} | gawk '{print $2}')
echo "chr: ${chr}; grid: ${grid}"
SweeD -name input.${chr}.20kb -input input-${chr}.vcf -grid ${grid} -minsnps 200 -maf 0.05 -missing 0.1
done < list-chr-gird.txt
The above commond lines all run well! And the results files is also OK. But since I extract top 1% selective regions (Based on StartPos and EndPos, reffering to: #10), I found that many regions are overlapping.
Here are part of my results (I add the chr name at the beginning):
- $ cat results-top.txt
Chr Position Likelihood Alpha StartPos EndPos
chr1A 948409.7871 164.3214 1.314105e-06 8460 10074939
chr1A 1088402.3086 46.14001 1.777709e-06 8460 7837473
chr1A 59245295.5195 141.4263 2.429103e-07 9844380 71364650
chr1A 59265294.4512 164.2197 2.409952e-07 9474848 71364650
when I extract "chr StartPos EndPos" INFO, and used "bedtools merge -i", I got a region "chr1A 8460 71364650", ~71Mb, it's amazing!
Like this, since my ref genom is ~1Gb, I got ~400MB regions as selected regions, it may be impossible!
But in present issues: [https://github.com//issues/10] (#10),
selected regions is [Start, End], or I misunderstand it or grid values?
I also tried to divide each chr into ~50kb (by "gawk '{printf "%s\t%d\n", $1,$2/50000+1}'") or even ~1kb (by "gawk '{printf "%s\t%d\n", $1,$2/1000+1}'"), and got ~400Mb and ~500Mb regions as selected regions, separately.
So, my questions are:
- Is: grid value = (chr length) / (window size)?
- How to choose suitable grid value (based on LD or others?)?
- Why merged selected regions are so long, and how to choose suitable selected regions?
Best Wishes!
Dear all,
Hello! I know that this question may have been asked and answered. But I am in trouble, and it confuses me deeply.
I am a student that use SweeD for my research recently. It is really a good software that is very convenient to use.
However, I find that I cannot interpret the results easily.
Here are some of my commond lines:
1st, get chr and grid file
I used a grid value that I thought may divide each chr into ~20kb bins.
chr1 1011
chr2 1016
chr3 1024
chr1A 3569
2nd, RUN SweeD
The above commond lines all run well! And the results files is also OK. But since I extract top 1% selective regions (Based on StartPos and EndPos, reffering to: #10), I found that many regions are overlapping.
Here are part of my results (I add the chr name at the beginning):
Chr Position Likelihood Alpha StartPos EndPos
chr1A 948409.7871 164.3214 1.314105e-06 8460 10074939
chr1A 1088402.3086 46.14001 1.777709e-06 8460 7837473
chr1A 59245295.5195 141.4263 2.429103e-07 9844380 71364650
chr1A 59265294.4512 164.2197 2.409952e-07 9474848 71364650
when I extract "chr StartPos EndPos" INFO, and used "bedtools merge -i", I got a region "chr1A 8460 71364650", ~71Mb, it's amazing!
Like this, since my ref genom is ~1Gb, I got ~400MB regions as selected regions, it may be impossible!
But in present issues: [https://github.com//issues/10] (#10),
selected regions is [Start, End], or I misunderstand it or grid values?
I also tried to divide each chr into ~50kb (by "gawk '{printf "%s\t%d\n", $1,$2/50000+1}'") or even ~1kb (by "gawk '{printf "%s\t%d\n", $1,$2/1000+1}'"), and got ~400Mb and ~500Mb regions as selected regions, separately.
So, my questions are:
Best Wishes!