Skip to content

The selective regions to find candidate genes  #23

@QianghuiZhu

Description

@QianghuiZhu

Dear all,
Hello! I know that this question may have been asked and answered. But I am in trouble, and it confuses me deeply.
I am a student that use SweeD for my research recently. It is really a good software that is very convenient to use.
However, I find that I cannot interpret the results easily.


Here are some of my commond lines:

1st, get chr and grid file

I used a grid value that I thought may divide each chr into ~20kb bins.

  • $ head -n 100 input.vcf | grep -f list-chr.txt | sed 's/##contig=<ID=//g; s/,length=/\t/g; s/>//g' | gawk '{printf "%s\t%d\n", $1,$2/20000+1}' > list-chr-gird.txt
  • $ head -n 3 list-chr-gird.txt
    chr1 1011
    chr2 1016
    chr3 1024
    chr1A 3569

2nd, RUN SweeD

$ while read i; do 
  chr=$(echo  ${i} | gawk '{print $1}')
  grid=$(echo  ${i} | gawk '{print $2}')
  echo "chr: ${chr}; grid: ${grid}"
  SweeD -name input.${chr}.20kb -input input-${chr}.vcf -grid ${grid} -minsnps 200 -maf 0.05 -missing 0.1
done < list-chr-gird.txt

The above commond lines all run well! And the results files is also OK. But since I extract top 1% selective regions (Based on StartPos and EndPos, reffering to: #10), I found that many regions are overlapping.
Here are part of my results (I add the chr name at the beginning):

  • $ cat results-top.txt
    Chr Position Likelihood Alpha StartPos EndPos
    chr1A 948409.7871 164.3214 1.314105e-06 8460 10074939
    chr1A 1088402.3086 46.14001 1.777709e-06 8460 7837473
    chr1A 59245295.5195 141.4263 2.429103e-07 9844380 71364650
    chr1A 59265294.4512 164.2197 2.409952e-07 9474848 71364650

when I extract "chr StartPos EndPos" INFO, and used "bedtools merge -i", I got a region "chr1A 8460 71364650", ~71Mb, it's amazing!


Like this, since my ref genom is ~1Gb, I got ~400MB regions as selected regions, it may be impossible!
But in present issues: [https://github.com//issues/10] (#10),
selected regions is [Start, End], or I misunderstand it or grid values?
I also tried to divide each chr into ~50kb (by "gawk '{printf "%s\t%d\n", $1,$2/50000+1}'") or even ~1kb (by "gawk '{printf "%s\t%d\n", $1,$2/1000+1}'"), and got ~400Mb and ~500Mb regions as selected regions, separately.

So, my questions are:

  1. Is: grid value = (chr length) / (window size)?
  2. How to choose suitable grid value (based on LD or others?)?
  3. Why merged selected regions are so long, and how to choose suitable selected regions?

Best Wishes!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions