The selective regions to find candidate genes 

Dear all, 
Hello! I know that this question may have been asked and answered. But I am in trouble, and it confuses me deeply.
I am a student that use SweeD for my research recently. It is really a good software that is very convenient to use.
However, I find that I cannot interpret the results easily.

-------------------
Here are some of my commond lines: 

## 1st, get chr and grid file
I used a grid value that I thought may divide each chr into ~20kb bins.
- $ head -n 100 input.vcf | grep -f list-chr.txt | sed 's/##contig=<ID=//g; s/,length=/\t/g; s/>//g' | gawk '{printf "%s\t%d\n", $1,$2/20000+1}' > list-chr-gird.txt
- $ head -n 3 list-chr-gird.txt
chr1	1011
chr2	1016
chr3	1024
chr1A   3569

## 2nd, RUN SweeD 
```
$ while read i; do 
  chr=$(echo  ${i} | gawk '{print $1}')
  grid=$(echo  ${i} | gawk '{print $2}')
  echo "chr: ${chr}; grid: ${grid}"
  SweeD -name input.${chr}.20kb -input input-${chr}.vcf -grid ${grid} -minsnps 200 -maf 0.05 -missing 0.1
done < list-chr-gird.txt
```
---------------------------
The above commond lines all run well! And the results files is also OK. But since I extract top 1% selective regions (Based on StartPos and EndPos, reffering to: https://github.com/alachins/sweed/issues/10), I found that many regions are overlapping.
Here are part of my results (I add the chr name at the beginning): 
- $ cat results-top.txt
Chr	Position	Likelihood	Alpha	StartPos	EndPos
chr1A	948409.7871	164.3214	1.314105e-06	8460	10074939
chr1A	1088402.3086	46.14001	1.777709e-06	8460	7837473
chr1A	59245295.5195	141.4263	2.429103e-07	9844380	71364650
chr1A	59265294.4512	164.2197	2.409952e-07	9474848	71364650

when I extract "chr StartPos EndPos" INFO, and used "_bedtools merge -i_", I got a region "chr1A 8460 71364650", ~71Mb, it's amazing!

-----------------------------
Like this, since my ref genom is ~1Gb, I got ~400MB regions as selected regions, it may be impossible!
But in present issues: [https://github.com/alachins/sweed/issues/10] (https://github.com/alachins/sweed/issues/10), 
**selected regions is [Start, End], or I misunderstand it or grid values?**
I also tried to divide each chr into ~50kb (by "_gawk '{printf "%s\t%d\n", $1,$2/50000+1}'_") or even ~1kb (by "_gawk '{printf "%s\t%d\n", $1,$2/1000+1}'_"), and got ~400Mb and ~500Mb regions as selected regions, separately.

So, my questions are: 
1.  Is: grid value = (chr length) / (window size)?
2.  How to choose suitable grid value (based on LD or others?)?
3.  Why merged selected regions are so long, and how to choose suitable selected regions?

Best Wishes!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The selective regions to find candidate genes #23

1st, get chr and grid file

2nd, RUN SweeD

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

The selective regions to find candidate genes #23

Description

1st, get chr and grid file

2nd, RUN SweeD

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions