diff --git a/README.md b/README.md index 79fbd74..6b89813 100644 --- a/README.md +++ b/README.md @@ -7,7 +7,7 @@ Developed by [Bishwa K. Giri](mailto:kirannbishwa01@gmail.com) in the [Remington Lab](website?) at the University of North Carolina at Greensboro, Biology department. ## Citation -Giri, B. K., Remington D. L. Haplotype phasing in heterogenous genome and hybrids using phase-Extender and phase-Stitcher. biorxiv (2018) [not uploaded yet]. +Giri, B. K., Remington D. L. Haplotype phasing in heterozygous genome and hybrids using phase-Extender and phase-Stitcher. biorxiv (2018) [not uploaded yet]. ## AUTHOR/SUPPORT Bishwa K. Giri (bkgiri@uncg.edu; kirannbishwa01@gmail.com) \ @@ -22,35 +22,38 @@ Support @ https://groups.google.com/d/forum/phase-extender
# BACKGROUND -Haplotype phasing is a second "go to" problem in bioinformatics after read alignment. The importance of haplotype phasing applies directly to the analyses of ASE (allele specific expression), preparation of extended haplotype for EHH (extended haplotype homozygosity) test, and preparation of dipolid genome which will soon be a new standard in bioinformatics in coming years, etc. The necessity for haplotype phasing (and eventually diploid genome) increases with the increase in heterozygosity in the genome because higher hetetogeneity leads to larger alignment bias and complicates the reliability of the variants that are called using that alignment data (SAM, BAM files). +Haplotype phasing is the second "go to" problem in bioinformatics after read alignment. The importance of haplotype phasing applies directly to the analyses of ASE (allele specific expression), preparation of extended haplotype for EHH (extended haplotype homozygosity) test, and preparation of a diploid genome which will soon be a new standard in bioinformatics in the coming years. The necessity for haplotype phasing (and eventually diploid genome) increases with the increase in heterozygosity in the genome because higher hetetogeneity leads to larger alignment bias and complicates the reliability of the variants that are called using alignment data (SAM, BAM files). -The classical approach to haplotype phasing involves application of LD (linkage disequilibrium) test between two heterozygous markers, that began with the preparation of genetic map by Alfred Sturtevant. Any existing population based haplotype phasing tool therefore uses the LD test with varying degree of complexity based on available sample size, markers, types of the marker and relationship between samples. For haplotype phasing tools like [Beagle](https://faculty.washington.edu/browning/beagle/beagle.html), [ShapeIT](http://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html), [impute2](http://mathgen.stats.ox.ac.uk/impute/impute_v2.html), etc are used dominantly. These tools make use of the variants (SNPs, InDels) along the length of the genome by treating chain of variants along genomic positions singly and independently. So, for a diploid organism that contains chain of variants with "n" heterozygote sites there exists "2^n" possible haplotypes. To certain extent this "2^n" problem is approached and eased by sampling genotype data from short genomic regions and applying identity by descent (IBD), most common haplotype method, etc. on that fragment to infer the possible haplotype in that region. However, application of these methods may not be optimal in solving phase states in organisms with heterogenous genome, are hybrids and/or have very few or no reference panels and abundant genotype data. +The classical approach to haplotype phasing involves application of LD (linkage disequilibrium) test between two heterozygous markers, that began with the preparation of genetic map by Alfred Sturtevant. Any existing population based haplotype phasing tool therefore uses the LD test with varying degree of complexity based on available sample size, markers, types of the marker and relationship between samples. For haplotype phasing tools like [Beagle](https://faculty.washington.edu/browning/beagle/beagle.html), [ShapeIT](http://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html), [impute2](http://mathgen.stats.ox.ac.uk/impute/impute_v2.html), etc are used dominantly. These tools make use of the variants (SNPs, InDels) along the length of the genome by treating chain of variants along genomic positions singularly and independently. So, for a diploid organism that contains chain of variants with "n" heterozygotic sites there exists "2^n" possible haplotypes. To a certain extent this "2^n" problem is approached and eased by sampling genotype data from short genomic regions and then applying a popular method called identity by descent (IBD) on that fragment to infer the possible haplotype. However, application of these methods may not be optimal in solving phase states in organisms with a heterozygous genome, hybrids and/or have very few or no reference panels and abundant genotype data. **The main issues of the tools that deal with haplotype phasing in "2^n" way can be summarized as:** - Increased computation burden due to "2^n" problem. - Problem in phasing rare variants. - Is mostly applicable to human and organism with reference haplotype panel and abundant genomic data. - - Not optimal for organisms that have heterogenous genome, or are outbreds, or are hybrids, or belong to recent group of organisms that have reference genome prepared. + - Not optimal for organisms that have heterozygous genome, or are outbreds, or are hybrids, or belong to recent group of organisms that have reference genome prepared. -Therefore, using LD between two adjacent SNPs with small population of samples isn't able to provide enough resolution to solve GW haplotype preparation, which could result in excessive switch errors. Additionally, in heterogenous and hybrid genome, problems arising due to switch errors in downstream analyses could be manifold. +Therefore, using LD between two adjacent SNPs with small population of samples will not be able to provide enough resolution to solve GW haplotype preparation, which could result in excessive switch errors. Additionally, in heterozygous and hybrid genomes, problems arising due to switch errors in downstream analyses could be manifold. ReadBackphasing is slowly emerging as a new and more reliable method for preparation of short range haplotypes by joining heterozygous variants that are covered by sequence reads. These short haplotypes can be further elongated by adding sequencing reads of longer length (PacBio reads) or by adding more genomic and RNAseq reads from the same individual; see tools [WhatsHap](https://whatshap.readthedocs.io/en/latest/), [hapCut](https://github.com/vibansal/hapcut), [phaser](https://github.com/secastel/phaser/blob/master/phaser/README.md) etc. -**But, issues with existing RBphase method and/or tools still remain and can be summmarized as:** + +**But, issues with existing RBphase method and/or tools still remain and can be summarized as:** - are mainly aimed at preparing long range haplotype but not necessarily genome wide. - - existing RBphase methods are only targeted at individual or family level, i.e they required multiple inputs of "BAM" and "VCF" files for the same individual and/or trios. The increase in the size of phased haplotype blocks only depend upon multiple BAM files, or multiple sets of longer reads from the same individual. + - existing RBphase methods are only targeted at individual or family level, i.e they require multiple inputs of "BAM" and "VCF" files for the same individual and/or trios. The increase in the size of phased haplotype blocks only depend upon multiple BAM files, or multiple sets of longer reads from the same individual. - Referring to point 2 -> integration of RBPhased data with population based phasing is still missing i.e they are not able to solve phase state of two haplotype blocks in same sample by using information phase state of the haplotype from other samples. -Even with large PE reads in future there will always be coverage issues which will break genome wide haploltype in multiple segments. It is therefore imperative to establish a method that is able to integrate the RBphased data among samples to solve and prepare long range haplotypes in individual genomes. While integrating RBphase methods with popualtion data, the concept of LD still applies but the phasing accuracy are highly increased owing to phased information contained in short haplotype blocks across several samples. Addtionally, this method will put heterogenous genome at most advantage because of large RBphase blocks they are able to produce. +Even with large PE reads in future there will always be coverage issues which will break genome wide haplotype in multiple segments. It is therefore imperative to establish a method that is able to integrate the RBphased data among samples to solve and prepare long range haplotypes in individual genomes. While integrating RBphase methods with population data, the concept of LD still applies but the phasing accuracy is highly increased owing to phased information contained in short haplotype blocks across several samples. Additionally, this method will put heterozygous genome at most advantage because of large RBphase blocks they are able to produce. -For specimen that don't have multiple genomic and rnaseq sources, reference haplotype panel the existing RBphasing tools are only able to yield short range haplotypes. Even with multiple bam inputs these tools are still not aimed at preparing genome wide haplotype, but rather are limited at preparing long streches of haplotype for certain part of a genome. Additionally, these RBphase tools do not have method to solve phase state between two adjacent haplotype blocks that are not spanned by sequence reads (in a sample) i.e there is no method incorporated to transfer readbackphased haplotype data information across population of samples. +For specimens that don't have multiple genomic and transcriptomic sources, reference haplotype panel the existing RBphasing tools are only able to yield short range haplotypes. Even with multiple bam inputs these tools are still not aimed at preparing genome wide haplotype, but rather are limited at preparing long streches of haplotype for certain parts of a genome. Additionally, these RBphase tools do not have method to solve phase state between two adjacent haplotype blocks that are not spanned by sequence reads (in a sample) i.e there is no method incorporated to transfer readbackphased haplotype data information across population of samples. -With increase in the size of PE (paired end) reads from illumina and availability of longer sequences from PacBio, it is possible to prepare longer short range haplotypes that span more than 2 heterozygous site in the genome. Thus, the issue with "2^n" problem can be overcome by **1)** first preparing RBphased haplotype blocks using sequence reads, **2)** Next, the two consecutive haplotypes blocks can be joined in a sample one at a time by reading phase state of the break point in other population of samples. The size of RBphased blocks also increases with increase in heterozygosity in the genome thereby making this method more suitable for highly heterozygous populations. And, there is always more information in RBphased haplotype compared with a single SNP or InDel thereby overcoming the issues with switcherrors when preparing the long range haplotype in heterozygous population. +With increase in the size of PE (paired end) reads from Illumina and availability of longer sequences from PacBio, it is possible to prepare longer short range haplotypes that span more than 2 heterozygous site in the genome. Thus, the issue with "2^n" problem can be overcome by + **1)** first preparing RBphased haplotype blocks using sequence reads, + **2)** Next, the two consecutive haplotypes blocks can be joined in a sample one at a time by reading phase state of the break point in other population of samples. The size of RBphased blocks also increases with increase in heterozygosity in the genome thereby making this method more suitable for highly heterozygous populations. And, there is always more information in RBphased haplotype compared with a single SNP or InDel thereby overcoming the issues with switch-errors when preparing the long range haplotype in heterozygous population. -So, RBphasing method is able to yield higher variants per haplotype block thus making phase-Extender a better method and tool when working with genotype with higher heterozygosity (out crossing population and hybrids). +So, RBphasing method is able to yield higher variants per haplotype block thus making phase-Extender a better method and tool when working with genotypes (or genomes )having higher heterozygosity (out crossing population and hybrids). -Existing haplotype phasing tools (that use RBphase and non-RBphase method) are oriented towards organism that have large genotype data and several haplotype reference panels available. For non-model organism these methods are therefore sub-optimal. RBphasing provides some phasing benefits because the phase state along short blocks can be readily extracted. However, in emerging research models converting the short haplotype fragments to longer haplotype strings can be challenging due to small sample size, low coverage within each sample, lack of enough genotype data, absence of reference panels. **phASE-Extender** overcomes these limitations by using the short range RBphased haplotype blocks from several samples that have several haplotype break points. **phase-Extender** is aimed at filling this technical gap by using the short fragments haplotypes among several samples, where the breakpoint of one sample is bridged by several other samples. +Existing haplotype phasing tools (that use RBphase and non-RBphase method) are oriented towards organism that have large genotype data and several haplotype reference panels available. For non-model organism these methods are therefore sub-optimal. RBphasing provides some phasing benefits because the phase state along short blocks that can be readily extracted. However, in emerging research models converting the short haplotype fragments to longer haplotype strings can be challenging due to a small sample size, low coverage within each sample, lack of enough genotype data, absence of reference panels. **phASE-Extender** overcomes these limitations by using the short range RBphased haplotype blocks from several samples that have several haplotype break points. **phase-Extender** is aimed at filling this technical gap by using the short fragments haplotypes from several samples, where the breakpoint of one sample is bridged by several other samples. -**phASE-Extender** starts with already prepared RBphased haplotype blocks of several individuals at once and joins the two RBphased blocks for a single sample at once using overlapping phase states in other samples of the pool. Thus it is possible to solve the haplotype breakpoints by transfering the phase information between samples that belong to same family, populaton or species. *So, if we have a pool of several samples that have RBphased haplotypes, each sample will have several non-overlapping breakpoints among the samples. We can solve the phase state between two consecutive blocks (of one sample at a time) by computing the likelyhood of the phase states (parallel vs. alternate configuration). Likelihood of each configuration is computed by using the observed phase state in other samples that bridge that breakpoint position.* In **phASE-Extender** we use first order markov chain to compute the relationship between two consecutive haplotype blocks by preparing first order transition matrix from all the nucleotides of former haplotype block to all the nucelotides in the later haplotype block. We then compute cumulative likelyhood estimates of haplotype states for both the possible configuration. The phase state of two consecutive haplotype blocks is then extended if the `computed log2 (likelihood)` of either configuration is above the `threshold log2(likelihood) cutoff`. Using RBphase method we can therefore use small sample size to accurately predict the proper haplotype state. +**phASE-Extender** starts with already prepared RBphased haplotype blocks of several individuals at once and joins the two RBphased blocks for a single sample at once using overlapping phase states in other samples of the pool. Thus it is possible to solve the haplotype breakpoints by transferring the phase information between samples that belong to same family, population or species. *So, if we have a pool of several samples that have RBphased haplotypes, each sample will have several non-overlapping breakpoints among the samples. We can solve the phase state between two consecutive blocks (of one sample at a time) by computing the likelihood of the phase states (parallel vs. alternate configuration). Likelihood of each configuration is computed by using the observed phase state in other samples that bridge that breakpoint position.* In **phASE-Extender** we use first order markov chain to compute the relationship between two consecutive haplotype blocks by preparing first order transition matrix from all the nucleotides of former haplotype blocks to all the nucleotides in the latter haplotype block. We then compute cumulative likelihood estimates of haplotype states for both the possible configuration. The phase state of two consecutive haplotype blocks is then extended if the `computed log2 (likelihood)` of either configuration is above the `threshold log2(likelihood) cutoff`. Using RBphase method we can therefore use small sample size to accurately predict the proper haplotype state. ## ![#f03c15](https://placehold.it/15/f03c15/000000?text=+)Data Requirements @@ -105,7 +108,7 @@ sudo pip3 install -r requirements.txt - Requires a readbackphased `haplotype file` as input and returns an extended haplotype file and other results files containing statistics on the initial vs. extended haplotype. - Optionally, haplotype reference panel (with same data structure as input haplotype) and bed file can be included to gain control over the outcome of phase extension. -Check this detailed [step by step tutorial](https://github.com/everestial/phase-Extender/wiki/phase-Extender-Tutorial) for preparation of `input files` and know-how about running `phase-Extender`. +Check this detailed [step by step tutorial](https://github.com/everestial/phase-Extender/wiki/phase-Extender-Tutorial) for the preparation of `input files` and know-how about running `phase-Extender`.
@@ -118,7 +121,7 @@ The sample name should not contain "`_`" character. ***haplotype reference panel (optional):*** Haplotype reference panel as text file (should have same data structure as input haplotype file). \ To convert the haplotype reference panel (from VCF to proper text format) use **Step 01 (b)** in the tutorial. -***bed file (optional):*** If you goal is to limit phase extension to certain genomic regions (for eg. gene, exon or QTL boundries), we suggest that you provide appropriate bed file. **phase-Extender** is exclusively limited to internal boundries of bed regions. +***bed file (optional):*** If your goal is to limit phase extension to certain genomic regions (for eg. gene, exon or QTL boundries), we suggest that you provide appropriate bed file. **phase-Extender** is exclusively limited to internal boundries of bed regions. # structure of the bed file contig start end # this header is not included @@ -141,11 +144,11 @@ To convert the haplotype reference panel (from VCF to proper text format) use ** * **--python_string** _(python3)_ - Calls `python 3` interpreter to run the program. * **--output** _(SOI_extended)_ - Output directory. * **--snpTh** _(3)_ - snp threshold. Minimum number of SNPs required in each consecutive haplotype block to run phase extension between two blocks. -* **--numHets** _(40)_ - num of heterozygotes. Maximum number of heterozygote SNPs used from each consecutive block to compute maximum likelihood estimate of each configuration between two blocks. -* **--culLH** _(maxPd)_ - cumulation of the likelihood estimates. The likelhoods for two possible configuration can either be "maxed as sum" or "maxed as product". ***Default*** is "max-product". ***Options:*** 'maxPd' or 'maxSum'. +* **--numHets** _(40)_ - num of heterozygotes. Maximum number of heterozygotic SNPs used from each consecutive block to compute maximum likelihood estimate of each configuration between two blocks. +* **--culLH** _(maxPd)_ - cumulation of the likelihood estimates. The likelihoods for two possible configuration can either be "maxed as sum" or "maxed as product". ***Default*** is "max-product". ***Options:*** 'maxPd' or 'maxSum'. * **--lods** _(5)_ - log2 of Odds cut off threshold. The cutoff threshold used to extend consecutive haplotype blocks. **`**Note: Default value is set at (2^5 = 32 times likely). So, two consecutive blocks will be joined in parallel configuration if computed log2(likelihood) > lods threshold ** * **--useSample** _(all)_ - Samples to use in the given input haplotype file (plus reference haplotype) to compute transition matrix. Options: 'all','refHap','input','comma separated name of samples'. Default: all the samples in (refHap + input) will be used. -* **--bed** - Process the haplotype extension only within this bed regions. ***This is useful if you want to limit haplotype extension only within certain regions, like - within genes, exons, introns, QTL boundries, etc.*** +* **--bed** - Process the haplotype extension only within this bed region. ***This is useful if you want to limit haplotype extension only within certain regions, like - within genes, exons, introns, QTL boundries, etc.*** * **--writeLOD** _(no)_ - writes the calculate LODs between two consecutive haplotype blocks when processing phase extension to the output file. **Options:** 'yes', 'no'. **`**Note: the 'lods-score' are printed regardless if the " "consecutive blocks are joined or not.** * **--hapStats** _(no)_ - Prepare descriptive statistics, and histogram of the haplotype size distribution of the input haplotype file vs. extended haplotype for the sample of interest. **Options:** 'yes', 'no' @@ -188,7 +191,7 @@ Contains data from the sites that have unphased or missing GT (genotype) for the
### extended_haplotype_"SOI_"allsites.txt -This file contains ReadBackPhased haplotype after phase extension concated with the missing data. This file contain equal number of row as input haplotype file and data only for sample of interest. +This file contains ReadBackPhased haplotype after phase extension concatenated with the missing data. This file contains equal number of row as input haplotype file and data only for sample of interest.

@@ -225,7 +228,7 @@ Histogram of the distribution of the haplotype size (by genomic range of the hap phase-extender uses first-order-transition probabilities from each level of genotypes from former haplotype block to each level of genotypes to later haplotype block. This version (v1) uses **forward-1stOrder-markov chains** and **backward-1stOrder-markov chains** transition probabilities. Future versions will follow improvements by adding markov-chains of higher order. **_2) What is the advantage of using phase-extender ?_** - We generally need accurate phase state with in a gene/transcript level while doing ASE, dissecting maternal-paternal effects. Long haplotypes are mostly important while preparing diploid genome, testing selective sweeps within a QTL regions etc. For emerging organism systems where genotype data are sparse CW (chromosome wide), GW (genome wide) haplotypes are more difficult to solve. Also, haplotype phasing may be more complicated in out crossing individuals and hybrids due to heterogenity.\ RBphasing actually provides an advantage with heterogenous genome because frequency of RBphased blocks increases with heterogenity in the genome. These short haplotype fragments have multiple heterozygous variants on a short haplotype block. With increase in the size of `PE` (paired end) reads the size of RBphase blocks are also increases. phase-extender comes handy at this stage; that it tries to solve phase state of two consecutive blocks from one sample at a time by using data from haplotype blocks of other samples that bridge that breakpoint. So, we can solve haplotype configuration for SOI (sample of interest) with more confidence because: + We generally need accurate phase state with in a gene/transcript level while doing ASE, dissecting maternal-paternal effects. Long haplotypes are mostly important while preparing diploid genome, testing selective sweeps within a QTL regions etc. For emerging organism systems where genotype data are sparse CW (chromosome wide), GW (genome wide) haplotypes are more difficult to solve. Also, haplotype phasing may be more complicated in out crossing individuals and hybrids due to heterogeneity.\ RBphasing actually provides an advantage with heterozygous genome because frequency of RBphased blocks increases with heterogenity in the genome. These short haplotype fragments have multiple heterozygous variants on a short haplotype block. With increase in the size of `PE` (paired end) reads the size of RBphase blocks are also increases. phase-extender comes handy at this stage; it tries to solve phase state of two consecutive blocks from one sample at a time by using data from haplotype blocks of other samples that bridge that breakpoint. So, we can solve haplotype configuration for SOI (sample of interest) with more confidence because: - we have more variants within each blocks contributing to more information. - we only need to solve two possible phase state at one time compared to 2^n haplotype when reading one SNP at a time. phase-Extender also provides a more flexible and manipulative control over how to proceed with phase extension. It is also possible to control several parameters like `lods`, `snpTh`, `numHets`, `culLH`, `bed`, `useSample` to observe and compare how phase extension changes. @@ -234,12 +237,12 @@ Histogram of the distribution of the haplotype size (by genomic range of the hap Yes, but it is conditional. The InDels should already be reliably readbackphased to a haplotype block. That way when the haplotype is being extended for those SNPs, InDels hitchhike with it and get extended too. **_4) What is the minimal size of the haplotype block that is required?_** - The bigger the two haplotypes are, the better is the likelyhood test of which haplotype is phased with which. By, default I have kept this number to 3 variants (SNPs exclusive) per haplotype block that needs extension. + The bigger the two haplotypes are, the better is the likelihood test of which haplotype is phased with which. By, default I have kept this number to 3 variants (SNPs exclusive) per haplotype block that needs extension. **_5) Does phase-extender do GW (genome wide) or CW (chromosome wide) haplotype phasing_**? There are certain situation when phase-extender is able to do GW or CW haplotype phasing. A) If you have lots of samples where the haplotypes breakpoint in one sample is bridged by other samples, such that breakpoint gets solved with each recursive application of `phase-Extender` then it is possible to obtain CW and GW haplotype. - In this case we can run phase-extender for each sample there by extending the haplotype to certain extent. After this phase-extender can be applied recursively on the updated data each time, there by extending the haplotypes for each sample to full chromosome length and possibly to to full genome wide length. There is greater likelyhood of obtaining GW phase if samples are sequenced at higher coverage, increase in paired-end sequence length, availability of large sequence reads like pac-bio reads. + In this case we can run phase-extender for each sample there by extending the haplotype to certain extent. After this phase-extender can be applied recursively on the updated data each time, there by extending the haplotypes for each sample to full chromosome length and possibly to to full genome wide length. There is greater likelihood of obtaining GW phase if samples are sequenced at higher coverage, increase in paired-end sequence length, availability of large sequence reads like pac-bio reads. B) Another situation when GW, CW phase extension might be possible is when you have at least few samples which have haplotype resolved at GW/CW level. These can include fully phased data like genome matrix file, fully phased VCF data, fully phased haplotype reference panel. For this the fully phased sample should be provided as one single blocks in the group of sample that is piped to `phase-extender`. **_6) Does phase-extender phase non readbackphase SNPs_**? @@ -249,17 +252,17 @@ Histogram of the distribution of the haplotype size (by genomic range of the hap No, it does not. It is a possible future update. **_8) Does phase-extender use haplotype reference panel_**? - Yes, it does. Thought, the VCF (haplotype reference panel) should be convert to appropriate haplotype file. + Yes, it does. Though, the VCF (haplotype reference panel) should be converted to appropriate haplotype file. **_9) Does phase-extender use recombination into account_**? No and possibly these feature will be of least importance in phase-extender. Main objective of `Phase-Extender` is to join already phased short consecutive haplotype blocks with in a sample by using the relationship of the variants at those sites in several other samples. These haplotypes which are phased in other samples but has breakpoint in SOI are used to build transition probabilities. There is an assumption that recombination is less likely to occur exactly at that breakpoint or near it. So, most of the variation in haplotype among samples around the break point are not the result of recent recombination but only mutation. **_10) Does phase-extender phase rare genotypes_**? - Yes, it does. But, the rare genotype should be the readbackphased to the short haplotype blocks. This is one of the advantage of `phase-extender` compared to other tools when it come to phasing rare genotype. When a single SNPs is used singly to phase into a haplotype, rare genotypes are really hard to phase accurately - the reason being the statistical significance of the rare genotype belonging to either two phase state is highly ambigous. But, if the rare genotype is attached to a haplotye block supported by several read-back phased genotypes, this makes phasing of rare genotypes most accurate, since likelyhoods are provided by other SNPs that are not rare. + Yes, it does. But, the rare genotype should be the readbackphased to the short haplotype blocks. This is one of the advantage of `phase-extender` compared to other tools when it come to phasing rare genotype. When a single SNPs is used singly to phase into a haplotype, rare genotypes are really hard to phase accurately - the reason being the statistical significance of the rare genotype belonging to either two phase state is highly ambigous. But, if the rare genotype is attached to a haplotye block supported by several read-back phased genotypes, this makes phasing of rare genotypes most accurate, since likelihoods are provided by other SNPs that are not rare. **_11) How fast is phase-extender_**? phase-extender is written in python-3, so it is comparatively slower than other tools that are built on the top of C, C++ or java. - Coming from a pure biology background, learning python was one of the most enduring task I have taken and then building this tool was a big part of my PhD. I have optimized the part of calling VCF file using cyvcf2 (which is on average 4 times faster than old pyVCF module). phase-extender is also optimized for being able to run on multiple threads/process. But, if you are running phase-extender on big genome data and have very large number of samples, and running on laptop I suggest running on one thread, which may be time consuming but will reduce memory burden. + Coming from a pure biology background, learning python was one of the most enduring task I have taken and then building this tool was a big part of my PhD. I have optimized the part of calling VCF file using cyvcf2 (which is on average 4 times faster than the old pyVCF module). phase-extender is also optimized for being able to run on multiple threads/process. But, if you are running phase-extender on big genome data and have very large number of samples, and running on laptop I suggest running on one thread, which may be time consuming but will reduce memory burden. **_12) Does phase-extender do trio based phase extension_**? No, it does not. It is a possible future update. @@ -279,7 +282,7 @@ Histogram of the distribution of the haplotype size (by genomic range of the hap ## ![#f03c15](https://placehold.it/15/f03c15/000000?text=+)Acknowledgement I have not been very fortunate to surround myself or at least get face to face help from savvy computer programmers. But, my heart is very thankful to people behind the web who have made me capable of working this problem out. **Thanks to many people on biostars, stackoverflow, seqanswer and google web searches who provided feedback on small question that were the part of `phase-Extender` project.** - Should anyone be interested in futher improving this project via improvments on alrorithm and programming, I would be more than happy to. + Should anyone be interested in futher improving this project via improvements on algorithm and programming, I would be more than happy to.

diff --git a/docs/READme.md b/docs/READme.md index 8cc869d..b73900b 100644 --- a/docs/READme.md +++ b/docs/READme.md @@ -1,17 +1,17 @@ -This tutorial provides hands on for preparing required `input files` and running `phase-Extender`. **phase-Extender** is run with three different example sets and test files for the runs are provided in the appropriate links. +This tutorial provides a hands on for preparing required `input files` and running `phase-Extender`. **phase-Extender** has been provided with three example sets highlighting possible use cases. ## Part-A: Run phase extender. ### Step 01: Prepare input files. - **A) Convert VCF to haplotype file:** + **A) Convert VCF to a haplotype file:** ``` python3 vcf_to_table-v3.py --mode VcfToHap --PI PI --PG PG --vcf RBphased_file.vcf --out haploype_file.txt ``` **Note:** - - If RBphase information is represented by FORMAT fields other than "PI" and "PG", then it can be replaced accordingly. - - `"--unphased yes"` can be added to `"vcf_to_table-v3.py"` to include the unphased genotypes. This parameter will not affect the phase extension, and is only included to keep the whole data intact. - - run command `python3 vcf_to_table-v3.py --help` for more details on VCF and haplotype reference panel to table conversion. + - If RBphase information is represented by FORMAT fields other than "PI" and "PG", then it will have to be placed accordingly. + - `"--unphased yes"` switch can be added to `"vcf_to_table-v3.py"` to include the unphased genotypes. This parameter will not affect the phase extension, and is only included to keep the entire data-set in the conversion. + - run command `python3 vcf_to_table-v3.py --help` for a detailed help section or manual to convert into the requisite tabular format from a VCF or Haplotype panel.
@@ -23,7 +23,7 @@ python3 vcf_to_table-v3.py --mode RefPanelToHap --PI CHROM --PG GT --vcf RefPane
### Step 02: Run phase-Extender. -Parameters that are not called are set at default value. +Parameters that are not called are retained at default values. - **Call for help -** ``` @@ -46,7 +46,7 @@ python3 phase_extender_v1-final.py --input input_haplotype_file.txt --SOI ms02g - **Example test case 02 (multiple cases) -**\ Use data from [example 02](https://github.com/everestial/phase-Extender/tree/master/example02) ``` -# use 2 processes +# There are two steps to run case 02. # use 25 heterozygote SNPs (from each block) for ... # ... preparing transition matrix between two consecutive blocks # set lods2 cutoff threshold at 10 @@ -55,7 +55,6 @@ python3 phase_extender_v1-final.py --nt 2 --input haplotype_file_test02.txt --SO # Output is stored in directory `ms02g_extended\`. - # to assign output to a different directory python3 phase_extender_v1-final.py --input haplotype_file_test02.txt --SOI ms02g --output my_test # Output is stored in directory `my_test\`. @@ -119,7 +118,7 @@ contig pos ref all-alleles ms02g_PI ms02g_PG_al log2odds
**What does this tell us?** - - Our cutoff threshold was `10` and the computed lods between the two blocks is `-73.3033`. + - The lod cutoff parametrized in the command is an absolute value. Our cutoff threshold was `10` and the computed lods between the two blocks is `-73.3033`. - Since, the **|computed lods|** > **lods threshold** phase-Extender proceeds with phase extension between blocks with PI (6 and 4). - Since, the **compute lods** is a negative value, the phase is exteded in alternate configuration. - So, if **|computed lods|** < **lods threshold** the phase state won't extend between two consecutive blocks. @@ -128,7 +127,7 @@ contig pos ref all-alleles ms02g_PI ms02g_PG_al log2odds
### **2. Interpreting histogram plots (from output of test case 03) -**\ -So, when phase-Extender runs it will join two consecutive haplotype and will increase the size of the haplotype by number of variants within each haplotype, and also by length of the haplotype. Inversly, the total number of haplotype will decrease. This change can be compared by observing changes in the shape of the histogram (size by number of the haplotype) before vs. after phase extension. +So, when phase-Extender runs it will join two consecutive haplotype and will increase the size of the haplotype block by number of variants within each haplotype, and also by length of the haplotype. Inversly, the total number of haplotype will decrease. This change can be compared by observing changes in the shape of the histogram (size by number of the haplotype) before vs. after phase extension. **Histogram of the haplotype size distribution before phase extension (contig 2 & 3)** ![beforephaseextension!](https://github.com/everestial/phase-Extender/blob/master/example03/hap_size_byVar_ms02g_initial.png) @@ -143,9 +142,9 @@ So, when phase-Extender runs it will join two consecutive haplotype and will inc
# Recursive application of haplotype phase extension - - ***phase-Extender maynot be able to prepare a full length phased haplotype in one run.*** This is not the limitation but rather a intended feature in this tool. The reason is to provide flexibility and allow the user to control phase extension. A controllable haplotype extension is largely required for phase extension in emerging research model owing to smaller genotype samples, absence of reference panel and higher heterozygosity in the genome.\ - - The main idea is to first run **phase-Extender** with higher `log2Odds cut off` for several samples. Then merge the output of each sample to run another round of **phase extension** with concessive (lower) `log2Odds cut off`. When applied recursively we reduce the number of haplotypes and increase the length of haplotypes in each run. - - To achieve full genomewide haplotype phasing there should be enough haplotype blocks (from other samples) bridging the haplotype breakpoint in the sample of interest. + - ***phase-Extender may not be able to prepare a full length phased haplotype in one run.*** This is not the limitation but rather an intended feature in this tool. The reason is to provide flexibility and allow the user to control phase extension. A controllable haplotype extension is largely required for phase extension in emerging research models owing to few genotype samples, absence of reference panel and heterozygosity in the genome.\ + - The main idea is to first run **phase-Extender** with higher `log2Odds cut off` for several samples. Then merge the output of each sample to run another round of **phase extension** with lower `log2Odds cut off`. When applied recursively we reduce the number of haplotypes and increase the length of haplotypes in each run. + - To achieve genome wide haplotype phasing there should be enough haplotype blocks (from other samples) bridging the haplotype breakpoint in the sample of interest. - So, controllable haplotype extension is a novel feature intended in **phase-Extender**. A full length recursive phase extension is illustrated in this link [phase Extender on recursive mode](some website??) using the bash script. **(coming soon !)**