Genotype QC to do
I'm (@dmccormi) submitting this as a single issue because each point requires the context of the others.
1. QCed genotypes will not just be used for GWAS.
- The pipeline is almost entirely GWAS-focused in its approach. Please split the genotype QC and GWAS parts into separate pipelines.
- Genotype QC can mostly remain as is for the Combine Genetic Variants part (although please see the Relatedness checks point below).
- I will handle the imputation upload etc.
- Please restart the pipeline after imputation, taking the outputs of TOPMed imputation as input.
- The output of the Genotype QC pipeline should be the complete set of imputed, QCed genotypes as well as a table of the related individuals for export filtering (see below).
- There are essential steps, dictated by our Information Governance principles, that must take place between genotype QC and downstream analyses. The outputs of the genotype QC pipeline will feed into this process, and project-specific exports are made at the end. In your case, these exports can then be put into the downstream GWAS pipeline. Covariates are generated during these steps, so an export will contain both the data and the covariates.
2. Relatedness checks
- Removing all related individuals during the genotype QC steps only makes sense if everyone is going to be analysed together.
- We have many different phenotypes, so we may end up removing related individuals where, for example, one is in the COVID cohort and another is in the hepatitis cohort. In addition, the lab does many analyses other than GWAS for which we often need genotypes. A genotype should not be removed from a small set because the individual is related to someone with a completely different phenotype in a different analysis.
- The relatedness step should output a table / matrix showing which genotypes must not be used together in a single analysis. I will then use this table / matrix to filter genotypes during the export process (which is project-directed).
- If this will be too difficult, then I propose we move the relatedness step from the Genotype QC pipeline to the more downstream project export step, which will then only remove related individuals in a project subset.
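
To make the requested relatedness output concrete, here is a minimal sketch of how a pairwise table could be used at export time. It is only an illustration under stated assumptions: the tuple layout `(sample_a, sample_b, kinship)`, the function name `samples_to_exclude`, the sample IDs, the kinship cut-off and the greedy removal strategy are all hypothetical and not part of the current pipeline.

```python
from collections import defaultdict

# Hypothetical relatedness table: one row per related pair.
# The layout (sample_a, sample_b, kinship) is an assumption for illustration.
RELATED_PAIRS = [
    ("S1", "S2", 0.25),
    ("S2", "S3", 0.12),
    ("S4", "S5", 0.26),
]


def samples_to_exclude(related_pairs, project_samples, min_kinship=0.0884):
    """Return samples to drop so no related pair remains within one project.

    Only pairs where BOTH samples are in the project subset are considered,
    so relatives in unrelated analyses do not cause removals.
    The default min_kinship is an assumed cut-off, not a pipeline setting.
    """
    project = set(project_samples)
    neighbours = defaultdict(set)
    for sample_a, sample_b, kinship in related_pairs:
        if kinship >= min_kinship and sample_a in project and sample_b in project:
            neighbours[sample_a].add(sample_b)
            neighbours[sample_b].add(sample_a)

    excluded = set()
    while True:
        # Count remaining related partners for each sample still in play.
        counts = {s: len(n) for s, n in neighbours.items() if n}
        if not counts:
            break
        # Greedy choice: drop the sample with the most relatives
        # (ties broken alphabetically so the result is deterministic).
        worst = max(sorted(counts), key=counts.get)
        excluded.add(worst)
        for other in neighbours.pop(worst):
            neighbours[other].discard(worst)
    return excluded


if __name__ == "__main__":
    print(samples_to_exclude(RELATED_PAIRS, ["S1", "S2", "S3", "S4"]))
```

In this sketch, `S4` is kept when `S5` is not part of the same project export, which is the behaviour asked for above. A greedy removal does not guarantee the largest possible unrelated set; it is used here only because it is short and deterministic.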
3. Array inputs
- Some of our collaborators (and our other open-source inputs) use, or will use, array types other than our core GenOMICC UK arrays.
- It would be more useful if the Genotype QC pipeline could take a list or table of array paths to pass into the Combine Genetic Variants step.
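
As one possible shape for that input, the sketch below reads a tab-separated manifest of arrays and validates it before anything is passed to the Combine Genetic Variants step. The column names `array_id` and `path`, the function name `load_array_manifest` and the example paths are assumptions made for illustration; the real manifest format is still to be decided.

```python
import csv
import io


def load_array_manifest(handle):
    """Read a tab-separated manifest with (assumed) columns array_id and path.

    Returns a list of {"array_id": ..., "path": ...} dicts and raises
    ValueError on missing columns, empty fields or duplicate array IDs.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = {"array_id", "path"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"manifest is missing columns: {sorted(missing)}")

    entries = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        array_id = (row["array_id"] or "").strip()
        path = (row["path"] or "").strip()
        if not array_id or not path:
            raise ValueError(f"line {line_number}: array_id and path are required")
        if array_id in seen:
            raise ValueError(f"line {line_number}: duplicate array_id {array_id!r}")
        seen.add(array_id)
        entries.append({"array_id": array_id, "path": path})
    return entries


if __name__ == "__main__":
    # Hypothetical manifest contents; these paths do not refer to real files.
    example = io.StringIO(
        "array_id\tpath\n"
        "core_uk\t/data/arrays/core_uk\n"
        "collaborator_a\t/data/arrays/collaborator_a\n"
    )
    for entry in load_array_manifest(example):
        print(entry["array_id"], entry["path"])
```

Validating the manifest up front means a mistyped or duplicated array entry fails immediately rather than partway through the merge.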
4. Resilience of pipeline
- Once the pipeline is finished, it needs to be rewritten in a commonly used language, preferably Python.