Ax-Sch/smk_plink_QC

 
 

THIS IS CURRENTLY UNDER ACTIVE DEVELOPMENT

Data Cleaning Pipeline

This is a simple data cleaning pipeline built with Snakemake for binary PLINK file sets (.bed, .bim, and .fam). The pipeline automates the data cleaning process to ensure reproducibility. It assumes that the input is in binary PLINK format and that phenotypes such as sex and case/control status have already been added.

Requirements

Installation

1. Clone the Repository

First, clone the repository to your local machine:

git clone https://github.com/Ax-Sch/smk_plink_QC 
cd smk_plink_QC

2. Set Up the Conda Environment

Create and activate the conda environment:

conda env create -f workflow/envs/snakemake8.yaml
conda activate snakemake8

3. Usage

  1. Configure Input Files and Parameters

All configuration options are set in the file config.yaml in the config directory. This file lets you customize paths, parameters, and settings to fit your needs. For example, you can set the path to your input files and the human genome reference version:

input_plink: "path/to/your/input_files/files.fam"
genome_ref:
    version: b37  # or b38
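
Before starting a run, it can help to confirm that the three files of the binary PLINK file set sit next to each other. The following is a minimal sketch and not part of the pipeline: the helper name `missing_plink_files` is hypothetical, and it assumes, as in the example above, that `input_plink` points at the .fam file and that the .bed and .bim files share its prefix.

```python
from pathlib import Path


def missing_plink_files(fam_path):
    """Return the paths of the .bed/.bim/.fam files that do not exist.

    Hypothetical helper: assumes all three files share the prefix of the
    given .fam path.
    """
    fam = Path(fam_path)
    if fam.suffix != ".fam":
        raise ValueError(f"expected a path ending in .fam, got {fam_path!r}")
    # Derive the sibling files by swapping the extension.
    expected = [fam.with_suffix(ext) for ext in (".bed", ".bim", ".fam")]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Placeholder path taken from the config example; replace with your own.
    missing = missing_plink_files("path/to/your/input_files/files.fam")
    print("missing files:", missing or "none")
```

An empty list means all three files were found.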

A toy data set is located in the folder example_data; use it to test whether the pipeline works in principle.

As mentioned earlier, the input files for this pipeline are genotype data in .bim, .bed, and .fam formats, with sex information and case/control status already included as phenotypes.
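
To check that sex and case/control status are present, the .fam file can be summarized before running the pipeline. This is a sketch for illustration only and not part of the pipeline; the function name `summarize_fam` and the example lines are made up. It relies on the standard PLINK .fam layout, in which column 5 holds the sex code and column 6 the phenotype.

```python
from collections import Counter

# Standard PLINK codes; any other value is treated as unknown/missing.
SEX_CODES = {"1": "male", "2": "female"}
PHENO_CODES = {"1": "control", "2": "case"}


def summarize_fam(lines):
    """Count sex and case/control codes in the lines of a .fam file.

    Columns: family ID, individual ID, father ID, mother ID,
    sex (1=male, 2=female, 0=unknown),
    phenotype (1=control, 2=case, 0/-9=missing).
    """
    sex, pheno = Counter(), Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 6:
            raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
        sex[SEX_CODES.get(fields[4], "unknown")] += 1
        pheno[PHENO_CODES.get(fields[5], "missing")] += 1
    return dict(sex), dict(pheno)


# Made-up example lines; with a real file, pass open("files.fam") instead.
example = [
    "FAM1 S1 0 0 1 2",
    "FAM2 S2 0 0 2 1",
    "FAM3 S3 0 0 0 -9",
]
sex_counts, pheno_counts = summarize_fam(example)
print(sex_counts)
print(pheno_counts)
```

Large "unknown" or "missing" counts suggest that the phenotypes have not yet been added to the file set.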

As part of the pipeline, Principal Component Analysis (PCA) is performed to filter out individuals who do not fall within the defined ranges for the ancestry you are interested in. The boundaries for the first two principal components (PC1 and PC2) are specified in the configuration file under the section "pca_ancestry_filters:". These ranges can be adjusted as necessary to meet the requirements of specific analyses.
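
The idea behind this filter can be sketched as a rectangular cut in the PC1/PC2 plane. The code below is an illustration under assumptions, not the pipeline's implementation: the key names `pc1_min`, `pc1_max`, `pc2_min`, and `pc2_max` are hypothetical (the actual keys are defined in config.yaml), and all values are made up.

```python
def within_ancestry_bounds(pc1, pc2, bounds):
    """Return True if (pc1, pc2) lies inside the rectangle given by bounds."""
    return (
        bounds["pc1_min"] <= pc1 <= bounds["pc1_max"]
        and bounds["pc2_min"] <= pc2 <= bounds["pc2_max"]
    )


def filter_samples(pcs, bounds):
    """Keep the sample IDs whose first two PCs fall within bounds.

    pcs maps sample ID -> (PC1, PC2).
    """
    return [
        sample_id
        for sample_id, (pc1, pc2) in pcs.items()
        if within_ancestry_bounds(pc1, pc2, bounds)
    ]


# Hypothetical bounds and PC values, for illustration only.
bounds = {"pc1_min": -0.02, "pc1_max": 0.01, "pc2_min": 0.00, "pc2_max": 0.03}
pcs = {
    "S1": (-0.010, 0.020),  # inside both ranges
    "S2": (0.050, 0.010),   # PC1 above the upper bound
    "S3": (0.000, -0.015),  # PC2 below the lower bound
}
kept = filter_samples(pcs, bounds)
print(kept)
```

Only samples that satisfy both ranges are retained, so widening either range keeps more individuals.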

  2. Run the Pipeline

Perform a dry run to ensure everything is set up correctly:

snakemake -np

If the dry run is successful, run the pipeline using the following command (replace 1 with the number of cores you want to use):

snakemake --cores 1 --use-conda --conda-frontend conda

4. Output

Each output directory, like its corresponding Snakemake rule, carries a capital letter (A-Z) as a prefix that identifies the pipeline step it belongs to:

  • A_Prepare_correct_x
  • B_VariantCallrate1 
  • C_SampleCallrate 
  • D_Filter_sex_checked 
  • D_Graph_Sex_check
  • D_Sex_check
  • E_Check_heterozygosity 
  • E_Filter_het_samples 
  • E_Get_heterozygosity 
  • F_VariantCallrate2 
  • G_Check_MissDiff_HWE 
  • G_Filter_MissDiff_HWE
  • G_Get_MissDiff_HWE 
  • H_Change_ID_for_1000G_PCA
  • H_Download_1000G_chromosomes
  • H_Download_1000G_sample_info 
  • H_Download_fasta_files 
  • H_Filter_plink_for_ancestry
  • H_Make_PCA_plots 
  • H_Merge_data_w_1000G_run_PCA_step3 
  • H_Prepare_1000G_for_ancestry_PCA_step1
  • H_Prepare_1000G_for_ancestry_PCA_step2
  • H_Run_pca_filter 
  • H_Unzip_fasta
  • I_Kinship_analysis 
  • I_Kinship_check2 
  • I_Kinship_check2_R 
  • I_kinship_analysis_R 
  • I_kinship_scatter_plot 
  • I_remove_relateds
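
The naming convention above makes it easy to group outputs by step. The snippet below is a small sketch and not part of the pipeline; `group_by_step` is a hypothetical helper, and the list contains only a subset of the directory names shown above.

```python
from collections import defaultdict


def group_by_step(names):
    """Group names by the capital-letter prefix before the first underscore."""
    steps = defaultdict(list)
    for name in names:
        prefix, _, rest = name.partition("_")
        steps[prefix].append(rest)
    return dict(steps)


# Subset of the output directories listed above.
names = [
    "A_Prepare_correct_x",
    "D_Filter_sex_checked",
    "D_Graph_Sex_check",
    "D_Sex_check",
    "I_remove_relateds",
]
grouped = group_by_step(names)
for letter in sorted(grouped):
    print(letter, grouped[letter])
```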

5. Acknowledgements

I would like to thank my supervisors, Dr. Axel Schmidt (@Ax-Sch) and Dr. Kerstin Ludwig, for their guidance and support. Additionally, I would like to thank the developers of the software used within this repository.
