THIS IS CURRENTLY UNDER ACTIVE DEVELOPMENT

Data Cleaning Pipeline

This is a simple Snakemake pipeline for cleaning genotype data stored as a binary PLINK file set (.bed, .bim, and .fam). It automates the cleaning steps so that they are reproducible. The pipeline assumes that phenotypes such as sex and case/control status have already been added to the files.

Requirements

Installation

1. Clone the Repository

First, clone the repository to your local machine:

git clone https://github.com/Ax-Sch/smk_plink_QC 
cd smk_plink_QC

2. Set Up the Conda Environment

Create and activate the conda environment:

conda env create -f workflow/envs/snakemake8.yaml
conda activate snakemake8

3. Usage

Configure Input Files and Parameters

All configuration options are in the config.yaml file within the config directory. This file lets you customize paths, parameters, and settings. For example, you can set the path to your input files and the human genome reference version:

input_plink: "path/to/your/input_files/files.fam"
genome_ref:
    version: b37  # or b38

A toy data set in the example_data folder can be used to test whether the pipeline works in principle.

As mentioned earlier, the input files for this pipeline are genotype data in .bim, .bed, and .fam formats, with sex information and case/control status already included as phenotypes.
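
Because a missing member of the file set is an easy mistake to make, it can help to check the input before starting a run. The following is a minimal sketch, not part of the pipeline: the helper names plink_file_set and missing_plink_files are hypothetical, and it assumes that input_plink points at the .fam file as in the config example above.

```python
from pathlib import Path

PLINK_SUFFIXES = (".bed", ".bim", ".fam")


def plink_file_set(input_plink):
    """Return the expected .bed/.bim/.fam paths for a configured input path."""
    path = Path(input_plink)
    # Assumption: the configured path ends in one of the PLINK suffixes,
    # as in the config example; otherwise it is treated as a bare prefix.
    base = path.stem if path.suffix in PLINK_SUFFIXES else path.name
    return [path.with_name(base + suffix) for suffix in PLINK_SUFFIXES]


def missing_plink_files(input_plink):
    """Return the members of the file set that do not exist on disk."""
    return [p for p in plink_file_set(input_plink) if not p.is_file()]


if __name__ == "__main__":
    # Placeholder path taken from the config example.
    for path in missing_plink_files("path/to/your/input_files/files.fam"):
        print("missing:", path)
```

If the script prints nothing, all three files were found next to each other.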

As part of the pipeline, a principal component analysis (PCA) is run, and individuals who fall outside the ranges defined for the ancestry of interest are removed. The boundaries for the first two principal components (PC1 and PC2) are specified in the configuration file under the section "pca_ancestry_filters:". Adjust these ranges as needed for your analysis.
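
The idea behind the range filter can be sketched in a few lines of Python. This is an illustration only and not the pipeline's own code: the key names (pc1_min, pc1_max, pc2_min, pc2_max), the sample IDs, and all numeric values are made up for the example, and the real keys under pca_ancestry_filters may differ.

```python
def within_ancestry_bounds(pc1, pc2, bounds):
    """True if a sample's PC1 and PC2 both lie inside the configured ranges."""
    return (
        bounds["pc1_min"] <= pc1 <= bounds["pc1_max"]
        and bounds["pc2_min"] <= pc2 <= bounds["pc2_max"]
    )


def filter_samples(pcs, bounds):
    """Return the IDs of samples to keep; pcs maps sample ID -> (PC1, PC2)."""
    return [
        sample_id
        for sample_id, (pc1, pc2) in pcs.items()
        if within_ancestry_bounds(pc1, pc2, bounds)
    ]


if __name__ == "__main__":
    # Hypothetical boundaries and PCA coordinates, for illustration only.
    bounds = {"pc1_min": -0.02, "pc1_max": 0.01, "pc2_min": -0.01, "pc2_max": 0.03}
    pcs = {"S1": (0.0, 0.0), "S2": (0.05, 0.0), "S3": (-0.01, 0.04)}
    print(filter_samples(pcs, bounds))  # prints ['S1']
```

A sample is kept only if both components are inside their ranges, so S2 (PC1 too high) and S3 (PC2 too high) are dropped here.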

Run the Pipeline

Perform a dry run to ensure everything is set up correctly:

snakemake -np

If the dry run is successful, run the pipeline using the following command (replace 1 with the number of cores you want to use):

snakemake --cores 1 --use-conda --conda-frontend conda

4. Output

Each output directory and its corresponding Snakemake rule has a capital letter (A-Z) as prefix, which identifies the pipeline step it belongs to:

A_Prepare_correct_x
B_VariantCallrate1
C_SampleCallrate
D_Filter_sex_checked
D_Graph_Sex_check
D_Sex_check
E_Check_heterozygosity
E_Filter_het_samples
E_Get_heterozygosity
F_VariantCallrate2
G_Check_MissDiff_HWE
G_Filter_MissDiff_HWE
G_Get_MissDiff_HWE
H_Change_ID_for_1000G_PCA
H_Download_1000G_chromosomes
H_Download_1000G_sample_info
H_Download_fasta_files
H_Filter_plink_for_ancestry
H_Make_PCA_plots
H_Merge_data_w_1000G_run_PCA_step3
H_Prepare_1000G_for_ancestry_PCA_step1
H_Prepare_1000G_for_ancestry_PCA_step2
H_Run_pca_filter
H_Unzip_fasta
I_Kinship_analysis
I_Kinship_check2
I_Kinship_check2_R
I_kinship_analysis_R
I_kinship_scatter_plot
I_remove_relateds
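
When inspecting results, the letter prefix makes it easy to group directories by step. A minimal sketch, assuming only the naming convention shown in the list above (the function name group_by_step is hypothetical):

```python
from collections import defaultdict


def group_by_step(names):
    """Group directory or rule names by their single-letter step prefix."""
    groups = defaultdict(list)
    for name in names:
        # Split "D_Sex_check" into the step letter "D" and the rest "Sex_check".
        step, _, rest = name.partition("_")
        groups[step].append(rest)
    return dict(groups)


if __name__ == "__main__":
    names = ["A_Prepare_correct_x", "D_Sex_check", "D_Graph_Sex_check"]
    for step, parts in sorted(group_by_step(names).items()):
        print(step, parts)
```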

5. Acknowledgements

I would like to thank my supervisors, Dr. Axel Schmidt (@Ax-Sch) and Dr. Kerstin Ludwig, for their guidance and support. Additionally, I would like to thank the developers of the software used within this repository.
