This is a fork of sojwolf/ancientpca with a bug fix for the plotting function and some minor modifications.
A Probabilistic Approach to Conducting PCA with Sparse Genotype Data from Ancient Samples
if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")
devtools::install_github("sgrote/ancientpca")
- Genotypes: vcf-file with genotypes only (i.e. "0/0", "0/1", etc.). Missing values can be coded as "." or "./."
- Meta Data: csv-file that contains headers "Sample_ID" and "Population" (can contain other columns too) for at least all samples in your vcf-file.
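As a minimal sketch of a meta data file with the two required headers, the following writes a small CSV and reads it back. The sample IDs, population labels, the extra `Region` column, and the temporary file path are all hypothetical and only serve as an illustration.

```r
# Hypothetical sample IDs and population labels, for illustration only.
meta <- data.frame(
  Sample_ID  = c("sample_01", "sample_02", "sample_03"),
  Population = c("pop_A", "pop_A", "pop_B"),
  Region     = c("region_1", "region_1", "region_2"),  # optional extra column
  stringsAsFactors = FALSE
)

# Write to a temporary file so nothing in the working directory is overwritten.
meta_path <- file.path(tempdir(), "meta.csv")
write.csv(meta, meta_path, row.names = FALSE)

# Read the file back and check that both required headers are present.
meta_check <- read.csv(meta_path, stringsAsFactors = FALSE)
required <- c("Sample_ID", "Population")
stopifnot(all(required %in% colnames(meta_check)))
```

Every sample in the vcf-file should appear in the `Sample_ID` column; additional rows or columns are allowed.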
library(ancientpca)
- Load the vcf-file into R and, if need be, subsample by setting the maximum allowed missingness per sample and per SNP. The default for both is 1, i.e. 100%, so nothing is filtered out. The vcf-file can be zipped.
gt <- vcfToGenotypeMatrix(vcf_file="vcf_file.vcf", max_missing_snp=0.9, max_missing_sample=0.95)
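To illustrate what the two missingness thresholds mean, here is a small base-R sketch on a toy matrix. It is not the package's implementation: the matrix orientation (samples in rows, SNPs in columns), the 0/1/2 coding, the use of `<=`, and the order of the two filters are assumptions made for this example.

```r
# Toy genotype matrix: 4 samples (rows) x 5 SNPs (columns), coded as
# allele counts 0/1/2 with NA for missing calls (assumed layout).
gt_toy <- matrix(
  c( 0,  1, NA,  2, NA,
     1, NA, NA,  0, NA,
     2,  1,  0, NA, NA,
    NA,  0,  1,  1, NA),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("sample_", 1:4), paste0("snp_", 1:5))
)

max_missing_snp    <- 0.9
max_missing_sample <- 0.95

# Fraction of missing calls per SNP (column); snp_5 is missing everywhere.
snp_missing <- colMeans(is.na(gt_toy))

# Assumed order: drop SNPs above the threshold first, then samples.
gt_filtered <- gt_toy[, snp_missing <= max_missing_snp, drop = FALSE]
sample_missing <- rowMeans(is.na(gt_filtered))
gt_filtered <- gt_filtered[sample_missing <= max_missing_sample, , drop = FALSE]

dim(gt_filtered)
```

With these thresholds only the completely missing `snp_5` is removed; lowering either value makes the filter stricter.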
- Impute missing genotypes using the softImpute package
gt_imputed <- impGenotypeMatrix(genotype_matrix=gt)
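For readers unfamiliar with softImpute, the sketch below shows low-rank matrix completion on a small random matrix by calling the softImpute package directly. How `impGenotypeMatrix` calls softImpute internally is not documented here, so the values of `rank.max`, `lambda` and `type` are arbitrary assumptions, and the toy data is random rather than real genotypes.

```r
# Runs only if the softImpute package is installed.
if (requireNamespace("softImpute", quietly = TRUE)) {
  set.seed(1)

  # Toy 6 x 5 matrix of allele counts with three entries set to missing.
  x <- matrix(sample(0:2, 30, replace = TRUE), nrow = 6)
  storage.mode(x) <- "double"
  x[cbind(c(1, 3, 5), c(2, 4, 1))] <- NA

  # Fit a low-rank approximation to the observed entries
  # (rank.max and lambda chosen arbitrarily for this example).
  fit <- softImpute::softImpute(x, rank.max = 2, lambda = 1, type = "als")

  # Fill the missing entries with the fitted values; observed entries are kept.
  x_completed <- softImpute::complete(x, fit)
  stopifnot(!anyNA(x_completed))
}
```

The imputed entries are real-valued estimates rather than discrete genotype calls.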
- Calculate PCA using R's prcomp function
pca_gt <- prcomp(gt_imputed, center = TRUE, scale. = TRUE)
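One practical caveat: with `scale. = TRUE`, `prcomp` stops with an error if any column has zero variance, which can happen for SNPs that are monomorphic in the retained samples. The following sketch, on a hypothetical toy matrix, shows one way to drop such columns first. It is a suggestion, not part of the package.

```r
# Toy imputed matrix: 4 samples x 3 SNPs; snp_2 is constant across samples.
gt_toy_imputed <- matrix(
  c(0, 1, 2,
    1, 1, 0,
    2, 1, 1,
    0, 1, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("sample_", 1:4), paste0("snp_", 1:3))
)

# Keep only columns whose variance is greater than zero.
col_var <- apply(gt_toy_imputed, 2, var)
gt_variable <- gt_toy_imputed[, col_var > 0, drop = FALSE]

# Scaling now works because no constant column remains.
pca_toy <- prcomp(gt_variable, center = TRUE, scale. = TRUE)
colnames(gt_variable)
```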
- Plot PCs 1-6 and save the output to a PDF file.
plotImpPCA(pca_obj=pca_gt, original_matrix=gt, meta_file="meta.csv", output_pca_pdf="output.pdf")
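In addition to the PDF, the object returned by `prcomp` can be inspected directly. The sketch below uses a random toy matrix (not genotype data) to show how to compute the proportion of variance explained by each PC and how to access the sample coordinates; the same expressions apply to `pca_gt`.

```r
set.seed(42)

# Random toy data: 10 observations x 6 variables.
toy <- matrix(rnorm(60), nrow = 10)
pca_toy <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each principal component.
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)
round(var_explained, 3)

# Coordinates of the observations on the first two PCs.
head(pca_toy$x[, 1:2])
```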