SparkBeagle is a distributed implementation of the widely adopted Beagle imputation tool. It scales out genotype imputation from whole-genome sequencing panels in cloud environments.
CentOS Linux 7 (recommended), Java 8, Apache Maven 3.x, Hadoop 3.x.x, Spark 2.x.x, HDFS, tabix, bgzip
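A quick way to verify that the required command-line tools are installed is a small PATH check; the tool names below come from the list above (version checks are left out for brevity):

```shell
#!/usr/bin/env bash
# Report which of the required tools are available on the PATH.
for tool in java mvn hdfs spark-submit tabix bgzip; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK:      $tool"
  else
    echo "MISSING: $tool"
  fi
done
```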
git clone https://github.com/NGSeq/SparkBeagle
cd SparkBeagle
mvn install package
A JAR file containing all required library dependencies is created at target/sparkbeagle-x.x-jar-with-dependencies.jar
Install Hadoop either automatically with a Hadoop distribution such as Hortonworks HDP or Cloudera CDH,
or
manually with minimal configuration, by following the instructions in https://github.com/NGSeq/SparkBeagle/test/hadoop-init.sh and running the script.
If the root user does not have a home folder in HDFS, create one:
su hdfs
hdfs dfs -mkdir -p /user/root
hdfs dfs -chown root:root /user/root
exit
cd test
The test data is found under the data/chr22_subset folder:
hdfs dfs -put data/chr22_subset chr22_subset
You might need to change hdfs://localhost:8020 to match the NameNode address of your cluster. Check also the class path to the JAR file. Read https://spark.apache.org/docs/latest/submitting-applications.html if you want to change the Spark master or deploy mode.
spark-submit --master yarn --deploy-mode cluster \
    --executor-memory 4g \
    --driver-memory 4g \
    --executor-cores 8 \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.scheduler.mode=FAIR \
    --conf spark.shuffle.service.enabled=true \
    --conf spark.executor.memoryOverhead=1000 \
    --conf spark.port.maxRetries=100 \
    --class org.ngseq.sparkbeagle.Main ../target/sparkbeagle-0.5-jar-with-dependencies.jar \
    -rheader /user/root/chr22_subset/ref.header \
    -ref /user/root/chr22_subset/chr22ref.vcf.gz \
    -gt /user/root/chr22_subset/chr22target.vcf.gz \
    -map /user/root/chr22_subset/plink.chr22.GRCh37.map \
    -index /user/root/chr22_subset/chr22ref.vcf.gz.tbi \
    -out /user/root/chr22_subset/imputed \
    -nthreads 4 \
    -chrom 22 \
    -hdfs hdfs://localhost:8020
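For repeated runs it can help to wrap the command above in a small script with the cluster-specific values pulled out as variables. The sketch below writes such a script; the variable names and the /user/root paths are assumptions, so adjust them to your cluster:

```shell
#!/usr/bin/env bash
# Write a parameterized submit script; edit the variables for your cluster.
cat > run_impute.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

NAMENODE=hdfs://localhost:8020          # NameNode address of your cluster
JAR=../target/sparkbeagle-0.5-jar-with-dependencies.jar
DATA=/user/root/chr22_subset            # HDFS folder holding the test data
CHROM=22

spark-submit --master yarn --deploy-mode cluster \
    --executor-memory 4g --driver-memory 4g --executor-cores 8 \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.shuffle.service.enabled=true \
    --class org.ngseq.sparkbeagle.Main "$JAR" \
    -rheader "$DATA/ref.header" \
    -ref "$DATA/chr${CHROM}ref.vcf.gz" \
    -gt "$DATA/chr${CHROM}target.vcf.gz" \
    -map "$DATA/plink.chr${CHROM}.GRCh37.map" \
    -index "$DATA/chr${CHROM}ref.vcf.gz.tbi" \
    -out "$DATA/imputed" \
    -nthreads 4 -chrom "$CHROM" -hdfs "$NAMENODE"
EOF
chmod +x run_impute.sh
```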
hdfs dfs -getmerge /user/root/chr22_subset/imputed imputed_chr22_subset.vcf.gz
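After fetching the merged file it is worth a quick local sanity check of header and variant counts. To keep the sketch self-contained it builds a tiny two-variant stand-in VCF (purely illustrative); with the real output you would point the same pipeline at imputed_chr22_subset.vcf.gz:

```shell
#!/usr/bin/env bash
set -euo pipefail
# Tiny stand-in for imputed_chr22_subset.vcf.gz: 2 header lines, 2 variant lines.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n22\t100\t.\tA\tG\t.\t.\t.\n22\t200\t.\tC\tT\t.\t.\t.\n' \
  | gzip > demo.vcf.gz
# Header lines start with '#'; everything else is a variant record.
echo "header lines:  $(zcat demo.vcf.gz | grep -c '^#')"
echo "variant lines: $(zcat demo.vcf.gz | grep -vc '^#')"
```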
Run preprocess.sh first to download the 1000 Genomes reference panels and create the target data. The script also downloads the genetic maps.
./preprocess.sh
./save2hdfs.sh
./run_test.sh
hdfs dfs -getmerge /user/root/imputed imputed.vcf.gz
or fetch the whole directory as is:
hdfs dfs -get /user/root/imputed
mkdir reference target genmap
Download the genetic map files from http://bochet.gcc.biostat.washington.edu/beagle/genetic_maps/ and unzip them into the folder named "genmap".
for i in {1..22}; do bgzip reference/chr${i}.vcf & done; wait
for i in {1..22}; do bgzip target/chr${i}.vcf & done; wait
for i in {1..22}; do tabix -p vcf reference/chr${i}.vcf.gz & done; wait
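The loops above fork one background job per chromosome, which can overload a small machine. A generic way to cap concurrency is to compress in fixed-size batches and wait between them; the sketch below shows the pattern with gzip on stand-in files so it runs anywhere (substitute bgzip/tabix and the real per-chromosome VCFs):

```shell
#!/usr/bin/env bash
set -euo pipefail
# Stand-in files for the per-chromosome VCFs.
mkdir -p demo
for i in 1 2 3 4 5 6; do echo "data $i" > "demo/chr${i}.txt"; done
# Compress in batches of three: fork one batch, wait for it, start the next.
for batch in "1 2 3" "4 5 6"; do
  for i in $batch; do gzip -f "demo/chr${i}.txt" & done
  wait
done
ls demo/*.gz | wc -l
```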
hdfs dfs -put reference
hdfs dfs -put target
hdfs dfs -put genmap
zcat reference/chr1.vcf.gz | head -n253 > ref.header
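The head -n253 above matches the header length of this particular panel; other panels have different header lengths, so it is safer to select the '#' lines directly. A self-contained sketch of that approach, using a tiny stand-in VCF since the real reference/chr1.vcf.gz is not assumed to be present:

```shell
#!/usr/bin/env bash
set -euo pipefail
# Stand-in for reference/chr1.vcf.gz (3 header lines; the real panel has ~253).
printf '##fileformat=VCFv4.2\n##source=demo\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t100\t.\tA\tG\t.\t.\t.\n' \
  | gzip > chr1_demo.vcf.gz
# Extract every header line regardless of how many there are.
zcat chr1_demo.vcf.gz | grep '^#' > ref.header
wc -l < ref.header
```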
hdfs dfs -put ref.header
./run.sh
hdfs dfs -getmerge imputed imputed.vcf.gz