This repo analyzes the spike glycoprotein sequecnces of the seasonal human coronavirus 229E by running a snakemake pipeline that downloads sequences based a list of accessions, processes the sequences, and constructs a phylogenetic Nextstrain tree that can be viewed using Auspice.
Analysis performed by Caleb Carr and Sheri Harari.
To download the accessions, go to NCBI Virus and click Search by virus. In the Search by virus name or taxonomy box, enter Human coronavirus 229E, taxid:11137 and hit enter. Then click the Download option, select Accession List and Nucleotide options and hit Next. On the next page, select Download All Records and hit Next. On the next page, select Accession with version and click Download. Sequences are downloaded from the list of accessions because more information is extracted from the genbank file during the download process. The current accession list was downloaded on Febuary 20, 2025.
The tree was rooted with Human coronavirus NL63 (Accession: NC_005831.2), which was added to the accession list downloaded above.
The pipeline can be run automatically using snakemake to run Snakefile, which reads its configuration from config.yaml. The results of the automated steps are placed in Results.
To run these steps, first build the conda environment, which installs the necessary programs. First install conda. Then build the environment.
conda env create -f environment.yml
Then activate the conda environment with:
conda activate nextstrain_analysis
and then run the pipeline on a computing cluster with slurm, which uses the configuration specified in cluster.yml:
sbatch run_snakemake_cluster.bash
Configure: Contains the files needed to configure the pipeline including the config.yaml and Input_Data which contains list of accessions, reference genomes, and outgroup references.Rules: Contains the snakemake rules used to run the pipeline.Scripts: Contains the custom python scripts used for part of the analysis.Results: Contains the automatically generated results from the pipeline analysis.