PredVirusHost

PredVirusHost is a set of tools for predicting the host domain of contigs from viromes (such as metagenomes). It compares the proteins on each contig to profile hidden Markov models (HMMs) and then scores each contig based on its similarity to those models.

PredVirusHost is built in Python 3.12 and requires the HMMER suite.

Install

Install dependencies

Create environment

Use Conda/Mamba (micromamba is small and fast)

conda create --name predvirushost -c conda-forge -c defaults python pip
conda activate predvirushost

Download this repo
Install PredVirusHost

cd PredVirusHost
pip install .

Download and extract models

tar xvf models-v0.2.0.tgz

Running PredVirusHost

PredVirusHost can be run from the command line or from a Jupyter notebook.

Command line

PredVirusHost can be run on the command line with run_predvirushost.py. It requires an input FASTA file (-i fasta_file.faa), an output directory (-o output_directory/), the location of the models (--data_path models-v0.2.0), and the protein name format used in the FASTA file (--format RefSeq; see below for details).

run_predvirushost.py -i /path/to/input/fasta_file.faa \
                     -o /path/to/output/directory \
                     --data_path /path/to/models/directory \
                     --format <fasta protein name format [RefSeq, GenBank, PROKKA, MGRAST, Other]> \
                     --n_min <Minimum number of proteins a genome must have to be included> \
                     --n_cpus <Number of cpu cores to run on>

By default, PredVirusHost checks whether output files already exist in the specified output directory, to avoid overwriting or appending to them. To overwrite existing files, use --delete, which prompts before removing files (THIS WILL REMOVE ALL FILES named fastafile_*.faa, data_*.pkl, and short_proteins_*.pkl). To skip the prompt, add --forcedelete as well.
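The check described above can be pictured with a short sketch. This is not PredVirusHost's own code: the function name find_existing_outputs is hypothetical, and only the three file patterns come from this README.

```python
from pathlib import Path

# File patterns that the README says --delete will remove.
OUTPUT_PATTERNS = ("fastafile_*.faa", "data_*.pkl", "short_proteins_*.pkl")


def find_existing_outputs(output_dir):
    """Return a sorted list of files in output_dir that match an output pattern.

    Hypothetical helper for inspecting a directory before a run; it only
    lists files and never deletes anything.
    """
    out = Path(output_dir)
    found = []
    for pattern in OUTPUT_PATTERNS:
        found.extend(out.glob(pattern))
    return sorted(found)
```

Listing the matches before a run shows exactly which files --delete would remove.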

The slowest step in the pipeline is hmmsearch, and that cost is unavoidable. PredVirusHost can therefore be split into steps, so that the input is preprocessed and the results are processed locally while hmmsearch is run separately (e.g. on a cluster). To run only the preprocessing steps (up to but not including hmmsearch) use --preprocess_only, and to run only the steps after hmmsearch use --process_results_only.

If the hmmsearch is being run separately it should be run with:

model=<one of [arVOG, euVOG, baPOG]>
file_number=<number found in fastafile_NUMBER.faa>
hmmsearch --tblout /path/to/output/directory/${model}_res_${file_number}.tbl \
          --noali --cpu <NUMBER OF CPUS TO RUN ON> \
          /path/to/data/${model}.hmm \
          /path/to/fastafile_${file_number}.faa
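
When hmmsearch is run separately, the same command has to be repeated for each of the three models. The sketch below builds that command in Python and mirrors the shell template above. The function names and the fasta_dir parameter are illustrative assumptions, not part of PredVirusHost; the hmmsearch options (--tblout, --noali, --cpu) are the ones shown above.

```python
import subprocess
from pathlib import Path

# Model names listed in the shell template above.
MODELS = ("arVOG", "euVOG", "baPOG")


def build_hmmsearch_cmd(model, file_number, output_dir, data_path, fasta_dir, n_cpus=1):
    """Build the hmmsearch argument list for one model and one split FASTA file."""
    return [
        "hmmsearch",
        "--tblout", str(Path(output_dir) / f"{model}_res_{file_number}.tbl"),
        "--noali",
        "--cpu", str(n_cpus),
        str(Path(data_path) / f"{model}.hmm"),
        str(Path(fasta_dir) / f"fastafile_{file_number}.faa"),
    ]


def run_all_models(file_number, output_dir, data_path, fasta_dir, n_cpus=1):
    """Run hmmsearch once per model; raises CalledProcessError if a run fails."""
    for model in MODELS:
        cmd = build_hmmsearch_cmd(model, file_number, output_dir, data_path, fasta_dir, n_cpus)
        # hmmsearch prints its full report to stdout; only the --tblout file is kept.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
```

Keeping command construction separate from execution makes it easy to print the commands and submit them to a cluster scheduler instead of running them locally.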

Jupyter Notebook

Install jupyter into the current environment with pip install jupyterlab.

Import modules

import predvirushost.utils.ProcessInput as pi
import predvirushost.utils.ProcessResults as pr
import predvirushost.utils.AnalyseResults as analyse

Provide input data for PredVirusHost

input_d = {'input_file': '/path/to/fastafile.faa',
           'output_path': '/path/to/output/directory',
           'data_path': '/path/to/models-v0.2.0',
           'format': 'RefSeq',
           'n_cpus': 1,
           'n_min': 5,
           'verbose': None,
           'process_results_only': False,
           'preprocess_only': False}

Run preprocessing

prediction = pi.ProcessInput(args=input_d)
prediction.check_files()
prediction.process_fasta()
prediction.check_short_genomes()

## These are the hmmsearch steps and can be run separately
prediction.check_hmm()
prediction.run_hmmsearch()

Process results

results = pr.ProcessResults(input_d)
results.process_results()
results.process_genomes()
results.write_genomes(file_type='pkl')
results.write_proteins(file_type='pkl')
results.write_hosts(file_type='pkl')

View results

analysis = analyse.AnalyseResults(input_d)
analysis.view_host('Archaea')

Input format

Input files with multiple different genomes need to be preprocessed so that each protein is mapped to the correct genome. PredVirusHost does this by splitting protein names on a delimiter.

Example format (RefSeq)

FILE-NAME.faa

>protein_id[contig_name]
protein_sequence

>protein_id[contig_name]
protein_sequence

>protein_id[contig_name]
protein_sequence
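
To illustrate how a header in this format maps a protein to its contig, here is a minimal sketch. It assumes the contig name is the last square-bracketed field of the header line. The helper names are hypothetical and this is not PredVirusHost's own parser.

```python
from collections import defaultdict


def split_header(header):
    """Split a '>protein_id[contig_name]' header into (protein_id, contig_name)."""
    line = header.lstrip(">").strip()
    if not line.endswith("]") or "[" not in line:
        raise ValueError(f"header has no trailing [contig_name] field: {header!r}")
    # Split on the last '[' so brackets earlier in the description are kept.
    protein, contig = line[:-1].rsplit("[", 1)
    return protein.strip(), contig.strip()


def group_proteins_by_contig(fasta_lines):
    """Map each contig name to the list of protein ids found in the FASTA headers."""
    groups = defaultdict(list)
    for line in fasta_lines:
        if line.startswith(">"):
            protein, contig = split_header(line)
            groups[contig].append(protein)
    return dict(groups)


print(split_header(">protein_id[contig_name]"))
# → ('protein_id', 'contig_name')
```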

Using a different format

If Other is selected, PredVirusHost will prompt for:

A delimiter
- the character that separates the contig name from the protein name/detail
A start position
- the number of delimiters, counted from the start of the name, that must be passed to be left with just the contig name
An end position
- the number of delimiters, counted from the end of the name, that must be passed to be left with just the contig name
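
One possible reading of these three values is sketched below: split the protein name on the delimiter, drop the first start fields and the last end fields, and rejoin what remains as the contig name. This is an interpretation for illustration only. The function contig_from_name is hypothetical, and PredVirusHost's actual handling may differ.

```python
def contig_from_name(name, delimiter, start, end):
    """Return the contig part of a protein name.

    Assumed semantics: skip `start` delimiter-separated fields at the front
    and `end` fields at the back, then rejoin the remaining fields.
    """
    parts = name.split(delimiter)
    stop = len(parts) - end
    if start < 0 or end < 0 or start >= stop:
        raise ValueError(f"start={start}, end={end} leave nothing of {name!r}")
    return delimiter.join(parts[start:stop])


# A name of the form <contig>_<protein number>: drop one field from the end.
print(contig_from_name("contig7_5", "_", 0, 1))
# → contig7

# A name of the form <sample>|<contig>|<protein>: drop one field from each side.
print(contig_from_name("sampleA|contig_12|prot_3", "|", 1, 1))
# → contig_12
```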
