XVir: A Transformer-Based Architecture for Identifying Viral Reads from Cancer Samples

Authors: Shorya Consul, John Robertson, Haris Vikalo

Requirements

The YAML file environment.yml specifies the dependencies required to run XVir and the competing benchmarks (DeepVirFinder and Virtifier). To create the environment, run conda env create -f environment.yml

File Structure

- utils/
  - __init__.py             : Needed for python packaging for `utils`
  - dataset.py              : Script to create dataset for XVir model
  - collate_data.py         : Script for creating Pickle data objects from .txt files
                              for numerically encoded reads
  - general_tools.py        : General tools for arguments, and backup while training
  - train_tools.py          : Tools for training
  - sample_data.py          : Subsampling reads from given read FASTA file or Pickle object
  - train_test_val_split.py : Script to take input read FASTA files and write output splits
                              into individual FASTA files
  - visualize_data.py       : Script to create t-SNE and MDS visualizations of reads
  - fastq2fasta.sh          : Bash script to create FASTA files corresponding to input FASTQ files
  

- data/                 : Place to store data. Additional documentation can be found in the README
                          included in the _data_ directory.
- logs/                 : Place where logs and model weights will be saved. 
                          The model weights for the 150bp model have been included. 
- model.py              : XVir model specification
- main.py               : Main script
- trainer.py            : Script for trainer. Invoked whenever training XVir.
- __init__.py           : Needed for python packaging
- environment.yml       : Dependencies for environment

- setup.sh              : Script to set up environment variables
- visualize_results.py  : Visualize results of chosen model

- LICENSE
- README.md

To Run

Set up the required environment variables by running source setup.sh.

Inference Only

An inference.py script is included for running inference with a trained XVir model. The model weights for the base 150bp model are provided in the logs/ folder. Given a FASTA file with 150bp reads, you may call it as:

python inference.py --model_path=./logs/XVir_150bp_model.pt --input=./path/to/fasta.fa

This will create a fasta.fa.output.txt file in the same location as the input, containing the name of each read along with the probability that the read is HPV-positive. For other models, you can also specify the flags --read_len, --ngram, --model_dim, and --num_layers (as in main.py). The inference batch size can be changed from the default (100) with --batch_size, and the GPU can be enabled by passing --cuda.
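As a rough illustration of consuming that output, the sketch below parses prediction lines and keeps the reads at or above a probability cutoff. The exact layout of fasta.fa.output.txt is not documented here, so the parser assumes one read per line, with the read name followed by the probability and separated by whitespace, and no header line. The function names `load_predictions` and `positive_reads` and the 0.5 threshold are hypothetical choices made for this example.

```python
from typing import Dict, Iterable, List


def load_predictions(lines: Iterable[str]) -> Dict[str, float]:
    """Parse 'read_name probability' lines into a dict (assumed layout)."""
    predictions = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: the probability is the last whitespace-separated field.
        name, prob = line.rsplit(maxsplit=1)
        predictions[name] = float(prob)
    return predictions


def positive_reads(predictions: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return read names whose probability is at or above the threshold."""
    return [name for name, p in predictions.items() if p >= threshold]


if __name__ == "__main__":
    # Made-up lines standing in for the contents of an output file.
    example = ["read_1 0.97", "read_2 0.03", "read_3 0.51"]
    print(positive_reads(load_predictions(example)))
```

To apply this to a real file, pass an open file handle to `load_predictions`, after checking that the file actually follows the assumed layout.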

Alternatively, the --eval-only flag of main.py can be used to run inference with XVir. See commands.sh for an example of this. This offers greater flexibility in terms of the format of the input files.

Training XVir on User Data (Recommended)

The script main.py is the primary entry point for the XVir pipeline. It includes the functionality for training, testing, and validating an XVir model on custom data.

python main.py <args>

For example, when specifying training, test and validation sets, XVir can be trained by running python main.py -s --train-data-file train_data.pkl --val-data-file val_data.pkl --test-data-file test_data.pkl --data-path data/ --device cuda

To prepare your data for training, please see the tools we have provided in the data folder.
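For orientation only, the sketch below shows one way a set of reads could be divided into training, validation, and test subsets. It is not the repository's utils/train_test_val_split.py. The fractions and seed mirror the --train-split (0.8), --valid-split (0.1), and --seed (4) defaults listed under the command line arguments, and the function name `split_reads` is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_reads(
    reads: Sequence[str],
    train_frac: float = 0.8,
    valid_frac: float = 0.1,
    seed: int = 4,
) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle reads and split them into train/validation/test lists."""
    if not 0 <= train_frac + valid_frac <= 1:
        raise ValueError("train_frac + valid_frac must lie in [0, 1]")
    shuffled = list(reads)
    # A local RNG keeps the split reproducible without touching global state.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test


if __name__ == "__main__":
    reads = [f"read_{i}" for i in range(10)]
    train, valid, test = split_reads(reads)
    print(len(train), len(valid), len(test))
```

Splitting individual reads at random is the simplest option; whether it suits a given dataset (for example, when reads from the same sample should stay together) is a decision the sketch does not address.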

Command line arguments

The command line options for XVir are outlined below. The default values of these arguments, used to create our XVir model, can be found in utils/general_tools.py.

Argument	Description	Default
--data-path	The path to load data	'data'
--data-file	The relative path of data file from data-path	'proc_data.pkl'
--train-data-file	Relative path to load training data file	'split/train_data.pkl'
--val-data-file	Relative path to load validation data file	'split/val_data.pkl'
--test-data-file	Relative path to load test data file	'split/test_data.pkl'
--train-split	Fraction of data to use for training	0.8
--valid-split	Fraction of data to use for validation	0.1
--experiment-name	Name of the experiment	'XVir'
--device	Compute device to use (GPU or CPU)	'cuda' (can specify 'cuda:[int]')
--seed	Random seed	4
--read_len	Read length	150
--ngram	Length of k-mer	6
--model-dim	The embedding dimension of transformer	128
--num-layers	The number of layers	1
--batch-size	The batch size	100
--dropout	Dropout rate (only for training)	0.1
--mask-rate	Masking rate (only for training)	None
--n-epochs	Number of epochs	25
--learning-rate	Learning rate	0.001
--weight-decay	Weight decay rate	1e-6
--eval-only	Only evaluate the model	N/A
-s	Passing splits for training and testing	N/A
--load-model	Load model	False
--model-path	Relative path to load model	'logs/experiment/XVir.pt'
--model-save-interval	How often (in epochs) to save the model	5
--model-update-interval	How often (in epochs) to update the model	2
--model-save-path	Directory to save the trained model	'logs/experiment/XVir_models'
--print-log-interval	How often (in epochs) to print training logs	1
--val-log-interval	How often (in epochs) to print validation logs	5
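To make the --read_len and --ngram options concrete: a read of length L contains L - k + 1 overlapping k-mers, so the defaults (150 and 6) give 145 k-mers per read, drawn from 4^6 = 4096 possible 6-mers. The sketch below shows one common way to turn k-mers into integer tokens. It is an assumption made for illustration and not necessarily the encoding XVir uses internally, and `kmer_tokens` is a hypothetical helper.

```python
from typing import List

BASES = "ACGT"
BASE_INDEX = {base: i for i, base in enumerate(BASES)}


def kmer_tokens(read: str, k: int = 6) -> List[int]:
    """Encode each overlapping k-mer of a read as a base-4 integer."""
    read = read.upper()
    tokens = []
    for start in range(len(read) - k + 1):
        kmer = read[start:start + k]
        value = 0
        for base in kmer:
            # Raises KeyError on ambiguous bases such as 'N' (not handled here).
            value = value * 4 + BASE_INDEX[base]
        tokens.append(value)
    return tokens


if __name__ == "__main__":
    read = "ACGT" * 37 + "AC"  # a synthetic 150-base read
    tokens = kmer_tokens(read, k=6)
    print(len(tokens))  # 145 = 150 - 6 + 1
```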

Citation

If you use this software, please cite:

Consul, S., Robertson, J., & Vikalo, H. (2023). XVir: A Transformer-Based Architecture for Identifying Viral Reads from Cancer Samples. bioRxiv, 2023-08.

The corresponding BibTeX is:

@article{consul2023xvir,
  title={XVir: A Transformer-Based Architecture for Identifying Viral Reads from Cancer Samples},
  author={Consul, Shorya and Robertson, John and Vikalo, Haris},
  journal={bioRxiv},
  pages={2023--08},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
