shoryaconsul/XVir

XVir: A Transformer-Based Architecture for Identifying Viral Reads from Cancer Samples

Authors: Shorya Consul, John Robertson, Haris Vikalo

Requirements

The YAML file environment.yml specifies the dependencies required to run XVir and the competing benchmarks (DeepVirFinder and Virtifier). To create the environment, run:

conda env create -f environment.yml

File Structure

- utils/
  - __init__.py             : Needed for python packaging for `utils`
  - dataset.py              : Script to create dataset for XVir model
  - collate_data.py         : Script for creating Pickle data objects from .txt files
                              for numerically encoded reads
  - general_tools.py        : General tools for arguments, and backup while training
  - train_tools.py          : Tools for training
  - sample_data.py          : Subsampling reads from given read FASTA file or Pickle object
  - train_test_val_split.py : Script to take input read FASTA files and write output splits
                              into individual FASTA files
  - visualize_data.py       : Script to create t-SNE and MDS visualizations of reads
  - fastq2fasta.sh          : Bash script to create FASTA files corresponding to input FASTQ files
  

- data/                 : Place to store data. Additional documentation can be found in the README
                          included in the data directory.
- logs/                 : Place where logs and model weights will be saved. 
                          The model weights for the 150bp model have been included. 
- model.py              : XVir model specification
- main.py               : Main script
- trainer.py            : Script for trainer. Invoked whenever training XVir.
- __init__.py           : Needed for python packaging
- environment.yml       : Dependencies for environment

- setup.sh              : Script to set up environment variables
- visualize_results.py  : Visualize results of chosen model

- LICENSE
- README.md

To Run

Set up the required environment variables by running source setup.sh.

Inference Only

To use a trained XVir model for inference, we've included an inference.py script. We have also provided the model weights for the base 150bp model in the /logs/ folder. Given a FASTA file with 150bp reads, you may call it as:

python inference.py --model_path=./logs/XVir_150bp_model.pt --input=./path/to/fasta.fa

This creates fasta.fa.output.txt in the same location as the input, containing the name of each read along with the probability that the read is HPV-positive. For other models, you can also specify the flags --read_len, --ngram, --model_dim, and --num_layers (as in main.py). The inference batch size can be changed from the default (100) with --batch_size, and GPU execution can be enabled by passing --cuda.
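Once the output file exists, you will usually want to pull out the reads that score above some cutoff. The sketch below is an illustration only: the README says the file holds each read name with a probability but does not specify the exact layout, so the code assumes one read per line with the probability as the last whitespace-separated field. The function name load_predictions and the 0.5 threshold are hypothetical choices, not part of XVir.

```python
import io


def load_predictions(handle, threshold=0.5):
    """Return (read_name, probability) pairs with probability >= threshold.

    ASSUMPTION: each line is "<read name> <probability>", separated by
    whitespace. Lines that do not fit this pattern are skipped.
    """
    hits = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        try:
            # Split from the right so read names containing spaces survive.
            name, prob_text = line.rsplit(maxsplit=1)
            prob = float(prob_text)
        except ValueError:
            continue  # not a "name probability" line
        if prob >= threshold:
            hits.append((name, prob))
    return hits


# Made-up example content, standing in for fasta.fa.output.txt.
example = io.StringIO("read_1\t0.97\nread_2\t0.12\nread_3\t0.50\n")
print(load_predictions(example))
```

For a real run, replace the StringIO object with open("./path/to/fasta.fa.output.txt") and adjust the parsing if the actual file layout differs.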

Alternatively, you can use the --eval-only flag to run inference with XVir. See commands.sh for an example. This offers greater flexibility in terms of the format of the input files.
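Because the provided weights are for 150bp reads, it can be worth checking read lengths in a FASTA file before running inference. The following is a minimal standard-library sketch, not part of the XVir codebase; the function name read_lengths is hypothetical, and it assumes a plain-text FASTA file in which sequences may span several lines.

```python
import io


def read_lengths(handle):
    """Yield (read_name, sequence_length) for each record in a FASTA stream."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, length
            header = line[1:].split()
            # Use the first word of the header as the read name.
            name = header[0] if header else ""
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


# Toy input with an expected length of 6 so the example stays short.
fasta = io.StringIO(">r1\nACGT\nAC\n>r2 some description\nACG\n")
expected = 6  # use 150 for the provided 150bp model
wrong = [(n, l) for n, l in read_lengths(fasta) if l != expected]
print(wrong)  # [('r2', 3)]
```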

Training XVir on User Data (Recommended)

The script main.py is the primary entry point for the XVir pipeline. It includes the functionality for training, testing, and validating an XVir model on custom data.

python main.py <args>

For example, when specifying training, test and validation sets, XVir can be trained by running:

python main.py -s --train-data-file train_data.pkl --val-data-file val_data.pkl --test-data-file test_data.pkl --data-path data/ --device cuda

To prepare your data for training, please see the tools we have provided in the data folder.
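When a single data file is used instead of pre-made splits, the --train-split and --valid-split options give the fractions used for training and validation (0.8 and 0.1 by default). The sketch below shows what such a fractional split can look like. It is an assumption-laden illustration, not the logic in main.py: the function name split_indices is hypothetical, the remainder is assumed to form the test set, and the shuffle uses Python's random module rather than whatever XVir uses internally.

```python
import random


def split_indices(n_items, train_frac=0.8, valid_frac=0.1, seed=4):
    """Shuffle indices 0..n_items-1 and cut them into train/valid/test lists.

    ASSUMPTION: whatever is left after the train and validation
    fractions becomes the test set.
    """
    if train_frac < 0 or valid_frac < 0 or train_frac + valid_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = int(n_items * train_frac)
    n_valid = int(n_items * valid_frac)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


train, valid, test = split_indices(1000)
print(len(train), len(valid), len(test))  # 800 100 100
```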

Command line arguments

The command line options for XVir are outlined below. The default values of these arguments, used to create our XVir model, can be found in utils/general_tools.py.

| Argument | Description | Default |
|---|---|---|
| --data-path | Path from which to load data | 'data' |
| --data-file | Path of the data file, relative to data-path | 'proc_data.pkl' |
| --train-data-file | Relative path of the training data file | 'split/train_data.pkl' |
| --val-data-file | Relative path of the validation data file | 'split/val_data.pkl' |
| --test-data-file | Relative path of the test data file | 'split/test_data.pkl' |
| --train-split | Fraction of data to use for training | 0.8 |
| --valid-split | Fraction of data to use for validation | 0.1 |
| --experiment-name | Name of the experiment | 'XVir' |
| --device | Device to use for compute (GPU or CPU) | 'cuda' (can specify 'cuda:[int]') |
| --seed | Random seed | 4 |
| --read_len | Read length | 150 |
| --ngram | Length of k-mer | 6 |
| --model-dim | Embedding dimension of the transformer | 128 |
| --num-layers | Number of layers | 1 |
| --batch-size | Batch size | 100 |
| --dropout | Dropout rate (only for training) | 0.1 |
| --mask-rate | Masking rate (only for training) | None |
| --n-epochs | Number of epochs | 25 |
| --learning-rate | Learning rate | 0.001 |
| --weight-decay | Weight decay rate | 1e-6 |
| --eval-only | Only evaluate the model | N/A |
| -s | Pass separate splits for training and testing | N/A |
| --load-model | Load model | False |
| --model-path | Relative path from which to load the model | 'logs/experiment/XVir.pt' |
| --model-save-interval | How often (in epochs) to save the model | 5 |
| --model-update-interval | How often (in epochs) to update the model | 2 |
| --model-save-path | Directory in which to save the trained model | 'logs/experiment/XVir_models' |
| --print-log-interval | How often (in epochs) to print training logs | 1 |
| --val-log-interval | How often (in epochs) to print validation logs | 5 |
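To make the --read_len and --ngram options more concrete, the sketch below tokenizes a read into overlapping k-mers and maps each one to an integer. With the defaults (150bp reads, k = 6) this yields 145 tokens per read drawn from a vocabulary of 4^6 = 4096 k-mers. This is a generic illustration rather than XVir's actual encoding: the stride of 1, the alphabetical ACGT ordering, the -1 code for k-mers with other characters, and the names build_vocab and encode_read are all assumptions.

```python
from itertools import product


def build_vocab(k):
    """Map every length-k string over ACGT to an integer index."""
    return {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}


def encode_read(read, k, vocab):
    """Encode a read as indices of its overlapping k-mers (stride 1).

    ASSUMPTION: k-mers containing characters outside ACGT (e.g. N)
    are mapped to -1.
    """
    read = read.upper()
    return [vocab.get(read[i:i + k], -1) for i in range(len(read) - k + 1)]


vocab = build_vocab(6)
tokens = encode_read("ACGT" * 37 + "AC", 6, vocab)  # a made-up 150bp read
print(len(vocab), len(tokens))  # 4096 145
```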

Citation

If you use this software, please cite:

Consul, S., Robertson, J., & Vikalo, H. (2023). XVir: A Transformer-Based Architecture for Identifying Viral Reads from Cancer Samples. bioRxiv, 2023-08.

The corresponding BibTeX is:

@article{consul2023xvir,
  title={XVir: A Transformer-Based Architecture for Identifying Viral Reads from Cancer Samples},
  author={Consul, Shorya and Robertson, John and Vikalo, Haris},
  journal={bioRxiv},
  pages={2023--08},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}
