Authors: Shorya Consul, John Robertson, Haris Vikalo
The YAML file `environment.yml` specifies the dependencies required to run XVir and the competing benchmarks (DeepVirFinder and Virtifier). To create the environment, run `conda env create -f environment.yml`.
- utils/
- __init__.py : Needed for python packaging for `utils`
- dataset.py : Script to create dataset for XVir model
- collate_data.py : Script for creating Pickle data objects from .txt files
for numerically encoded reads
- general_tools.py : General tools for arguments, and backup while training
- train_tools.py : Tools for training
- sample_data.py : Subsampling reads from given read FASTA file or Pickle object
- train_test_val_split.py : Script to take input read FASTA files and write output splits
into individual FASTA files
- visualize_data.py : Script to create t-SNE and MDS visualizations of reads
- fastq2fasta.sh : Bash script to create FASTA files corresponding to input FASTQ files
- data/ : Place to store data. Additional documentation can be found in the README
included in the _data_ directory.
- logs/ : Place where logs and model weights will be saved.
The model weights for the 150bp model have been included.
- model.py : XVir model specification
- main.py : Main script
- trainer.py : Script for trainer. Invoked whenever training XVir.
- __init__.py : Needed for python packaging
- environment.yml : Dependencies for environment
- setup.sh : Script to set up environment variables
- visualize_results.py : Visualize results of chosen model
- LICENSE
- README.md
Set up the required environment variables by running `source setup.sh`.
To use a trained XVir model for inference, we've included an inference.py script.
We have also provided the model weights for the base 150bp model in the /logs/ folder. Given a FASTA file with 150bp reads, you may call it as:
python inference.py --model_path=./logs/XVir_150bp_model.pt --input=./path/to/fasta.fa
This will create a fasta.fa.output.txt in the same location as the input, containing the name of each read along with the probability that the read is HPV positive.
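
The output file can be post-processed with a few lines of Python. The sketch below is only an illustration: it assumes each line holds a read name followed by a probability as the last whitespace-separated field, which may not match the exact layout written by inference.py, so adjust the parsing if needed. The function name `load_predictions` and the 0.5 threshold are hypothetical choices, not part of XVir.

```python
from pathlib import Path


def load_predictions(path, threshold=0.5):
    """Return (read_name, probability) pairs with probability >= threshold.

    Assumption: one read per line, with the probability as the last
    whitespace-separated field. Lines that do not fit are skipped.
    """
    hits = []
    for line in Path(path).read_text().splitlines():
        parts = line.strip().rsplit(maxsplit=1)
        if len(parts) != 2:
            continue  # blank line or a line with a single field
        name, prob_text = parts
        try:
            prob = float(prob_text)
        except ValueError:
            continue  # e.g. a header line
        if prob >= threshold:
            hits.append((name, prob))
    return hits
```

For example, `load_predictions("./path/to/fasta.fa.output.txt")` would list the reads that this threshold flags as likely HPV positive.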
For other models, you can also specify the flags --read_len, --ngram, --model_dim, and --num_layers (as in main.py).
The inference batch size can be changed from the default (100) with --batch_size, and GPU inference can be enabled by passing --cuda.
Alternatively, you can run inference through main.py with the --eval-only flag. See commands.sh for an example. This offers greater flexibility in terms of the format of the input files.
The script main.py is the primary entry point for the XVir pipeline. It includes the functionality for training, testing, and validating an XVir model on custom data.
python main.py <args>
For example, when specifying training, test and validation sets, XVir can be trained by running
python main.py -s --train-data-file train_data.pkl --val-data-file val_data.pkl --test-data-file test_data.pkl --data-path data/ --device cuda
To prepare your data for training, please see the tools we have provided in the data folder.
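
As a rough illustration of what a train/validation/test split involves, the sketch below shuffles a list of reads and partitions it by fraction. It is not the code shipped in the data folder or in utils/train_test_val_split.py; the function name `split_reads` is hypothetical, and the values 0.8, 0.1, and seed 4 simply mirror the defaults listed for --train-split, --valid-split, and --seed.

```python
import random


def split_reads(reads, train_frac=0.8, valid_frac=0.1, seed=4):
    """Shuffle reads and split them into train, validation, and test lists.

    Whatever remains after the train and validation fractions goes to test.
    """
    if train_frac < 0 or valid_frac < 0 or train_frac + valid_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    shuffled = list(reads)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test
```

Fixing the seed means the same input list always produces the same three subsets, which keeps experiments comparable across runs.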
The command line options for XVir are outlined below. The default values of these arguments, used to create our XVir model, can be found in utils/general_tools.py.
| Argument | Description | Default |
|---|---|---|
| --data-path | The path to load data | 'data' |
| --data-file | The relative path of data file from data-path | 'proc_data.pkl' |
| --train-data-file | Relative path of training data file | 'split/train_data.pkl' |
| --val-data-file | Relative path of validation data file | 'split/val_data.pkl' |
| --test-data-file | Relative path of test data file | 'split/test_data.pkl' |
| --train-split | Fraction of data to use for training | 0.8 |
| --valid-split | Fraction of data to use for validation | 0.1 |
| --experiment-name | Name of the experiment | 'XVir' |
| --device | Compute device to use (GPU or CPU) | 'cuda' (can specify 'cuda:[int]') |
| --seed | Random seed | 4 |
| --read_len | Read length | 150 |
| --ngram | Length of k-mer | 6 |
| --model-dim | The embedding dimension of transformer | 128 |
| --num-layers | The number of layers | 1 |
| --batch-size | The batch size | 100 |
| --dropout | Dropout rate (only for training) | 0.1 |
| --mask-rate | Masking rate (only for training) | None |
| --n-epochs | Number of epochs | 25 |
| --learning-rate | Learning rate | 0.001 |
| --weight-decay | Weight decay rate | 1e-6 |
| --eval-only | Only evaluate the model | N/A |
| -s | Passing splits for training and testing | N/A |
| --load-model | Load model | False |
| --model-path | Relative path to load model | 'logs/experiment/XVir.pt' |
| --model-save-interval | How often (in epochs) to save the model | 5 |
| --model-update-interval | How often (in epochs) to update the model | 2 |
| --model-save-path | Directory to save the trained model | 'logs/experiment/XVir_models' |
| --print-log-interval | How often (in epochs) to print training logs | 1 |
| --val-log-interval | How often (in epochs) to print validation logs | 5 |
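
To make the relationship between the read length and the k-mer length concrete, the sketch below splits a read into overlapping k-mers with a stride of 1. This is a generic illustration rather than XVir's actual encoding; the function name `kmerize` and the stride of 1 are assumptions. Under those assumptions, a 150 bp read with k = 6 yields 150 - 6 + 1 = 145 k-mers.

```python
def kmerize(read, k=6):
    """Return the overlapping k-mers of a read, using a stride of 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A read shorter than k produces an empty list.
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "ACGT" * 37 + "AC"  # a synthetic 150 bp read
kmers = kmerize(read, k=6)
print(len(kmers))  # 145
print(kmers[:2])   # ['ACGTAC', 'CGTACG']
```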
If you use this software, please cite:
Consul, S., Robertson, J., & Vikalo, H. (2023). XVir: A Transformer-Based Architecture for Identifying Viral Reads from Cancer Samples. bioRxiv, 2023-08.
The corresponding BibTeX is:
@article{consul2023xvir, title={XVir: A Transformer-Based Architecture for Identifying Viral Reads from Cancer Samples}, author={Consul, Shorya and Robertson, John and Vikalo, Haris}, journal={bioRxiv}, pages={2023--08}, year={2023}, publisher={Cold Spring Harbor Laboratory} }