Data processing
VAST employs a novel alignment strategy (see Verhey et al., 2017) to ensure that antigenic variants produced through segmental recombination can be traced in a consistent and unbiased way.

This strategy requires that a multiple sequence alignment (MSA) of the cassettes be computed using vast align_cassettes, shown in B-F. Then, reads are aligned to the reference and cassettes using vast map, shown in G-I.
The algorithm makes an MSA for each reference in the database that has cassette sequences associated with it. Although the command is simple, the MSA is computed in two steps:
- Doing a local alignment of each cassette to the reference, retaining all equivalent alignments.
- Finding the optimal MSA using a simulated annealing algorithm.
The MSA serves as a basis for the sequence alignments, and therefore, although there are different equally-scoring MSAs, only one is selected and retained in the database. Upon cassette alignment, the /my_database/Cassette Alignments/ folder will be populated with many equivalent multiple sequence alignments, in both .bam and .pyc formats. BAM-format files can be viewed with IGV or any other BAM viewer. By default, VAST chooses one of the best MSAs at random; however, you can use the --prealigned option to import a specific .pyc file as the basis MSA.
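The second step can be pictured with a small sketch. This is not VAST's implementation: the cost function (distinct symbols per column, gaps included), the cooling schedule, and the names `column_cost` and `anneal` are all assumptions chosen for illustration. Each cassette is given a list of equivalent gapped placements in reference coordinates, and simulated annealing picks one placement per cassette so that the columns agree as much as possible.

```python
import math
import random


def column_cost(rows):
    """Toy cost: in each column, every distinct symbol (gaps included) beyond the first costs 1."""
    return sum(len(set(column)) - 1 for column in zip(*rows))


def anneal(candidates, steps=2000, t_start=2.0, cooling=0.995, seed=0):
    """Pick one placement per cassette by simulated annealing (illustrative only)."""
    rng = random.Random(seed)
    # Start from a random choice of one candidate placement per cassette.
    state = [rng.randrange(len(options)) for options in candidates]
    cost = column_cost([candidates[i][j] for i, j in enumerate(state)])
    best_state, best_cost = list(state), cost
    temperature = t_start
    for _ in range(steps):
        # Propose a new placement for one randomly chosen cassette.
        i = rng.randrange(len(candidates))
        proposal = list(state)
        proposal[i] = rng.randrange(len(candidates[i]))
        new_cost = column_cost([candidates[k][j] for k, j in enumerate(proposal)])
        delta = new_cost - cost
        # Always accept improvements; accept worse states with a temperature-dependent probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            state, cost = proposal, new_cost
            if cost < best_cost:
                best_state, best_cost = list(state), cost
        temperature *= cooling  # geometric cooling
    return best_state, best_cost


# Hypothetical input: two cassettes, each with two equally scoring placements.
candidates = [
    ["AC-GT", "A-CGT"],
    ["AC-GT", "ACG-T"],
]
best_state, best_cost = anneal(candidates)
print(best_state, best_cost)
```

In this toy input only the combination that places both gaps in the same column has zero cost, so the search settles on it. With real data several combinations can tie, which is why one of them has to be fixed as the basis MSA.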
To align the cassettes for all references in the database without already computed multiple sequence alignments:
vast align_cassettes
The --force option can also be used to recompute multiple sequence alignments for references that already had them.
usage: vast align_cassettes [-h] [-p pyc_file] [-f]

optional arguments:
  -h, --help            show this help message and exit
  -p pyc_file, --prealigned pyc_file
                        A .pyc file representing previously computed multiple
                        alignment.
  -f, --force           Force recomputing alignments already in the database.
Once the cassette sequences have been aligned for each reference, sequence data can be mapped to the reference. By default, VAST maps each read, identifies which of the resulting mappings are most similar to the silent cassette MSA, and keeps those that are closest (by minimizing the map distance to each of the cassettes). This ensures that SNPs that could be templated are treated as templated, while retaining ambiguity in SNPs that are non-templated. Consequently, this scheme minimizes bias in the placement of non-templated SNPs. However, --justify can be used to increase performance and decrease memory usage. With this option, the final set of alignments is reduced to a single left-justified alignment, which biases non-templated SNPs to the left but still correctly locates templated SNPs.
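The selection rule can be illustrated with a minimal sketch. It assumes that mappings and cassettes are gapped strings in the same reference coordinates and that the distance is a plain Hamming distance. Neither assumption is confirmed as VAST's internal representation, and the function names are hypothetical.

```python
def hamming(a, b):
    """Number of aligned columns in which two equal-length gapped strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def closest_mappings(mappings, cassettes):
    """Keep every mapping whose distance to its nearest cassette is minimal."""
    scored = [(min(hamming(m, c) for c in cassettes), m) for m in mappings]
    best = min(score for score, _ in scored)
    return [m for score, m in scored if score == best]


def left_justify(mappings):
    """Reduce equivalent mappings to the one whose gaps sit furthest to the left."""
    return min(mappings, key=lambda m: [i for i, ch in enumerate(m) if ch == "-"])


# A read lacking one G can be placed against ACGGT in two equivalent ways.
equivalent = ["ACG-T", "AC-GT"]

# A cassette carrying the same gap resolves the ambiguity (templated case).
print(closest_mappings(equivalent, ["AC-GT", "ACGGT"]))

# Without such a cassette both placements are kept (non-templated case).
print(closest_mappings(equivalent, ["ACGGT"]))

# A --justify-like reduction keeps only the leftmost placement.
print(left_justify(equivalent))
```

The second call shows why the default mode uses more memory: ambiguous placements are all retained rather than collapsed into one.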
This function is optimized for multiprocessing, and the --cpus argument can be used to specify the number of processes to use.
Unmapped reads are mapped using the following command:
vast map
In addition, the --force option can be used to map previously mapped reads in the database.
usage: vast map [-h] [-j] [-f] [-c CPUS]

optional arguments:
  -h, --help            show this help message and exit
  -j, --justify         Do not use all equivalent alignments; instead, use a
                        single alignment that is the left-justified one. This
                        eliminates read mapping ambiguity and decreases memory
                        usage, but introduces a bias into alignments.
  -f, --force           Remap already mapped reads.
  -c CPUS, --cpus CPUS  Specify the number of CPUs to use for processing.
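When the mapping step is driven from a script, the options above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of VAST. It only builds the argument list from the flags documented in the usage text and does not run anything.

```python
def vast_map_command(justify=False, force=False, cpus=None):
    """Build the argument list for `vast map` from the documented options."""
    command = ["vast", "map"]
    if justify:
        command.append("--justify")  # single left-justified alignment per read
    if force:
        command.append("--force")  # remap reads that were already mapped
    if cpus is not None:
        if cpus < 1:
            raise ValueError("cpus must be a positive integer")
        command += ["--cpus", str(cpus)]
    return command


print(" ".join(vast_map_command(justify=True, cpus=4)))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where VAST is installed and the database is set up.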