Data processing

Ted Verhey edited this page Jul 16, 2019 · 18 revisions

VAST employs a novel alignment strategy (see Verhey et al., 2017) to ensure that antigenic variants produced through segmental recombination can be traced consistently and without bias.

Alignment Method

The strategy has two stages. First, a multiple sequence alignment (MSA) of the cassettes is computed using vast align_cassettes (panels B-F). Then, reads are aligned to the reference and cassettes using vast map (panels G-I).

Aligning silent cassettes

The algorithm makes an MSA for each reference in the database that has cassette sequences associated with it. Although the command is simple, the MSA is computed in two steps:

  1. Performing a local alignment of each cassette to the reference, retaining all equivalent alignments.
  2. Finding the optimal MSA using a simulated annealing algorithm.
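The second step can be illustrated with a minimal sketch, which is not VAST's actual implementation. It assumes that step 1 has produced, for each cassette, a list of equally scoring gapped rows of the same length, and it uses a simple sum-of-pairs mismatch count as the cost to minimize. The function names, the cost function, and the linear cooling schedule are all assumptions chosen for illustration.

```python
import math
import random


def sum_of_pairs_cost(rows):
    """Count mismatching character pairs over every column of the gapped rows."""
    cost = 0
    for column in zip(*rows):
        for i in range(len(column)):
            for j in range(i + 1, len(column)):
                if column[i] != column[j]:
                    cost += 1
    return cost


def anneal_msa(candidates, steps=2000, start_temp=2.0, seed=0):
    """Pick one gapped row per cassette so that the combined cost is low.

    candidates: one list per cassette, holding the equivalent gapped rows
    (hypothetical output of step 1). Returns (chosen indices, cost).
    """
    rng = random.Random(seed)
    state = [rng.randrange(len(options)) for options in candidates]
    cost = sum_of_pairs_cost([opts[i] for opts, i in zip(candidates, state)])
    best_state, best_cost = list(state), cost
    for step in range(steps):
        # Linear cooling; the small constant avoids division by zero.
        temp = start_temp * (1 - step / steps) + 1e-9
        # Propose a different placement for one randomly chosen cassette.
        k = rng.randrange(len(candidates))
        proposal = list(state)
        proposal[k] = rng.randrange(len(candidates[k]))
        new_cost = sum_of_pairs_cost(
            [opts[i] for opts, i in zip(candidates, proposal)]
        )
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if new_cost <= cost or rng.random() < math.exp((cost - new_cost) / temp):
            state, cost = proposal, new_cost
            if cost < best_cost:
                best_state, best_cost = list(state), cost
    return best_state, best_cost


# Toy input: three cassettes, each with one deletion inside an "AAA" repeat
# of a hypothetical reference "GAAAT", so the gap can sit in three places.
candidates = [
    ["G-AAT", "GA-AT", "GAA-T"],
    ["G-AAT", "GA-AT", "GAA-T"],
    ["C-AAT", "CA-AT", "CAA-T"],
]
state, cost = anneal_msa(candidates)
```

In this toy input, any choice that places the gap in the same column for all three cassettes has the same minimal cost, which mirrors the situation described here: several MSAs can be equally good, and one of them has to be picked.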

Because the MSA serves as the basis for all subsequent sequence alignments, only one MSA is selected and retained in the database, even when several MSAs score equally. After cassette alignment, the /my_database/Cassette Alignments/ folder is populated with the equivalent multiple sequence alignments, in both .bam and .pyc formats. BAM-format files can be viewed with IGV or any other BAM viewer. By default, VAST chooses one of the best MSAs at random; to use a specific MSA as the basis instead, import its .pyc file with the --prealigned option.

Example

To align the cassettes for every reference in the database that does not yet have a computed multiple sequence alignment:

vast align_cassettes

The --force option can also be used to recompute multiple sequence alignments for references that already have them.

Usage

usage: vast align_cassettes [-h] [-p pyc_file] [-f]

optional arguments:
  -h, --help            show this help message and exit
  -p pyc_file, --prealigned pyc_file
                        A .pyc file representing previously computed multiple
                        alignment.
  -f, --force           Force recomputing alignments already in the database.

Mapping sequences

Once the cassette sequences have been aligned for each reference, sequence data can be mapped to the reference. By default, VAST maps each read, identifies which of the equivalent mappings are most similar to the silent cassette MSA, and selects those that are closest (by minimizing the map distance to each of the cassettes). This ensures that SNPs that could be templated are treated as templated, while ambiguity is retained for non-templated SNPs. Consequently, this scheme minimizes bias in the placement of non-templated SNPs. Alternatively, the --justify option can be used to improve performance and decrease memory usage. With this option, the final set of alignments is reduced to a single left-justified alignment, which biases non-templated SNPs to the left but still locates templated SNPs correctly.
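The difference between the two modes can be sketched with a toy example. This is an illustration under stated assumptions, not VAST's code: it uses a single deletion inside a repeat as the ambiguous variant, a Hamming distance to the nearest cassette row as a stand-in for the map distance, and hypothetical function names. It also does not model how VAST keeps templated variants in place under --justify; the left-justification step is only shown for the non-templated case.

```python
def equivalent_deletions(reference, read):
    """List every placement of one deletion run that fits the reference best.

    Toy stand-in for an aligner that retains all equivalent alignments;
    assumes `read` is shorter than `reference`.
    """
    gap = len(reference) - len(read)
    scored = []
    for start in range(len(read) + 1):
        row = read[:start] + "-" * gap + read[start:]
        mismatches = sum(a != b for a, b in zip(row, reference) if a != "-")
        scored.append((mismatches, row))
    fewest = min(m for m, _ in scored)
    return [row for m, row in scored if m == fewest]


def hamming(a, b):
    """Number of columns in which two equally long gapped rows differ."""
    return sum(x != y for x, y in zip(a, b))


def closest_to_cassettes(placements, cassette_rows):
    """Keep the placements whose distance to the nearest cassette is minimal."""
    distance = {
        p: min(hamming(p, row) for row in cassette_rows) for p in placements
    }
    best = min(distance.values())
    return [p for p in placements if distance[p] == best]


def left_justified(placements):
    """Reduce a set of equivalent placements to the one with the leftmost gap."""
    return min(placements, key=lambda p: p.index("-"))


# Hypothetical reference "GAAAT" and read "GAAT": the deleted "A" can sit
# in any of three columns of the repeat.
placements = equivalent_deletions("GAAAT", "GAAT")

# Templated case: a cassette row in the MSA carries the same deletion,
# so only the placement consistent with that row is kept.
templated = closest_to_cassettes(placements, ["GAA-T"])

# Non-templated case: no cassette carries the deletion, so every placement
# is equally close and the ambiguity is retained.
ambiguous = closest_to_cassettes(placements, ["GAAAT"])

# Justified mode collapses the ambiguous set to a single leftmost placement.
single = left_justified(ambiguous)
```

Keeping the full `ambiguous` list costs memory for every read but avoids committing to an arbitrary position, whereas the single left-justified row is cheaper to store and always resolves the position toward the left.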

This function is optimized for multiprocessing, and the --cpus argument can be used to specify the number of processes to use.

Example

Unmapped reads are mapped using the following command:

vast map

In addition, the --force option can be used to map previously mapped reads in the database.

Usage

usage: vast map [-h] [-j] [-f] [-c CPUS]

optional arguments:
  -h, --help            show this help message and exit
  -j, --justify         Do not use all equivalent alignments; instead, use a
                        single alignment that is the left-justified one. This
                        eliminates read mapping ambiguity and decreases memory
                        usage, but introduces a bias into alignments.
  -f, --force           Remap already mapped reads.
  -c CPUS, --cpus CPUS  Specify the number of CPUs to use for processing.
