ADAPTIVEMASKING

We describe tools that examine NGS mapping output to identify regions with high minor variant frequencies, as described. The rationale is to identify regions where consensus base calling may be unreliable. The process is 'adaptive' in that it models variation arising across the entire laboratory, sequencing and mapping process, so it can identify problem regions whichever stage of the process the variation comes from. In our case, increased variation occurred in regions of homology between Mycobacteria and other bacterial species, and was detected when laboratory processes started to use DNA extracts derived from broth cultures rather than from pure cultures. The technique described here allowed us to identify and mask the problem regions.

The output from the tools described includes interactive plots describing the determinants of mixtures across the mapped genome, and spreadsheets containing the same data.

Obtain software and test data
The software needed, and instructions on how to obtain test data, are here.

Quick start
A demonstration of the end-to-end process using test data follows.

# assumes test data is present

# start in the directory into which the project is cloned
# make scripts executable
cd pipeline/testdata
chmod +x *.sh

# step 1: mapping & vcf generation
./map_with_bowtie.sh        # takes about 90 mins for 50 samples on our hardware
# optional alternative using nohup
# nohup ./map_with_bowtie.sh > bowtie.out 2>bowtie.err &

# step 2: run Kraken on samples
./run_kraken.sh             # takes about 50 minutes for 50 samples on our hardware

# step 3: determine minor allele frequencies
# reference genome is NC_000962
# path to vcf files is a glob pattern, as shown (quotes are essential: without them the shell expands the pattern before the script receives it)
# AD is the tag to use
# ../output is the target directory
# takes about 60 minutes for 50 samples on our hardware
# core syntax:
# python3 ../../src/extract_mixed.py ../../testdata/NC_000962.3.gb "*/*.bowtie.vcf.gz" AD ../output/bowtie
 
# recommend using nohup
nohup python3 ../../src/extract_mixed.py ../../testdata/NC_000962.3.gb "*/*.bowtie.vcf.gz" AD ../output/bowtie > exmix.out 2>exmix.err &
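
The quotes around the VCF path matter because the pattern has to reach the script as a single argument and be expanded there, not by the shell. The sketch below illustrates that idea only; `find_vcfs` is a hypothetical helper and is not taken from `extract_mixed.py`, whose internals are not shown here.

```python
import glob
import os


def find_vcfs(pattern, base_dir="."):
    """Expand a glob pattern such as '*/*.bowtie.vcf.gz' relative to base_dir.

    Hypothetical helper for illustration; returns matching paths in sorted order.
    """
    return sorted(glob.glob(os.path.join(base_dir, pattern)))
```

If the pattern were left unquoted on the command line, the shell would replace it with one argument per matching file, and a script expecting a single pattern argument would then read the remaining positional arguments from the wrong places.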

Overview of the process followed

Mapping and VCF generation
Our approach will operate on the output from multiple mappers, with different settings, and with different kinds of input data. The objective of the project is to quantify the variation associated with the totality of a laboratory process: input DNA, sequencing, and mapping. Many bioinformatic pipelines will have already optimised mapping tools and settings; the process we describe will operate on their output.
As input it expects VCF files, which are typically generated by samtools/bcftools mpileup commands following mapping. In particular, it expects a tag in the VCF INFO section which contains the high quality base counts at each position: it is these which are the input to the algorithm. For more detail on this, see here.
Please see example code illustrating mapping and VCF generation on a test dataset.
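To make the expected input concrete, here is a minimal sketch of reading per-position base counts from the INFO column of one VCF data line. It assumes the counts are stored as comma-separated integers under a single INFO key, and uses `AD` as the default because that is the tag named in the quick start. The function names are hypothetical and this is not the parser used by the project.

```python
def base_counts_from_info(info_field, tag="AD"):
    """Return the integer counts stored under `tag` in a VCF INFO string.

    Returns None if the tag is absent. Assumes comma-separated integers.
    """
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        if sep and key == tag:
            return [int(v) for v in value.split(",")]
    return None


def parse_vcf_line(line, tag="AD"):
    """Return (chrom, pos, counts) for one tab-delimited VCF data line."""
    fields = line.rstrip("\n").split("\t")
    # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
    chrom, pos, info = fields[0], int(fields[1]), fields[7]
    return chrom, pos, base_counts_from_info(info, tag)
```

Header lines (those starting with `#`) would need to be skipped before calling `parse_vcf_line`.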
Estimating the amount of extraneous bacterial DNA present
One of the key findings from our paper is that mapping accuracy for some regions is determined by the amount of 'non-target' bacterial DNA present, in our case from species other than Mycobacteria. We estimated this using Kraken. Newer tools producing data in a similar format can also be used. Please see example code illustrating the use of Kraken, including the KrakenReportReader class from the KrakenReportReader module.
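As an illustration of the kind of estimate involved, the sketch below parses a Kraken report and computes the fraction of reads assigned to Bacteria but outside a target genus. It assumes the standard six-column, tab-delimited report (percentage, clade reads, directly assigned reads, rank code, taxonomy ID, indented name). Both functions are hypothetical and do not reproduce the KrakenReportReader class; the project may define its estimate differently.

```python
def parse_kraken_report(text):
    """Parse a six-column, tab-delimited Kraken report into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        pct, clade_reads, direct_reads, rank, taxid, name = line.split("\t")
        rows.append({
            "percent": float(pct),
            "clade_reads": int(clade_reads),
            "direct_reads": int(direct_reads),
            "rank": rank,
            "taxid": int(taxid),
            "name": name.strip(),  # names are indented to show tree depth
        })
    return rows


def non_target_bacterial_fraction(rows, target_genus="Mycobacterium"):
    """Fraction of all reads placed in Bacteria but outside the target genus."""
    # Every read is directly assigned to exactly one row (or is unclassified),
    # so the directly assigned counts sum to the total number of reads.
    total = sum(r["direct_reads"] for r in rows)
    clade = {r["name"]: r["clade_reads"] for r in rows}
    bacterial = clade.get("Bacteria", 0)
    target = clade.get(target_genus, 0)
    return (bacterial - target) / total if total else 0.0
```

Reports with additional columns would need the unpacking line adjusted.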
Determining the minor variant frequencies for genomic regions
We measured the minor variant frequencies (that is, the read depth accounted for by calls other than the most common nucleotide) across genomic regions. Please see here, which describes use of the regionScan_from_genbank class, which is in the vcfScan module.
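The definition above can be sketched in a few lines of Python. The per-position calculation follows the definition directly. Pooling depths across all positions in a region is an assumption made for this example; the regionScan_from_genbank class may summarise regions differently. All function names are hypothetical.

```python
def minor_variant_depth(base_counts):
    """Read depth from calls other than the most common nucleotide."""
    if not base_counts:
        return 0
    return sum(base_counts) - max(base_counts)


def minor_variant_frequency(base_counts):
    """Minor variant depth as a fraction of total depth at one position."""
    total = sum(base_counts)
    return minor_variant_depth(base_counts) / total if total else None


def region_minor_variant_frequency(per_position_counts):
    """Pool minor and total depth over every position in a region."""
    minor = sum(minor_variant_depth(c) for c in per_position_counts)
    total = sum(sum(c) for c in per_position_counts)
    return minor / total if total else None
```

For example, a position with base counts `[95, 3, 2, 0]` has a minor variant depth of 5 out of a total depth of 100.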
Modelling minor variant frequencies in reads mapped to genomic regions
Subsequently, we fitted Poisson models, region-by-region, estimating the relationship between the minor variant call depth (approximately, the amount of mixture detected) and an estimate of the amount of non-target bacterial DNA present. A Python class, AdaptiveMasking, in the AdaptiveMasking module, is provided to do this. Its use is described in detail.
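To show what fitting one such model involves, here is a minimal, dependency-free sketch of a Poisson regression with a log link and a single covariate, fitted by Newton-Raphson. It is not the AdaptiveMasking class. Treating total read depth as an exposure term, and the function name `fit_poisson`, are assumptions made for this example; the model specification actually used is described in the linked documentation.

```python
import math


def fit_poisson(y, x, exposure, max_iter=50, tol=1e-10):
    """Fit y_i ~ Poisson(exposure_i * exp(b0 + b1 * x_i)) by Newton-Raphson.

    y: minor variant depth per sample, x: covariate per sample (for example,
    an estimate of non-target DNA), exposure: total depth per sample.
    Requires sum(y) > 0 and at least two distinct values in x.
    Returns (b0, b1).
    """
    b0 = math.log(sum(y) / sum(exposure))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(max_iter):
        mu = [e * math.exp(b0 + b1 * xi) for e, xi in zip(exposure, x)]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Fisher information (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1
```

In this sketch one model would be fitted per genomic region, with one observation per sample; a positive `b1` indicates that minor variant depth in that region rises with the covariate.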
Depicting the results of the modelling
Methods in the AdaptiveMasking class allow depiction of model output, as described.
Masking based on the results
What to do next
