davidhwyllie/ADAPTIVEMASKING

ADAPTIVEMASKING

We describe tools to examine NGS-mapped output to identify regions of high minor variant frequencies, as described. The rationale is to identify regions where consensus basecalling may be unreliable. The process is 'adaptive' in the sense that it models variation arising across the entire laboratory, sequencing and mapping process, so it can identify regions of variation introduced at any stage. In our case, increased variation occurred in regions of homology between Mycobacteria and other bacterial species, and was detected when laboratory processes started to use broth-culture derived DNA extracts, rather than extracts from pure cultures. Use of the technique described here allowed identification and masking of the problem regions.

The output from the tools described includes interactive plots describing the determinants of mixtures across the mapped genome, and spreadsheets containing the same data.

Obtain software and test data
The software needed, and instructions on how to obtain test data, are here.

Quick start
A demonstration of the end-to-end process using test data follows.

# assumes test data is present

# start in the directory into which the project is cloned
# make scripts executable
cd pipeline/testdata
chmod +x *.sh

# step 1: mapping & vcf generation
./map_with_bowtie.sh        # takes about 90 mins for 50 samples on our hardware
# optional alternative using nohup
# nohup ./map_with_bowtie.sh > bowtie.out 2>bowtie.err &

# step 2: run Kraken on samples
./run_kraken.sh             # takes about 50 minutes for 50 samples on our hardware

# step 3: determine minor allele frequencies
# reference genome is NC_000962
# path to vcf files is as shown (quotes are essential; without them the shell expands the glob, which is not wanted)
# AD is the tag to use
# ../output is the target directory
# takes about 60 minutes for 50 samples on our hardware
# core syntax:
# python3 ../../src/extract_mixed.py ../../testdata/NC_000962.3.gb "*/*.bowtie.vcf.gz" AD ../output/bowtie
 
# recommend using nohup
nohup python3 ../../src/extract_mixed.py ../../testdata/NC_000962.3.gb "*/*.bowtie.vcf.gz" AD ../output/bowtie > exmix.out 2>exmix.err &
 

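Step 3 of the quick start passes `AD` as the tag holding per-position base counts. As a rough illustration of what "minor variant depth" means for one position, the sketch below reads such a tag from a single VCF data line and subtracts the most common count from the total. It is not the code in `extract_mixed.py`; the function names (`counts_from_info`, `scan_line`) and the example VCF line are made up for illustration, and it assumes the tag sits in the INFO column as comma-separated integers.

```python
def counts_from_info(info_field, tag="AD"):
    """Return the comma-separated integer counts stored under `tag`
    in a VCF INFO string, or None if the tag is absent."""
    for entry in info_field.split(";"):
        key, sep, value = entry.partition("=")
        if key == tag and sep:
            return [int(x) for x in value.split(",")]
    return None


def scan_line(vcf_line, tag="AD"):
    """Summarise one VCF data line: total depth, minor variant depth
    (depth not accounted for by the most common call) and their ratio."""
    fields = vcf_line.rstrip("\n").split("\t")
    counts = counts_from_info(fields[7], tag)  # column 8 is INFO
    if not counts:
        return None
    depth = sum(counts)
    minor_depth = depth - max(counts)
    return {
        "chrom": fields[0],
        "pos": int(fields[1]),
        "depth": depth,
        "minor_depth": minor_depth,
        "minor_frequency": minor_depth / depth if depth else 0.0,
    }


# Hypothetical VCF line, for illustration only (not taken from the test data)
example = "NC_000962.3\t1000\t.\tG\tA,<*>\t0\t.\tDP=50;AD=45,4,1"
print(scan_line(example))
```

Summing `minor_depth` and `depth` over the positions of a gene or other region gives a per-region figure of the kind that the later modelling step works with.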
Overview of the process followed

  1. Mapping and VCF generation
    Our approach will operate on the output from multiple mappers, with different settings, and with different kinds of input data. The objective of the project is to quantify the variation associated with the totality of a laboratory process: input DNA, sequencing, and mapping. Many bioinformatic pipelines will have already optimised mapping tools and settings; the process we describe will operate on their output.
    As input it expects VCF files, which are typically generated by samtools/bcftools mpileup commands following mapping. In particular, it expects a tag in the VCF INFO section which contains the high quality base counts at each position: it is these which are the input to the algorithm. For more detail on this, see here.
    Please see example code illustrating mapping and VCF generation operations on a test dataset.
  2. Estimating the amount of extraneous bacterial DNA present
    One of the key findings from our paper is that mapping accuracy for some regions is determined by the amount of 'non-target' bacterial DNA present, in our case from species other than Mycobacteria. We estimated this using Kraken. Newer tools producing data in a similar format can also be used. Please see example code illustrating the use of Kraken, including the KrakenReportReader class from the KrakenReportReader module.
  3. Determining the minor variant frequencies for genomic regions
    We measured the minor variant frequencies (that is, the read depth accounted for by calls other than the most common nucleotide) across genomic regions. Please see here, which describes use of the regionScan_from_genbank class, which is in the vcfScan module.
  4. Modelling minor variant frequencies in reads mapped to genomic regions
    Subsequently, we fitted Poisson models, region-by-region, estimating the relationship between the minor variant call depth (~ amount of mixture) detected and an estimate of the amount of non-target bacterial DNA present. A python class, AdaptiveMasking, in the AdaptiveMasking module, is provided to do this. Its use is described in detail.
  5. Depicting the results of the modelling
    Methods in the AdaptiveMasking class allow depiction of model output, as described.
  6. Masking based on the results
    What to do next
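
To make step 2 concrete, the sketch below estimates the fraction of reads assigned to bacteria other than the target genus from a Kraken-style report. It relies only on the standard tab-separated report layout (percentage, reads in clade, reads assigned directly, rank code, NCBI taxonomy ID, indented name). It is not the `KrakenReportReader` class; the function names and the tiny example report are hypothetical, and the example omits the intermediate taxonomic ranks a real report would contain.

```python
def clade_read_counts(report_text):
    """Map taxon name -> number of reads in the clade rooted at that taxon."""
    counts = {}
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # column 2 is the clade read count; column 6 is the indented name
        counts[fields[5].strip()] = int(fields[1])
    return counts


def non_target_bacterial_fraction(report_text, target="Mycobacterium"):
    """Fraction of all reads placed in Bacteria but outside the target genus."""
    counts = clade_read_counts(report_text)
    total = counts.get("unclassified", 0) + counts.get("root", 0)
    if total == 0:
        return 0.0
    non_target = counts.get("Bacteria", 0) - counts.get(target, 0)
    return non_target / total


# Made-up, heavily simplified report for illustration only
EXAMPLE_REPORT = (
    " 10.00\t100\t100\tU\t0\tunclassified\n"
    " 90.00\t900\t10\t-\t1\troot\n"
    " 89.00\t890\t90\tD\t2\t  Bacteria\n"
    " 80.00\t800\t800\tG\t1763\t    Mycobacterium\n"
)
print(non_target_bacterial_fraction(EXAMPLE_REPORT))
```

Matching taxa by name keeps the sketch short; a more careful version would match on the taxonomy ID column instead.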

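The modelling in step 4 can be pictured as a Poisson regression fitted separately for each region, with the minor variant depth in each sample as the response and the estimated non-target bacterial fraction as the predictor. The sketch below fits log(mu) = b0 + b1 * x for one region by Newton-Raphson in plain Python. It is an assumption-laden illustration, not the `AdaptiveMasking` class: the function name `fit_poisson` is hypothetical, the example numbers are invented, and the model form (single predictor, no offset for total depth) is a simplification that may differ from the one used in the project.

```python
import math


def fit_poisson(x, y, max_iter=50, tol=1e-10):
    """Fit log(mu_i) = b0 + b1 * x_i by Newton-Raphson.

    x: predictor per sample (e.g. non-target bacterial fraction)
    y: response per sample (e.g. minor variant depth in one region)
    Returns (b0, b1); stops after max_iter steps even if not converged.
    """
    mean_y = sum(y) / len(y)
    if mean_y <= 0:
        raise ValueError("need at least one positive response value")
    b0, b1 = math.log(mean_y), 0.0  # start from an intercept-only model
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # score vector (gradient of the log-likelihood)
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix entries
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        if det <= 0:
            raise ValueError("predictor must vary between samples")
        # Newton step: solve the 2x2 system information * delta = score
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) < tol and abs(d1) < tol:
            break
    return b0, b1


# Invented numbers for one region across six samples, for illustration only
fraction = [0.00, 0.05, 0.10, 0.20, 0.30, 0.50]
minor_depth = [3, 4, 4, 7, 9, 20]
b0, b1 = fit_poisson(fraction, minor_depth)
print(f"intercept={b0:.3f} slope={b1:.3f}")
```

A clearly positive slope for a region would suggest that its minor variant depth rises with the amount of non-target bacterial DNA, which is the kind of signal the masking decision is based on.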
About

Examine NGS mapped output in VCF/BCF files to identify regions of high variation
