Genetic variation in Anopheles RDL genes

What is this

The scripts and data in this repository can be used to reproduce all analyses from the manuscript "Resistance to dieldrin evolution in African malaria vectors is driven by interspecific and interkaryotypic introgression" (Grau-Bové et al., MBE 2020).

Genome variation data for this project was generated by the Anopheles gambiae 1000 Genomes Consortium (Ag1000G).

Contents

karyotype_2La_phase2.ipynb: karyotypes the 2La inversion in Ag1000G Phase 2 data, using genotype frequencies and the known karyotypes from Phase 1 as a training set. In principle it can karyotype any other inversion, as long as a training set is available. Output goes to results_karyotype and is used in Figure 5. Some output files are also used in the admixture notebooks.
haplotype_analysis_26nov19.ipynb: calculates genotype frequencies, builds haplotype networks, performs positive selection scans along the gene and the chromosome, and produces haplotype alignments. Output goes to results_haplotype_analysis and is used in Figures 1-4, 6 and 8.
admixture_h22nov19_296G.ipynb and admixture_h22nov19_296S.ipynb: perform Patterson's D tests of introgression (also known as the ABBA-BABA test) between various pairs of populations. Output goes to results_admixture and is used in Figure 7.

These scripts are available both as IPython notebooks (which can be viewed directly on GitHub) and as plain Python scripts.
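
As a rough illustration of what the admixture notebooks test, Patterson's D for a four-population tree (((P1, P2), P3), O) compares how often the ABBA and BABA site patterns occur. The sketch below is a minimal pure-Python version that assumes per-site derived allele frequencies are already available. The function name and the toy frequencies are made up for this example and are not taken from the notebooks. For real data, scikit-allel provides allel.patterson_d, and allel.average_patterson_d adds a block-jackknife estimate of the standard error.

```python
def patterson_d(p1, p2, p3, po):
    """Patterson's D from per-site derived allele frequencies.

    p1, p2, p3, po: equal-length sequences of frequencies in [0, 1]
    for populations P1, P2, P3 and the outgroup O.
    """
    abba = 0.0
    baba = 0.0
    for f1, f2, f3, fo in zip(p1, p2, p3, po):
        # Expected ABBA pattern: derived allele in P2 and P3, ancestral in P1 and O
        abba += (1 - f1) * f2 * f3 * (1 - fo)
        # Expected BABA pattern: derived allele in P1 and P3, ancestral in P2 and O
        baba += f1 * (1 - f2) * f3 * (1 - fo)
    if abba + baba == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / (abba + baba)


# Toy frequencies (hypothetical): P2 shares derived alleles with P3
# more often than P1 does, so D should come out positive.
p1 = [0.0, 0.5, 0.1]
p2 = [1.0, 0.5, 0.9]
p3 = [1.0, 1.0, 1.0]
po = [0.0, 0.0, 0.0]

d = patterson_d(p1, p2, p3, po)
print(f"D = {d:.3f}")
```

A D close to zero is what you expect without introgression; an excess of ABBA (D > 0) suggests gene flow between P2 and P3, and an excess of BABA (D < 0) suggests gene flow between P1 and P3. A single D value says nothing about significance on its own, which is why a jackknife over genomic blocks is normally used.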

Downstream analyses of Rdl alignments:

alignments_Rdl_haplotype_phylo: alignments of haplotypes from the Ag1000G dataset, log files from the iqtree ML phylogenetic analyses, and an R script to create phylogenetic visualisations (00_phylodist_04tip_17jun19.R, as in Figure 6).
alignments_Rdl_multisps: alignments of Rdl (CDS and peptides) from multiple mosquito species (A. gambiae, A. arabiensis, A. melas, A. merus, A. christyi, A. epiroticus, A. minimus, A. culicifacies, A. funestus, A. stephensi, A. maculatus, A. farauti, A. dirus, A. atroparvus, A. sinensis, A. albimanus, A. darlingi, Aedes aegypti, Aedes albopictus, and Culex quinquefasciatus) and a couple of R scripts to calculate pairwise identity (pairwise_identity.R) and dN/dS ratios (pairwise_dNdS.R).
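
In this repository the pairwise identity calculation is done in R (pairwise_identity.R). For readers who work in Python, the idea can be sketched as follows: count the matching positions between two aligned sequences, skipping columns where either sequence has a gap. The gap-handling convention, the function name and the toy sequences are assumptions made for this example, and they may differ from what the R script does.

```python
from itertools import combinations


def pairwise_identity(seq_a, seq_b, gap="-"):
    """Fraction of identical positions between two aligned sequences.

    Columns where either sequence has a gap are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            matches += 1
    return matches / compared if compared else float("nan")


# Toy alignment (hypothetical sequences, not real Rdl data)
alignment = {
    "sp1": "ATGGCT-AA",
    "sp2": "ATGGCTTAA",
    "sp3": "ATGACT-AG",
}

identities = {
    (x, y): pairwise_identity(alignment[x], alignment[y])
    for x, y in combinations(alignment, 2)
}
for (x, y), value in identities.items():
    print(f"{x} vs {y}: {value:.2f}")
```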

Other files & folders:

data: folder with metadata for the scripts above (sample info, karyotypes, etc.).
scripts_hapclust and scripts_printtranscripts: some helper functions.

Data

Where is the input data?

All required metadata is in the data folder.
Some accessory scripts are also available in the scripts_hapclust and scripts_printtranscripts folders.
Genomic variation data has to be downloaded from the Ag1000G project archive. These are huge files that don't fit in this repository. Download links for Phase1-AR3 and Phase2-AR1:

ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/AR3/
ftp://ngs.sanger.ac.uk/production/ag1000g/phase2/AR1/

Notes on data download:

All genome variation files you need are specified at the beginning of each Python notebook. Once you've downloaded them, edit the scripts to point to the relevant files. Variables to be edited are marked with #### EDIT THIS comments.
Data is available for download in various formats (VCFs, zarr, and HDF5). The scripts above use the zarr arrays and HDF5 files, which are highly compressed and very handy to use compared to VCFs. The python scripts require some special libraries to deal with these formats, mostly implemented in the scikit-allel, zarr and h5py libraries (see dependencies below).
Phased variants are available under the haplotypes/main subfolder:

ftp://ngs.sanger.ac.uk/production/ag1000g/phase2/AR1/haplotypes/main/
ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/AR3/haplotypes/main/

Nucleotide accessibility arrays in HDF5 format:

ftp://ngs.sanger.ac.uk/production/ag1000g/phase2/AR1/accessibility/
ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/AR3/accessibility/

Other metadata files:

ftp://ngs.sanger.ac.uk/production/ag1000g/phase2/AR1/samples/
ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/AR3/samples/
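
Before running a notebook, it can help to confirm that every path edited under an #### EDIT THIS comment points at something that actually exists on disk. The following is a minimal sketch using only the Python standard library. The variable names, the local download directory and the folder layout are hypothetical placeholders, not the names used in the notebooks.

```python
from pathlib import Path


def missing_inputs(paths):
    """Return the labels of inputs whose path does not exist on disk."""
    return [label for label, path in paths.items() if not Path(path).exists()]


#### EDIT THIS (hypothetical local copy of the Phase2-AR1 download)
data_dir = Path("/path/to/ag1000g/phase2/AR1")

inputs = {
    "haplotypes": data_dir / "haplotypes" / "main",
    "accessibility": data_dir / "accessibility",
    "samples": data_dir / "samples",
}

missing = missing_inputs(inputs)
if missing:
    print("Missing inputs, check the edited paths:", ", ".join(missing))
else:
    print("All input paths found.")
```

Running a check like this first is cheaper than discovering a wrong path halfway through a long notebook run.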

Dependencies

Python notebooks work with Python 3.7.4 and the following libraries, which can all be installed using conda:

numpy 1.17.3
zarr 2.3.2
pandas 0.25.3
scikit-allel (imported as allel) 1.2.1
scikit-learn (imported as sklearn) 0.21.3
h5py 2.10.0
scipy 1.3.2
bcolz 1.2.1
matplotlib 3.1.2
seaborn 0.9.0
itertools 7.2.0
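
To compare an existing environment against the versions listed above, something like the following can be used. It is a sketch that relies on importlib.metadata, which is part of the standard library from Python 3.8 onwards (so newer than the 3.7.4 quoted above). The helper names and the lookup argument are illustrative. The dictionary copies a subset of the list above and uses distribution names (scikit-allel, scikit-learn) rather than import names.

```python
from importlib import metadata

# Versions copied from the dependency list above
PINNED = {
    "numpy": "1.17.3",
    "zarr": "2.3.2",
    "pandas": "0.25.3",
    "scikit-allel": "1.2.1",
    "scikit-learn": "0.21.3",
    "h5py": "2.10.0",
    "scipy": "1.3.2",
    "bcolz": "1.2.1",
    "matplotlib": "3.1.2",
    "seaborn": "0.9.0",
}


def installed_version(name):
    """Installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs to (wanted, found)."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


for name, (wanted, found) in version_mismatches(PINNED).items():
    print(f"{name}: expected {wanted}, found {found or 'not installed'}")
```

A mismatch is not necessarily a problem, since newer releases often work, but it is a useful first thing to look at if a notebook fails on import or on a changed function signature.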

R scripts work with R 3.6.1 and require the following libraries:

seqinr 3.4-5
ape 5.3
phytools 0.6-60
pheatmap 1.0.12

If you use these scripts in your own work, please do not forget to cite the relevant packages as well. It's free and it makes everyone happy :)

For example, in R:

> citation("ape")

To cite ape in a publication use:

  Paradis E. & Schliep K. 2018. ape 5.0: an environment for modern
  phylogenetics and evolutionary analyses in R. Bioinformatics 35:
  526-528.

A BibTeX entry for LaTeX users is

  @Article{,
    title = {ape 5.0: an environment for modern phylogenetics and evolutionary analyses in {R}},
    author = {E. Paradis and K. Schliep},
    journal = {Bioinformatics},
    year = {2018},
    volume = {35},
    pages = {526-528},
  }

As ape is evolving quickly, you may want to cite also its version
number (found with 'library(help = ape)' or
'packageVersion("ape")').
