About crossover detection and analysis pipeline - coda

Coda is a tool which predicts and visualizes crossovers in F2 samples. It aims to guide the user though a step-by-step procedure from raw reads (fastq files) to results in the form of plots and tables.

The core hypothesis of this project is that given assemblies of two plants A and B, reads from plant A would map to assembly A over B. As such, Coda relies on alignment to what we termed a "dual genome" - a concatnated reference of the two parents.

Development of Coda was done in Weizmann Institute, Avraham Levy's Lab by Jules Zisser.

Pipeline prerequisites

BWA
samtools
Picard
bedtools
Python 3.7+ with:
- pandas
- matplotlib
- hmmlearn

Coda workflow

Assembling the reference

Change fasta file headers. Determines the chromosome notation used downstream. Use distinct, alphabetical names. Do not use digits.

 awk '/^>/{print ">alpha" ++i; next}{print}' < genomeA.fasta
 awk '/^>/{print ">beta" ++i; next}{print}' < genomeB.fasta

Concatenate reference files. This is the reference all samples are mapped to.

cat genomeA.fasta genomeB.fasta > genomeAB.fasta

Generate genome sizes. Used to generate bins. View genomeAB_sizes.txt afterwards to verify the genome names and sizes.

# module load samtools-gnu 
samtools faidx genomeAB.fasta
cut -f1,2 genomeAB.fasta.fai > genomeAB_sizes.txt # notice input is .fai here

Index concatenated reference. Required for BWA alignment.

# module load bwa-gnu
bwa index genomeAB.fasta

Create bin template

Map source alignments and filter. Using BWA mem, map each pair of reads genomeA_R1_00N.fastq.gz, genomeB_R2_00N.fastq.gz to genomeAB.fasta (-t 5 is the amount of parallel threads, it does not alter the alignment result). Pipe the output to samtools in order to filter out reads with quality / Fred score < 30 and convert to BAM format (binary / smaller form of SAM).

# module load bwa-gnu samtools-gnu
for R1 in genomeA_control_L001_00*R1.fastq.gz; 
do
    bwa mem -t 5 references/genomeAB.fasta $R1 $R2 | samtools view -Sbh -q 30 - > genomeA_control_00N.bam
done

Merge and sort all alignment files of the same library.

samtools merge genomeA_control_001.bam genomeA_control_002.bam  [...] > genomeA_control_merged.bam
samtools sort genomeA_control_merged.bam > genomeA_control_merged.sorted.bam

Deduplicate in order to retain unique reads

# module load jdk picard
picard MarkDuplicates I=genomeA_control_merged.sorted.bam O=genomeA_control_merged.sorted.dedup.bam M=genomeA_control_merged.dedup_metrics.txt REMOVE_DUPLICATES=true

Create bins table (decide on bin size, step size) using make_bins.py.

python make_bins.py genomeAB_sizes.txt 200

Calcuate control coverage to see where each parent aligns

bedtools coverage -counts -sorted -a genomeAB_sizes.bins.txt -b genomeA_control_merged.sorted.dedup.bam > genomeA_control.cov-sizes.bedgraph

Filter uninformative bins with filter_genomes.py. This results in reads that aren't unique to either parent to be thrown out, resulting in a much cleaner template called in this example genomeAB_template.cov-sizes.bedgraph.

python filter_genomes.py genomeA_control.cov-sizes.bedgraph genomeB_control.cov-sizes.bedgraph genomeAB_template.cov-sizes.bedgraph

Optional: Calculate average bin coverage.

cat genomeA_control.cov-sizes.bedgraph | awk '{sum+=$4} END { print "Average = ",sum/NR}'

Prepare individual dataset

Sample level steps can be mass deployed using the script processing_pipeline.sh coupled with dirs_process.sh.
The pipeline expects as an input the R1 fastq file of a reads pair.
All of the commands are the same as processing the control samples, except for the last step: when calculating coverage, use the template genomeAB_template.cov-sizes.bedgraph generated from filter_genome.py in step 2.2.9 instead of genomeAB_sizes.bins.txt.

Map sample reads to concatenated reference genomeAB.fasta
Quality filter and sort
Remove duplicates
Get coverage over bin template using bedtools

Analysis with Coda

There are multiple python dependencies required in order for coda.py to run. The recommended way to set up a suitable environment is by using Anaconda3.

module load anaconda # for servers that use "module" imports such as wexac
conda env create -f coda_env.yml # creates a virtual environment with specified requirements
conda activate xoe # activate it
python coda.py my_sample.dedup.cov.bedgraph predict # running the default pipeline on file "my_sample"

For advanced users who would rather use their own environments, the dependencies are listed below. Later versions of the packages are likely to work, but were not tested.

  - python=3.7.*                                   
  - pandas=0.24.*
  - matplotlib=3.1.2
  - hmmlearn=0.2.3

Module documentation

transform_data() - aggregating read count

Two methods: arbitrary or robust, which use group_count or bin_size respectively. Arbitrary is used by default.
Plotted below is an example of the sample data processed by the different methods.

Arbitrary - aggregate every group_count points.

Consistent distribution across samples
Less bins
Clear Heterozygote 0 group
Better results overall

Robust - all points in each bin_size section are grouped

Distribution highly affected by coverage
More bins
Bins contain very few datapoints
Better defined aggregation

fig = plt.figure(figsize=(12,4))

for i, method in enumerate(['arbitrary', 'robust']):
    s = coda.CrossoverDetector(log_level=2)
    s.transform_data(kiril_files(10), method=method)
    ax = plt.subplot(1,2,i+1)
    ax.hist(s._data.val, bins=np.arange(0, 1.01, 0.05), color='skyblue')
    ax.set_ylabel('Bin count')
    ax.set_xlabel('Normalised Reads per bin')
    ax.set_title(f'Grouping by {method.capitalize()} Method (total={len(s._data)})')
plt.tight_layout()

read_context() - input centromere data

Centromeres tend to have a lot of variations. Adding centromere positions means these inconsistent regions will be taken out before training, as well as marked in the results.
As of 2019, the context positions in fname needs to be listed as follows:

#	end	end
chr1	14000000	15599999
chr2	2900000	3949999
chr3	13600000	14549999
chr4	2000000	4259999
chr5	10930000	12659999

Hidden markov model training

By default the model will be initialized with 3 states - Homozygote A, Heterozygote, and Homozygote B. Each state is defined by mean, variance, and transition odds. During the fitting process these parameters are adjusted in a process called Expectation-maximization, or EM for short. It's recommended to train the model on multiple samples in order to produce more reliable results.

Approximate position offset:

min       166.50
p25      2364.25
p50      6860.50
p75     13613.50
max    331572.50

[Experimental] Syri - genome matching

Note: Syri was considered as a possibility to match crossover locations during development, but was eventually discarded.

module load mummer
nucmer --maxmatch -c 100 -b 500 -l 50 refgenome qrygenome
delta-filter -m -i 90 -l 100 out.delta > out_m_i90_l100.delta
show-coords -THrd out_m_i90_l100.delta > out_m_i90_l100.coords
~/apps/syri/syri/bin/syri -c out_m_i90_l100.coords -r TAIR10_filtered.fasta -q ../Denovo_Ler_Assembly/GCA_001651475.1_Ler_Assembly_genomic_filtered.fasta -d out_m_i90_l100.delta --nc 5

Sources:
Syri Publication Additional File 2 Col-0-Ler
https://www.biostars.org/p/49820/
Assembly match 3 Syri

Python usage example

import coda

# Notebook only settings
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Coda can be used as either a script or a module, which share the same functions. In this example it will be used as a python module.
The same result can be achieved by running the script like so:

python coda.py --group-count 10 --context centromere_pos.bed --hmm res_hmm.pickle KirilsF2_7_S7_.dedup.cov.bedgraph predict --smooth-model v1 --smooth-window 10

We begin by importing coda to our project, as well as other packages used for data manipulation and visualisation.

sample_file = 'README_files/example.dedup.cov-200.bedgraph'
sample1 = coda.CrossoverDetector()
sample1.coda_preprocess(sample_file, context_fname='centromere_pos.bed', group_count=10)
                        #'res_hmm.pickle'
sample1.coda_detect(smooth_model='v1', smooth_window=10)

[2020-01-23 14:50:01.754,  INFO] Loading and aggregating data from file "/home/labs/alevy/zisserh/proj/crossover_detection/README_files/example.dedup.cov-200.bedgraph" according to method "arbitrary".
[2020-01-23 14:50:02.418,  INFO] Successfully loaded 'README_files/example.dedup.cov-200.bedgraph'. Mean coverage is 10.732.
[2020-01-23 14:50:02.418,  INFO] Reading centromere boundries from file "centromere_pos.bed".
[2020-01-23 14:50:02.425,  INFO] Started genome_hmm_detection() on "KirilsF2_7_S7_.dedup.cov.bedgraph"
[2020-01-23 14:50:02.425,  INFO] Fitting HMM based on this sample.
[2020-01-23 14:50:02.603,  WARN] Initializing new model.
[2020-01-23 14:50:06.257,  INFO] Predicting most likely state according to HMM.
[2020-01-23 14:50:06.784,  INFO] Extractring initial transition locations.
[2020-01-23 14:50:06.806,  INFO] Found 289 suspects.
[2020-01-23 14:50:06.806,  INFO] Smoothing by filling gaps <10.
[2020-01-23 14:50:06.961,  INFO] Re-exctracting crossover positions.
[2020-01-23 14:50:06.983,  INFO] Found 29 suspects post smoothing.
[2020-01-23 14:50:06.983,  INFO] Looking for reciprocal crossover positions.
[2020-01-23 14:50:07.277,  INFO] 6 crossovers found, 2 more possible duplicate(s).
[2020-01-23 14:50:07.278,  INFO] Done.

plt.figure(figsize=(8,4))
sample1._raw_data.val.hist(bins=np.arange(sample1._raw_data.val.max()), label='Read count')
plt.ylabel('Bin count')
plt.xlabel('Reads per bin')
plt.title('Pre Transformation Coverage Per Bin Distribution')

plt.figure(figsize=(8,4))
plt.ylabel('Bin count')
plt.xlabel('Reads per bin')
plt.title('HMM Determined State Distribution')

for pred, name in enumerate(['Homozygote 0', 'Heterozygote', 'Homozygote 1']):
    sample1._data_predictions[sample1._data_predictions.pred == pred].val.hist(bins=np.arange(0,1,0.05),
                                                                                   alpha=0.8, label=name)
plt.legend()

sample1.plot_hmm(val_column='pred')

sample1.crossovers

	chrm	start	end	val	pred	probs	smooth	updated	change	abs_pos	pos_ler	dist	possible_dup	binsize	binsize_percentile	context
2298	col1	15079201	15086201	0.0175	0	[0.983 0.017 0. ]	-1	0	1	0.4958	13884201	0.0234	False	7000	82.91	centromere
2326	col1	15488401	15491401	0.035	1	[0.001 0.999 0. ]	-1	1	0	0.5091	14099001	0.0294	False	3000	47.87	centromere
5071	col1	29377401	29381001	0	0	[1. 0. 0.]	-1	0	1	0.9656	28350001	0.0009	False	3600	56.66	nan
9632	col3	5068201	5077801	0	0	[1. 0. 0.]	-1	0	1	0.2165	5035001	0.0064	False	9600	91.05	nan
12763	col3	18397201	18411801	0.2187	1	[0. 0.961 0.039]	-1	1	2	0.7849	17583601	0.0065	False	14600	96.65	nan
13317	col3	22814601	22819001	0.7143	2	[0. 0. 1.]	-1	2	1	0.9728	22034201	0.0027	False	4400	66.69	nan
16872	col4	18031201	18036201	0.2041	1	[0. 1. 0.]	-1	1	0	0.9705	17426801	0.0013	False	5000	71.04	nan
19813	col5	14959801	14964601	0.2945	1	[0. 0.934 0.066]	-1	1	2	0.5547	14323801	0.0116	False	4800	68.86	nan

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
README_files		README_files
scripts		scripts
Complete Process.ipynb		Complete Process.ipynb
LICENSE		LICENSE
README.md		README.md
__init__.py		__init__.py
coda.py		coda.py
coda_env.yml		coda_env.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About crossover detection and analysis pipeline - coda

Pipeline prerequisites

Coda workflow

Assembling the reference

Create bin template

Prepare individual dataset

Analysis with Coda

Module documentation

transform_data() - aggregating read count

read_context() - input centromere data

Hidden markov model training

[Experimental] Syri - genome matching

Python usage example

About

Uh oh!

Releases

Packages

Languages

License

zisserj/coda

Folders and files

Latest commit

History

Repository files navigation

About crossover detection and analysis pipeline - coda

Pipeline prerequisites

Coda workflow

Assembling the reference

Create bin template

Prepare individual dataset

Analysis with Coda

Module documentation

transform_data() - aggregating read count

read_context() - input centromere data

Hidden markov model training

[Experimental] Syri - genome matching

Python usage example

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages