To define single-copy orthologues between C. bovis and C. elegans, we used a recipricol best BLAST hit approach. We first filtered the protein FASTAs so that only the longest isoform of each gene was present. We then used BLASTP to search each proteome against the other like so:
blastp -query CBOVIS.caenorhabditis_bovis.v1.proteins.faa_longest_isoforms -db CBOVIS.caenorhabditis_bovis.v1.proteins.faa_longest_isoforms -outfmt '6 qseqid sseqid pident mismatch gapopen qstart qend sstart send evalue bitscore qcovs qcovhsp qlen slen length' -num_threads 16 >CBOVIS_vs_CELEG.blastp.txt
blastp -query CELEG.caenorhabditis_elegans_N2_PRJNA13758_WBPS12.proteins.faa_longest_isoforms -db CELEG.caenorhabditis_elegans_N2_PRJNA13758_WBPS12.proteins.faa_longest_isoforms -outfmt '6 qseqid sseqid pident mismatch gapopen qstart qend sstart send evalue bitscore qcovs qcovhsp qlen slen length' -num_threads 16 >CELEG_vs_CBOVIS.blastp.txt
We then provided these output files to a script written by Dom Laetsch (and available here https://github.com/DRL/GenomeBiology2016_globodera_rostochiensis) to define orthogues like so:
./rbbh.py CBOVIS_vs_CELEG.blastp.txt CBOVIS.caenorhabditis_bovis.v1.proteins.faa_longest_isoforms CELEG_vs_CBOVIS.blastp.txt CELEG.caenorhabditis_elegans_N2_PRJNA13758_WBPS12.proteins.faa_longest_isoforms 1e-25 75 >CBOVIS_CELEG.rbbh.txt
Where:
1e-25 = minimum evalue for a hit to be considered
75 = min query coverage for a hit to be considered
The output of this script is a TSV file showing the transcript IDs and BLAST results.
We used the approach of Berriman et al. (2018) to create repeat libraries for C. bovis and C. elegans. These repeat library was then provided to RepeatMasker which generated a FASTA file with repeats masked with Ns. We then used repeat_density.py to calculate repeat density across each chromosome or contig like so:
./repeat_density.py CELEG.caenorhabditis_elegans_N2_PRJNA13758_WBPS12.scaffolds.fna.masked 50000
Where:
CELEG.caenorhabditis_elegans_N2_PRJNA13758_WBPS12.scaffolds.fna.masked = masked genome FASTA
50000 = bin size
We first need to extract stats for every transcript in each species using the GFF3 files. As part of the CGP, we have a standardised GFF3 format which makes life slightly easier. We calculated stats using gene_structure_stats.py like so:
./gene_structure_stats.py CELEG.caenorhabditis_elegans_N2_PRJNA13758_WBPS12.annotations.gff3`
Where:
CELEG.caenorhabditis_elegans_N2_PRJNA13758_WBPS12.annotations.gff3 = GFF3
The script produces a TSV file like the one below:
CELEG.F08H9.5 1846 5 1146 4 659 41 CELEG.WBGene00008593
Where the columns are:
- transcriptID
- transcript length
- CDS count
- CDS span
- intron count
- intron span
- geneID
We can then use the output files in combination with a orthology clustering file (or a recipricol best BLAST hit file) to compare the gene structure in orthologous gene pairs.