
2.4 Annotation

Karl-Svard edited this page May 26, 2020 · 7 revisions

Description

A functional annotation was then performed on the bins to see which functional genes they might contain.

Method and tools

The annotation was based on the software pipeline Prokka, which combines structural and functional annotation of prokaryotic genomes. The command was run with --force, which overwrites any existing output directory, and --addgenes, which adds a gene feature for each CDS feature. The bins labeled as potential Archaea were additionally run with --kingdom Archaea, as Prokka's default is 'Bacteria'. The script performing this step was 07_combined_annotation.sh.
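As a rough illustration of the options described above, the Prokka call for a single bin could be assembled as follows. This is a minimal sketch and not the content of 07_combined_annotation.sh: the helper names, the file paths and the output directory layout are assumptions made for the example. Only the prokka flags themselves (--outdir, --prefix, --force, --addgenes, --kingdom) are taken from Prokka's command-line interface.

```python
import subprocess
from pathlib import Path


def build_prokka_command(bin_fasta, outdir, archaea=False):
    """Return the Prokka argument list for one bin (nothing is executed here)."""
    prefix = Path(bin_fasta).stem  # e.g. "bin_11" from "bins/bin_11.fa" (hypothetical path)
    cmd = [
        "prokka",
        "--outdir", str(Path(outdir) / prefix),
        "--prefix", prefix,
        "--force",     # overwrite an existing output directory
        "--addgenes",  # add a 'gene' feature for each CDS feature
    ]
    if archaea:
        # Prokka assumes Bacteria unless told otherwise
        cmd += ["--kingdom", "Archaea"]
    cmd.append(str(bin_fasta))
    return cmd


def annotate_bin(bin_fasta, outdir, archaea=False):
    """Run Prokka on one bin; requires the prokka executable on the PATH."""
    subprocess.run(build_prokka_command(bin_fasta, outdir, archaea), check=True)
```

Keeping the command construction separate from the execution makes it possible to print and inspect the exact call for each bin before anything is run.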

Results

The annotation resulted in each bin receiving several files describing the number of genes identified and their potential function. A summary of these results is shown in Table 1.

Table 1. Summary of annotation results

Bin name CDS tRNA rRNA tmRNA Hypothetical proteins
bin_1 1819 29 1819
bin_2 2418 32 953
bin_4 1182 40 1 610
bin_6 1289 34 582
bin_8 1530 9 633
bin_11 2872 52 2 1 1316
bin_12 2415 36 1 812
bin_14 1545 9 1 463
bin_15 1984 38 1 874
bin_17 1170 13 540
bin_18 1385 25 1 456
bin_19 1717 31 6 1 515
bin_20 688 11 344
bin_24 1508 36 1 1 584
bin_25 1297 11 575
bin_26 1405 36 1 1 575
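Counts like those in Table 1 can be collected from Prokka's output files rather than read off by hand. The sketch below assumes that the per-bin summary file (`<prefix>.txt`) contains one `key: value` line per feature type, and that the feature table (`<prefix>.tsv`) has a header row with `ftype` and `product` columns. Both assumptions should be checked against the Prokka version in use. The function names are made up for this example.

```python
import csv
import io

FEATURES = ("CDS", "tRNA", "rRNA", "tmRNA")


def parse_prokka_summary(text):
    """Extract feature counts from the text of a Prokka <prefix>.txt summary."""
    counts = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        # Keep only numeric entries; lines such as 'organism: ...' are skipped
        if sep and value.isdigit():
            counts[key.strip()] = int(value)
    # Report 0 for feature types that do not appear in the summary
    return {feature: counts.get(feature, 0) for feature in FEATURES}


def count_hypothetical(tsv_text):
    """Count CDS rows in a Prokka <prefix>.tsv annotated as 'hypothetical protein'."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return sum(
        1
        for row in reader
        if row["ftype"] == "CDS" and row["product"] == "hypothetical protein"
    )
```

Both functions take the file contents as a string, so they can be applied to each bin with something like `parse_prokka_summary(Path(summary_file).read_text())`.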

Discussion

It is hard to evaluate how well the annotation performed, but the results look promising and will most likely be good enough to progress further.

Questions from student manual

Annotation

What types of features are detected by the software? Which ones are more reliable a priori?
Prokka detected the following feature types: CDS, tRNA, rRNA and tmRNA. Of these, I would say that the highly conserved rRNA genes, and maybe the tRNA/tmRNA genes, are the most reliable. The highly conserved nature of these genes should make them easier to identify correctly.

How many features of each kind are detected in your contigs? Do you detect the same number of features as the authors? How do they differ?
The number of features per bin is shown in Table 1, but the authors of the original article have not published these results, so no comparison can be made.

Why is it more difficult to do the functional annotation in eukaryotic genomes?
The presence of splicing makes it harder to classify the function of genes, and can even make it harder to identify the correct ORF. This is primarily because of alternative splicing and the chance of stop codons appearing within introns.
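The point about stop codons in introns can be shown with a toy example. The sequences below are invented for illustration and do not come from any real genome. Read directly from the genomic sequence, the reading frame hits a stop codon inside the intron, while the spliced transcript continues to the true stop codon in the second exon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_inframe_stop(seq, start=0):
    """Return the index of the first in-frame stop codon, or None if there is none."""
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def splice(seq, introns):
    """Remove introns given as (start, end) half-open intervals and join the exons."""
    parts, pos = [], 0
    for start, end in sorted(introns):
        parts.append(seq[pos:start])
        pos = end
    parts.append(seq[pos:])
    return "".join(parts)


# Invented toy gene: two exons separated by one intron
exon1, intron, exon2 = "ATGGCC", "GTATAGCAG", "GGTAAATGA"
genomic = exon1 + intron + exon2
mrna = splice(genomic, [(len(exon1), len(exon1) + len(intron))])

print(first_inframe_stop(genomic))  # 9: the TAG inside the intron ends the naive ORF early
print(first_inframe_stop(mrna))     # 12: the real stop codon at the end of exon 2
```

A prokaryotic gene finder can treat the longest uninterrupted reading frame as the gene, whereas a eukaryotic one first has to predict the exon boundaries.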

How many genes are annotated as ‘hypothetical protein’? Why is that so? How would you tackle that problem?
The number of genes annotated as hypothetical proteins in each bin is shown in Table 1. These genes have been identified as possibly coding for proteins, but have no matches to any of the proteins in Prokka's databases. One way to tackle this problem would be to compare these sequences to other databases that might have them on record. It is even possible to compare against other species, as functional genes are often conserved to some degree. A BLAST search could be an option for this.
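To prepare such a search, the hypothetical proteins first have to be separated from the rest of the predicted proteins. The sketch below filters the text of a Prokka protein FASTA file (`<prefix>.faa`), assuming that each header has the form `>LOCUS_TAG product description`. The function names are made up for this example and the header layout should be verified on the actual files.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def hypothetical_records(faa_text):
    """Return FASTA text with only the records annotated as 'hypothetical protein'."""
    out = []
    for header, seq in read_fasta(faa_text):
        # Assumed header layout: locus tag, a space, then the product description
        _, _, product = header.partition(" ")
        if product.strip() == "hypothetical protein":
            out.append(f">{header}\n{seq}")
    return "\n".join(out) + ("\n" if out else "")
```

The returned text can be written to a new file and used as the query for a protein BLAST search against a broader database.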

How can you evaluate the quality of the obtained functional annotation?
You could compare your results to earlier studies or public databases that might contain functional regions of genomes of the same or closely related species. Another option is to use your biological knowledge to assess how plausible the annotation is in its context. The most accurate and extensive way of evaluating the quality of the predicted annotation, however, is to perform actual in vivo tests, such as knock-out studies.

How comparable are the results obtained from two different structural annotation softwares?
They are directly comparable for prokaryotic genes, as each gene only results in one type of product and function. Multipurpose proteins are an exception to this.

It becomes a bit more complex for eukaryotic genes because of the possibility of alternative splicing. This makes it possible for a single gene to produce several proteins with different functions. Each software might thus assign different functions to the same gene even though both are correct.
