2.4 Annotation
A functional annotation was then performed on the bins to see which functional genes they contain.
The annotation used the Prokka pipeline, which combines structural and functional annotation of prokaryotic genomes. The command was run with --force, which overwrites any existing output directory, and --addgenes, which adds a gene feature for each CDS feature. Bins labeled as potential Archaea were additionally run with --kingdom Archaea, since Prokka defaults to 'Bacteria'. The script performing this step was 07_combined_annotation.sh.
The annotation produced several files per bin describing the number of genes identified and their potential functions. A summary of these results is shown in table 1.
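As an illustration of how the options described above fit together, the following is a minimal sketch that only assembles the Prokka argument list for one bin. It is not the contents of 07_combined_annotation.sh; the function name, the input path and the output directory are hypothetical, and `--outdir` and `--prefix` are standard Prokka options that the text above does not mention.

```python
from pathlib import Path


def build_prokka_command(bin_fasta, outdir, archaea=False):
    """Return the Prokka argument list for one bin (the command is not run)."""
    fasta = Path(bin_fasta)
    cmd = [
        "prokka",
        "--outdir", str(outdir),
        "--prefix", fasta.stem,  # name the output files after the bin
        "--force",               # overwrite an existing output directory
        "--addgenes",            # add a 'gene' feature for each CDS feature
    ]
    if archaea:
        # Prokka assumes --kingdom Bacteria unless told otherwise
        cmd += ["--kingdom", "Archaea"]
    cmd.append(str(fasta))
    return cmd


# Hypothetical paths, for illustration only
cmd = build_prokka_command("bins/bin_4.fa", "annotation/bin_4", archaea=True)
print(" ".join(cmd))
```

The list could be passed to `subprocess.run(cmd, check=True)` on a machine where Prokka is installed.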
Table 1. Summary of annotation results
| Bin name | CDS | tRNA | rRNA | tmRNA | Hypothetical proteins |
|---|---|---|---|---|---|
| bin_1 | 1819 | 29 | | | 1819 |
| bin_2 | 2418 | 32 | | | 953 |
| bin_4 | 1182 | 40 | 1 | | 610 |
| bin_6 | 1289 | 34 | | | 582 |
| bin_8 | 1530 | 9 | | | 633 |
| bin_11 | 2872 | 52 | 2 | 1 | 1316 |
| bin_12 | 2415 | 36 | 1 | | 812 |
| bin_14 | 1545 | 9 | 1 | | 463 |
| bin_15 | 1984 | 38 | 1 | | 874 |
| bin_17 | 1170 | 13 | | | 540 |
| bin_18 | 1385 | 25 | 1 | | 456 |
| bin_19 | 1717 | 31 | 6 | 1 | 515 |
| bin_20 | 688 | 11 | | | 344 |
| bin_24 | 1508 | 36 | 1 | 1 | 584 |
| bin_25 | 1297 | 11 | | | 575 |
| bin_26 | 1405 | 36 | 1 | 1 | 575 |
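
A table like this can be assembled from the plain-text summary (.txt) that Prokka writes for each run, which lists feature counts as `key: value` lines. The sketch below assumes that layout; the function names are hypothetical, and the contigs and bases values in the example text are placeholders rather than results from this project. The hypothetical protein count is not part of that summary and has to be derived from the product descriptions instead.

```python
FEATURES = ["CDS", "tRNA", "rRNA", "tmRNA"]


def parse_prokka_summary(text):
    """Collect the integer 'key: value' lines of a Prokka .txt summary."""
    counts = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            counts[key.strip()] = int(value)
    return counts


def table_row(bin_name, counts):
    """Format one markdown table row; missing feature types become empty cells."""
    cells = [str(counts.get(feature, "")) for feature in FEATURES]
    return "| " + " | ".join([bin_name] + cells) + " |"


# Example text in the assumed layout; contigs and bases are placeholder values
example = """organism: Genus species strain
contigs: 100
bases: 2000000
CDS: 2872
tRNA: 52
rRNA: 2
tmRNA: 1
"""

counts = parse_prokka_summary(example)
print(table_row("bin_11", counts))
```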
It is hard to evaluate how well the annotation performed, but the results look promising and are most likely good enough to proceed with.
What types of features are detected by the software? Which ones are more reliable
a priori?
Prokka detected the following feature types: CDS, tRNA, rRNA and tmRNA. Of these, I would say that the highly conserved rRNA genes, and probably also the tRNA and tmRNA genes, are the most reliable. The highly conserved nature of these genes should make them easier to identify correctly.
How many features of each kind are detected in your contigs? Do you detect the
same number of features as the authors? How do they differ?
The number of features per bin is shown in table 1, but the original article does not report these results, so no comparison can be made.
Why is it more difficult to do the functional annotation in eukaryotic genomes?
Splicing makes it harder to assign functions to genes, and can even make it harder to identify the correct ORF. This is primarily because of alternative splicing and because stop codons can appear within introns.
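
The effect of a stop codon inside an intron can be shown with a toy example. The sketch below uses made-up sequences and a deliberately naive ORF scan (first ATG to first in-frame stop); it is not how Prokka or any eukaryotic gene finder works, only an illustration of why scanning unspliced DNA truncates the ORF.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_orf(seq):
    """Return the codons from the first ATG up to the first in-frame stop codon."""
    start = seq.find("ATG")
    if start == -1:
        return []
    codons = []
    for i in range(start, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            return codons
        codons.append(codon)
    return codons  # no stop codon found before the end of the sequence


# Made-up sequences for illustration
exon1 = "ATGGCT"
intron = "GTAAGTTAGCAG"   # contains an in-frame TAG
exon2 = "GGATTCAAACCCTAA"

unspliced = exon1 + intron + exon2
spliced = exon1 + exon2

print(len(first_orf(unspliced)), "codons when the intron is read through")
print(len(first_orf(spliced)), "codons in the spliced transcript")
```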
How many genes are annotated as ‘hypothetical protein’? Why is that so? How
would you tackle that problem?
The number of genes annotated as hypothetical proteins in each bin is shown in table 1. These genes have been identified as possibly coding for proteins, but have no match to any of the proteins in Prokka's databases. One way to tackle this problem would be to compare the sequences against other databases that might have them on record. It is also possible to compare against other species, as functional genes are often conserved to some degree. A BLAST search could be an option for this.
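
As a sketch of the first step of such a follow-up, the hypothetical proteins can be pulled out of the protein FASTA file (.faa) that Prokka writes, so that only those sequences are sent to a BLAST search. This assumes the product description appears in each FASTA header, as in Prokka's output; the function names, locus tags and sequences below are invented for illustration.

```python
def split_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def hypothetical_records(text):
    """Keep only the records whose header says 'hypothetical protein'."""
    return [(h, s) for h, s in split_fasta(text) if "hypothetical protein" in h]


# Invented records in the assumed header layout
example_faa = """>BIN_00001 hypothetical protein
MKTLLVAG
LLSA
>BIN_00002 DNA gyrase subunit A
MSDLAREI
>BIN_00003 hypothetical protein
MNKQW
"""

records = hypothetical_records(example_faa)
print(len(records), "hypothetical proteins")
for header, sequence in records:
    print(">" + header)
    print(sequence)
```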
How can you evaluate the quality of the obtained functional annotation?
You could compare your results to earlier studies or public databases that might contain functional regions of genomes from the same species or closely related ones. Another option is to use your biological knowledge to assess how plausible the annotation is in its context. But the most accurate, and most extensive, way to evaluate the quality of the predicted annotation is to perform actual in vivo tests, such as knock-out studies.
How comparable are the results obtained from two different structural annotation
softwares?
They are directly comparable for prokaryotic genes, as each gene results in only one type of product and function. The exception is multifunctional proteins.
It becomes more complex for eukaryotic genes because of alternative splicing, which makes it possible for a single gene to produce several proteins with different functions. Two software tools might thus assign different functions to the same gene even though both are correct.
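
For prokaryotic predictions, one simple way to make such a comparison concrete is to treat each predicted gene as a (contig, start, end, strand) tuple and count how many predictions the two tools share. This is a minimal sketch with invented coordinates and a hypothetical function name; it requires exact coordinate matches, which is stricter than comparisons that only require a shared stop position.

```python
def compare_predictions(genes_a, genes_b):
    """Count the genes predicted by both tools and by only one of them."""
    a, b = set(genes_a), set(genes_b)
    return {
        "shared": len(a & b),
        "only_a": len(a - b),
        "only_b": len(b - a),
    }


# Invented (contig, start, end, strand) predictions from two tools
tool_a = [("contig_1", 100, 400, "+"), ("contig_1", 900, 1500, "-"), ("contig_2", 10, 310, "+")]
tool_b = [("contig_1", 100, 400, "+"), ("contig_1", 870, 1500, "-")]

print(compare_predictions(tool_a, tool_b))
```

In this example the second gene differs only in its start coordinate, so it is counted as unique to each tool rather than as shared.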