Protein sequence output containing stop codons "*"

Hello,

From my understanding, the [JGI/IMG pipeline uses Prodigal](https://journals.asm.org/doi/10.1128/msystems.00804-20) to predict metagenomic protein sequences from contigs. I have compiled a protein FASTA file containing the sequences from all metagenomic studies, and I found that a large proportion of them contain stop codons, represented by the `*` character.

The Prodigal wiki page suggests that, because of sequencing errors or organisms with gene decay, the output of `Anonymous mode` should be filtered.

What is a sensible way to filter sequences containing stop codons? In 84% of these sequences, the only stop codon is a single one at the end of the sequence. In the remaining cases, there are multiple stop codons, or a stop codon appears at the start of the sequence.
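To make the three cases above concrete, here is a minimal Python sketch of one possible way to separate them. The function names `classify_stops` and `strip_terminal_stop` are hypothetical and not part of Prodigal or IMG. It assumes that a single trailing `*` simply marks the stop codon that ends the gene, and that any other `*` (at the start or in the middle) is worth a closer look.

```python
def classify_stops(seq: str) -> str:
    """Classify a protein sequence by where '*' characters occur.

    Returns "none" (no '*'), "terminal" (only a single trailing '*'),
    or "internal" (at least one '*' that is not the final character).
    """
    seq = seq.strip()
    if "*" not in seq:
        return "none"
    # Ignore one trailing '*', assumed to be the translated stop codon.
    core = seq[:-1] if seq.endswith("*") else seq
    return "internal" if "*" in core else "terminal"


def strip_terminal_stop(seq: str) -> str:
    """Remove a single trailing '*' so downstream tools see only residues."""
    seq = seq.strip()
    return seq[:-1] if seq.endswith("*") else seq
```

Under this assumption, "terminal" sequences would be kept after stripping the trailing `*`, and only "internal" ones would be candidates for removal.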

Can the information stored in the [FASTA header](https://github.com/hyattpd/prodigal/wiki/understanding-the-prodigal-output#protein-translations) (which indicates whether the protein is a fragment) be used to validate the predicted sequences? I am mainly interested in filtering out likely bad predictions rather than removing valid protein fragments.
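As a rough sketch of how the header could be read, the Python below parses a header in the layout described on the Prodigal wiki (`ID # start # end # strand # key=value;...`) and extracts the `partial` field, where `0` marks a true gene boundary and `1` an unfinished edge (first digit for the left edge, second for the right). The function name `parse_prodigal_header` is hypothetical, and it is an assumption that the IMG files keep Prodigal's header format unchanged.

```python
def parse_prodigal_header(header: str) -> dict:
    """Parse a Prodigal protein FASTA header into a small dictionary.

    Assumes the wiki-documented layout:
    >ID # start # end # strand # key=value;key=value;...
    """
    fields = header.strip().lstrip(">").split(" # ")
    if len(fields) < 5:
        raise ValueError(f"not a Prodigal-style header: {header!r}")
    # The last field is a semicolon-separated list of key=value pairs.
    attrs = dict(
        item.split("=", 1) for item in fields[4].split(";") if "=" in item
    )
    partial = attrs.get("partial", "")
    return {
        "id": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "strand": int(fields[3]),
        "partial": partial,
        # "00" means both edges are true boundaries (complete gene).
        "is_complete": partial == "00",
    }


example = (
    ">NC_000913_1 # 337 # 2799 # 1 # "
    "ID=1_1;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;"
    "rbs_spacer=5-10bp;gc_cont=0.531"
)
info = parse_prodigal_header(example)
```

One possible use would be to treat a sequence with an internal `*` and `partial=00` as more suspicious than a fragment flagged as partial, although whether that is a sound criterion is exactly what I am asking.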
 

Tagging @hyattpd and @tseemann, as I've seen a related discussion in another open issue: https://github.com/hyattpd/Prodigal/issues/30. Please let me know if this question would be more appropriate for the JGI repository.
