Protein sequence output containing stop codons "*" #121

@Ieremie

Hello,

As I understand it, the JGI/IMG pipeline uses Prodigal to predict metagenomic protein sequences from the contigs. I have compiled a protein FASTA file containing the sequences from all metagenomic studies and found that a large proportion of them contain stop codons, represented by the "*" character.

The Prodigal wiki page suggests that the output of Anonymous mode should be filtered, because of sequencing errors and organisms with gene decay.

What is a sensible way to filter sequences containing stop codons? I've observed that the majority of these sequences have only one stop codon, at the end of the sequence (84% of the cases), while the rest have multiple stop codons or a stop codon at the start of the sequence.
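For reference, a minimal sketch of how the cases above can be tallied. The function names `read_fasta` and `classify_stops` and the three category labels are my own, not part of Prodigal or the IMG pipeline, and the tiny in-memory records are made up purely for illustration.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def classify_stops(seq):
    """Label a protein sequence by where the '*' character appears."""
    if "*" not in seq:
        return "no_stop"
    if seq.count("*") == 1 and seq.endswith("*"):
        return "terminal_only"
    # Everything else: several stops, or a single stop that is not at the end
    # (this includes a '*' at the very start of the sequence).
    return "internal_or_multiple"


# Made-up records, one per pattern described above.
example = [">a", "MKV*", ">b", "MK*V*", ">c", "*MKV", ">d", "MKV"]
counts = Counter(classify_stops(seq) for _, seq in read_fasta(example))
print(counts)
```

For a real file, the same generator can be fed an open file handle instead of the example list.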

Can the information stored in the FASTA header (which indicates whether the protein is a fragment) be used to validate the predicted sequences? I am mainly interested in filtering out likely bad predictions rather than removing valid protein fragments.
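To make the header question concrete, here is a rough sketch of reading the fragment flag. It assumes the headers keep Prodigal's native `partial=XX` field, where, per the Prodigal documentation, `00` means both gene boundaries were found and a `1` marks an edge that runs off the contig. Whether the IMG-distributed files preserve this field is an assumption on my side. The helper names and the example header are hypothetical.

```python
import re

# Assumption: headers carry Prodigal's native "partial=XX" field.
PARTIAL_RE = re.compile(r"partial=([01]{2})")


def partial_flag(header):
    """Return the two-character partial flag, or None if the field is absent."""
    match = PARTIAL_RE.search(header)
    return match.group(1) if match else None


def has_internal_stop(seq):
    """True if '*' occurs anywhere other than as the final character."""
    body = seq[:-1] if seq.endswith("*") else seq
    return "*" in body


# Made-up header in Prodigal's native layout, for illustration only.
header = "contig_1_1 # 3 # 98 # 1 # ID=1_1;partial=10;start_type=Edge"
flag = partial_flag(header)
print(flag, has_internal_stop("MKV*"), has_internal_stop("MK*V*"))
```

With both pieces of information available, sequences with an internal stop could be set aside for inspection, while sequences whose flag is not `00` could be kept as probable valid fragments; I am not sure this is the right criterion, hence the question.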

Tagging @hyattpd and @tseemann, as I've seen a related discussion in issue #30. Please let me know if this question would be more appropriate for the JGI repository.
