Hello,
From my understanding, the JGI/IMG pipeline uses Prodigal to predict metagenomic protein sequences from the contigs. I have compiled a protein fasta file containing the sequences from all metagenomic studies, and I found that a large proportion of them contain stop codons, represented by the "*" character.
The Prodigal wiki page suggests that, because of sequencing errors or organisms with gene decay, the output of Anonymous mode should be filtered.
What is a sensible way to filter sequences containing stop codons? I've observed that the majority of these sequences have only one stop codon, at the end of the sequence (84% of the cases), while the rest have multiple stop codons or a stop codon at the start of the sequence.
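To make the cases above concrete, here is a minimal sketch of how I am separating them. The function names `read_fasta` and `classify_stops` are just mine for illustration, and the three labels are my own grouping, not anything defined by Prodigal or JGI/IMG.

```python
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def classify_stops(seq: str) -> str:
    """Group a protein sequence by where its '*' characters occur.

    'none'     -> no '*' at all
    'terminal' -> exactly one '*', as the last character
    'internal' -> any other pattern (multiple '*', or '*' at the start/middle)
    """
    n_stops = seq.count("*")
    if n_stops == 0:
        return "none"
    if n_stops == 1 and seq.endswith("*"):
        return "terminal"
    return "internal"
```

With this, the 84% of cases I mentioned would fall under `terminal`, and what I am unsure how to handle is the `internal` group.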
Can the information stored in the fasta header (which indicates whether the protein is a fragment) be used to validate the predicted sequences? I am mainly interested in filtering out likely bad predictions rather than removing valid protein fragments.
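As a rough sketch of what I mean by using the header: the code below assumes the headers carry a Prodigal-style `partial=NN` attribute (e.g. `partial=00` for a gene with both boundaries found, anything else for a gene running off a contig edge). I have not confirmed that every JGI/IMG header keeps this field, and the `triage` function and its labels are hypothetical, only meant to illustrate the idea.

```python
import re
from typing import Optional

# Assumption: headers contain a Prodigal-style attribute such as "partial=10".
PARTIAL_RE = re.compile(r"partial=([01]{2})")


def partial_flag(header: str) -> Optional[str]:
    """Return the two-character partial flag from the header, or None if absent."""
    match = PARTIAL_RE.search(header)
    return match.group(1) if match else None


def triage(header: str, seq: str) -> str:
    """Label a predicted protein using its stop codons and the partial flag.

    'suspect'  -> a '*' occurs anywhere other than the last position
    'unknown'  -> no partial flag found in the header
    'complete' -> partial=00 and no internal '*'
    'fragment' -> partial flag other than 00 and no internal '*'
    """
    body = seq[:-1] if seq.endswith("*") else seq  # ignore a single trailing stop
    if "*" in body:
        return "suspect"
    flag = partial_flag(header)
    if flag is None:
        return "unknown"
    return "complete" if flag == "00" else "fragment"
```

The intent is that `fragment` records are kept (they look like valid partial genes at contig edges), and only `suspect` records are candidates for removal. Whether that is a reasonable rule is exactly what I would like feedback on.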
Tagging @hyattpd and @tseemann, as I've seen some related discussion in issue #30. Please let me know if this question would be more appropriate for the JGI repository.