In doing an analysis of gene presence/absence, I had some false positives that I tracked down to Prodigal not consistently calling the ORF, even though the sequence was present and identical in the other strains. I'll show an example of two genomes here: Mycobacterium tuberculosis isolates N0157 and N1216.
Prodigal Output
I ran prodigal using a source-compiled from the latest git snapshot as
prodigal -i $genome.fasta -c -m -g 11 -p single -f gff
to match the way that Prokka invokes it, although I suppose defaults should be fine since I'm using complete assemblies with no gaps.
N1216
Prodigal calls three consecutive ORFs: The first is Rv0348, the second is a new prediction, and the third is Rv0349.
1 Prodigal_v2.6.3 CDS 421332 421985 70.5 + 0 ID=1_366;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.610;conf=100.00;score=70.47;cscore=48.90;sscore=21.57;rscore=14.24;uscore=1.95;tscore=4.12;
1 Prodigal_v2.6.3 CDS 421940 422143 6.9 - 0 ID=1_367;partial=00;start_type=GTG;rbs_motif=GGAGG;rbs_spacer=5-10bp;gc_cont=0.627;conf=83.03;score=6.91;cscore=-0.31;sscore=7.22;rscore=9.20;uscore=0.73;tscore=-2.21;
1 Prodigal_v2.6.3 CDS 422231 422647 11.2 + 0 ID=1_368;partial=00;start_type=ATG;rbs_motif=3Base/5BMM;rbs_spacer=13-15bp;gc_cont=0.624;conf=92.92;score=11.20;cscore=8.74;sscore=2.46;rscore=-2.24;uscore=0.57;tscore=4.12;
N0157
Prodigal does not call any ORF between Rv0348 and Rv0349.
1 Prodigal_v2.6.3 CDS 421451 422104 70.8 + 0 ID=1_370;partial=00;start_type=ATG;rbs_motif=AGGAG;rbs_spacer=5-10bp;gc_cont=0.609;conf=100.00;score=70.84;cscore=49.41;sscore=21.43;rscore=14.22;uscore=1.85;tscore=4.11;
1 Prodigal_v2.6.3 CDS 422107 422766 21.5 + 0 ID=1_371;partial=00;start_type=GTG;rbs_motif=None;rbs_spacer=None;gc_cont=0.629;conf=99.30;score=21.55;cscore=23.33;sscore=-1.78;rscore=-3.59;uscore=2.32;tscore=-1.76;
Sequence Check
The sequence of the 204bp middle ORF that was called in N1216 is identical to a corresponding sequence in N0157 that is also between Rv0348-9:
$ blastn -query N1216.fasta -query_loc 421940-422143 -subject N0157.fasta -outfmt 7
# BLASTN 2.15.0+
# Query: 1 4393757
# Database: User specified sequence set (Input: /grp/valafar/data/genomes/N0157.fasta)
# Fields: query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 1 hits found
1 1 100.000 204 0 0 421940 422143 422059 422262 1.47e-105 377
# BLAST processed 1 queries
Epilogue and Data
This is one example but I have many false positives that I suspect are similar scenarios. In this particular example, the ORF is called in 49 genomes and not called in 72 genomes. I'm including the genome sequences for the two strains I examined above.
data.tar.gz
@althonos, this issue also exists in pyrodigal 3.6.3 from bioconda.
In doing an analysis of gene presence/absence, I had some false positives that I tracked down to Prodigal not consistently calling the ORF, even though the sequence was present and identical in the other strains. I'll show an example of two genomes here: Mycobacterium tuberculosis isolates N0157 and N1216.
Prodigal Output
I ran prodigal using a source-compiled from the latest git snapshot as
to match the way that Prokka invokes it, although I suppose defaults should be fine since I'm using complete assemblies with no gaps.
N1216
Prodigal calls three consecutive ORFs: The first is Rv0348, the second is a new prediction, and the third is Rv0349.
N0157
Prodigal does not call any ORF between Rv0348 and Rv0349.
Sequence Check
The sequence of the 204bp middle ORF that was called in N1216 is identical to a corresponding sequence in N0157 that is also between Rv0348-9:
Epilogue and Data
This is one example but I have many false positives that I suspect are similar scenarios. In this particular example, the ORF is called in 49 genomes and not called in 72 genomes. I'm including the genome sequences for the two strains I examined above.
data.tar.gz
@althonos, this issue also exists in pyrodigal 3.6.3 from bioconda.