Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa), make prodigal run 2x faster by Artoria2e5 · Pull Request #118 · hyattpd/Prodigal

Artoria2e5 · 2025-06-14T19:08:04Z

In #62 someone discussed the idea of allowing Prodigal to use any genetic code string in the NCBI format. This pull request aims to do that.

The basic logic of this change follows the draft in #62 (comment). Briefly:

a new function trinuc() is added to bitmap.c to retrieve six bits (3 nucleotides) at the same time
amino-acid translation and start/stop decision logic now instead works through looking up a table using these six bits
new code is added to convert ncbi strings into the six-bit lookup table and back; interpretation of numeric numeric translation table IDs now simply looks up a built-in table of ncbi strings
code that checks for the numeric translation table IDs now directly checks for entries in the six-bit table

Current extent of testing:

It compiles.
(c5ff17c) On anthus_aco.fas, the translation file is the same as the translation from the release binary save for CRLF/LF.
(c5ff17c) On CP146056.1 (-g 4), same translation save for CRLF/LF. As expected the new code is faster, finishing in ~310ms on my machine compared to ~950ms on the old binary.
(c5ff17c) On GCF_000195955.2_ASM19595v2_genomic.fna, new code takes 9017 ms while old takes 14752 ms. Very slight difference in translated files (translation start site diffs) likely due to changes already on GoogleImport.

Not tested:

Everything else!

Things to do:

Compare results on other sequences, especially metagenomes (considering what I did to training.c) and non-11 codes
- I should write a script for this…
Get a bigger input file and compare speed (should be better with all the inlining and elimination of branches on the hot path, right?)
- The speed up factor is about 2x to 3x, which is a little too consistent. I should recompile the GoogleImport branch and use that as the baseline instead.
Add a nuc() to bitmap.h and rewrite is_{a,c,g,t} in terms of it. Only because the current version tests for the opposite thing and then inverts it; this confused the hell out of me.
Add a script for converting any new version of the NCBI ASN.1 table into the format in table.c.
- Directly use the generated files through #include without manual copy paste
Replace hardcoded 33/34 with macros.
Add a command-line interface for converting the x_prodigal_transl string into usual NCBI format. Maybe a whole additional program for exercising the table.{c,h} features.
Put a draft of wiki documentation changes in this PR description.

wiki doc draft

(In: Advice-by-Input-Type)

Non-NCBI Alternate Genetic Codes

Prodigal includes these additional genetic codes in gaps left by NCBI's definition:

Standard genetic code with pyrrolysine on TAG (Kivenson et al. 2021)
Standard genetic code with selenocystine on TGA (biologically meaningless toy example subject to replacement)
Standard genetic code with selenocystine on TGA and pyrrolysine on TAG (biologically meaningless toy example subject to replacement)
Enterosoma/UBA4682 (Shulgina & Eddy 2021)
Peptacetobacter (Shulgina & Eddy 2021)
Anaerococcus, Onthovivens/UBA4855 (Shulgina & Eddy 2021)
Absconditabacterales (Shulgina & Eddy 2021)

These codes can be specified by the -g like any regular NCBI codes.

Prodigal also supports specifying any genetic code in the format used by NCBI's gc.prt string. For example:

-g "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG" is the same as the standard genetic code
-g "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG,----------**--*--------------------M----------------------------" specifies a code with the same amino-acid translation as the standard genetic code, but initiation only at ATG.

When such a genetic code is used, the ordinary transl_table=...; metadata can no longer apply. Instead, Prodigal dumps a x_prodigal_transl= string describing the table in its own format. For details on this format, including how to convert it to the ordinary NCBI format, see Translation Table Internals.

Caveat: Prodigal only considers initiation at ATG, GTG, or TTG. Other initiation codons are automatically removed during parsing. This also applies to NCBI genetic code tables.

Caveat: Prodigal handles all conditional stop codons as basic "always-on" stop codons. They are a eukaryote thing anyways.

(In: New file Translation Table Internals)
Prodigal internally parses any genetic code into a sequence of 64 8-bit characters arranged in the internal codon order used by Prodigal. The characters can be one of:

An uppercase letter, which indicates translation to this amino acid with no possibility of initiation
A lowercase letter, which indicates translation to this amino acid with possibility of initiation as Met

For conversion between this format and the universal NCBI format, prodigal comes with a tool called prodigal-table.

Using a Prodigal-style string directly

-g comes with a "backdoor" for passing a 64-byte internal table directly. When the parameter passed is exactly 64-bytes long and contains any lowercase letter or byte with the 8th bit set, Prodigal assumes that the parameter is already in the internal format and perform no parsing at all. This can be used to circumvent the masking of unsupported initiation codons, but we do not recommend its use beyond debugging and experimentation.

Example: KEQ*RGR*TAPSiVLLKEQ*RGRWTAPSmvllNDHYSGRCTAPSiVLFNDHYSGRCTAPSiVLF is the NCBI table 11 with all its included start codons. However, this does not work because all cases of is_start() is also guarded by a check for one of the three modeled starts!

What is needed for other stop codons to work? First we need to add a new node type in sequence.h. Then un-guard the cases of is_start and assign the other possibilities to a the new type. After that we need to expand type_wt[] in training.h by one element. The last part will break the loading of old training files unless we write a converter, which is too much work.

Check whether we can in fact import old training files (yes, this should be the job of the nonexistent testing script).
~~Instead of warning about mismatched transl_table (or the sixbit version) in training imports, warn about mismatched start/stop (sncbieaa) only.~~
- It's not hard to write (see folded block), but it just feels pointless.
- Or is it? Eh what the heck it's not that many lines.

warn about mismatch sncbieaa only

      if(user_tt >= 0 && user_tt != tinf.trans_table) { 
        fprintf(stderr, "\n\nWarning: user-specified translation table %d "
          "does", user_tt);
        fprintf(stderr, "not match the one in the specified training file "
          "%d!\n", tinf.trans_table);
        if(strcmp(predefined_tables[user_tt][1],
           predefined_tables[tinf.trans_table][1]) == 0) {
          fprintf(stderr, "But the start and stop codons match! Using user-"
            "specified table.\n\n");
          memcpy(tinf.table, orig_table, 64);
        } else
          fprintf(stderr, "The table from the training data will be used.\n\n");
      }
      if((user_tt == -1 || tinf.trans_table == -1) &&
        memcmp(orig_table, tinf.table, 64) != 0) {
        fprintf(stderr, "\n\nWarning: user-specified translation table %.64s "
          "does\n", orig_table);
        fprintf(stderr, "not match the one in the specified training file "
          "%.64s!\n\n", tinf.table);
        char junk[65], s_user[65], s_train[65];
        table_to_eaa(orig_table, junk, s_user);
        table_to_eaa(tinf.table, junk, s_train);
        if(strcmp(predefined_tables[user_tt][1],
           predefined_tables[tinf.trans_table][1]) == 0) {
          fprintf(stderr, "But the start and stop codons match! Using user-"
            "specified table.\n\n");
          memcpy(tinf.table, orig_table, 64);
        } else
          fprintf(stderr, "The table from the training data will be used.\n\n");
      }

Things I find fishy but am too afraid to touch:

In sequence.c, the shine_dalgarno functions have double match[6], cur_ctr, dis_flag. But by reading the code I don't see how they can ever be not-integers in a small range.

The very short functions in bitmap.c are called extremely frequently. Having them inlined would likely allow further optimizations. Now this change would not be needed if the program was built with -flto, but (1) that is not what the Makefile says (2) doing this change is such a very-low-hanging fruit that bringing in the big gun of LTO (with the compile time increases) is not exactly justified.

Because the routines in sequence sometimes return 0x80...

Not quite an ASN.1 parser, but you can tell I've had fun writing it

Also add a tokenize pass to make the parser more robust to whitespace and comments

Artoria2e5 · 2025-06-17T12:57:45Z

Okay. -p meta gives the same result on the mini test file, but loading a training file generated by the old version on our mini test file somehow does not.

Artoria2e5 · 2025-06-17T15:18:41Z

The results for importing an old train file are close enough (!) now, using CP146056.1 (-g 4 -t train). The differences seem to be due to later changes.

Artoria2e5 · 2025-06-27T11:28:57Z

Two regressions from later changes. 514d277 caused results from -p meta to go whack with wildly different model choices. 068c972 caused rbs_motif print to go wrong. I think there's a lesson here about not messing with what works here. Anyways, rolling them back on the PR branch while I figure out what's going on.

I really like these changes: they are not required for the 2x speedup but they address some other parts that stick out to me (two separate orders for multi-base bitstrings, wasted entries in motif[4][4][4096] due to 3/4/5-mers not filling the 4096, the build time, apparently unjustified double in shine_dalgarno_* which turned out to be two pretty hot functions). I hope I can get them back in a non-breaking form soon, but they can be in a different PR anyways.

Artoria2e5 added 17 commits June 14, 2025 21:48

bitmap: add trinuc()

d8a5d0f

sequence: add O and U for kicks

142aea5

table: new routines for parsing {,s}ncbieaa

3c7451a

training: put the new table format in

6d27f71

sequence: use the new table thing

a6ec9a3

node: stop checking for == 1, just "if"

f412276

Because the routines in sequence sometimes return 0x80...

main: make use of the new stuff

7c52ed0

make: fix one NULL deref

cd5a29d

Makefile: recompile training less often

b8f2ef2

node: adapt to dynamic table

3dd086d

gene: output custom table as x_prodigal_transl

c5ff17c

maketable.sh: script for making table.c from gc.prt

3e001ae

Not quite an ASN.1 parser, but you can tell I've had fun writing it

replace 33/34 with a macro

956f37b

replace is_a with direct twobit fetch

56903a8

unhelpful table utility

a39fe37

table: use generated values

a4aa8cc

Artoria2e5 force-pushed the gencode-parser branch from 39ffbcc to a4aa8cc Compare June 15, 2025 01:47

sequence,table: no more 0x80

3156d6e

Artoria2e5 commented Jun 15, 2025

View reviewed changes

Comment thread training.h

Artoria2e5 marked this pull request as ready for review June 16, 2025 18:14

Artoria2e5 added 3 commits June 17, 2025 15:50

.gitignore: expand

a9bd078

Separate generated table file

a317260

Also add a tokenize pass to make the parser more robust to whitespace and comments

fixup! bitmap type

e9ed44a

Artoria2e5 force-pushed the gencode-parser branch from 1fa60a1 to 1f16d0f Compare June 17, 2025 13:58

Artoria2e5 changed the title ~~Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa)~~ Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa) and accidentally double speed Jun 26, 2025

Artoria2e5 changed the title ~~Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa) and accidentally double speed~~ Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa) and make prodigal 2x faster Jun 26, 2025

Artoria2e5 changed the title ~~Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa) and make prodigal 2x faster~~ Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa), make prodigal run 2x faster, make build a bit faster too Jun 26, 2025

Artoria2e5 force-pushed the gencode-parser branch from 1121060 to f095287 Compare June 26, 2025 10:40

Artoria2e5 force-pushed the gencode-parser branch from f095287 to e9ed44a Compare June 27, 2025 11:40

Artoria2e5 changed the title ~~Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa), make prodigal run 2x faster, make build a bit faster too~~ Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa), make prodigal run 2x faster Jun 27, 2025

Artoria2e5 added 2 commits June 27, 2025 22:08

compare.sh: text improvement

f1db546

fixup! amino

7ca9dfe

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa), make prodigal run 2x faster#118

Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa), make prodigal run 2x faster#118
Artoria2e5 wants to merge 23 commits intohyattpd:GoogleImportfrom
Artoria2e5:gencode-parser

Artoria2e5 commented Jun 14, 2025 •

edited

Loading

Uh oh!

Uh oh!

Artoria2e5 commented Jun 17, 2025

Uh oh!

Artoria2e5 commented Jun 17, 2025 •

edited

Loading

Uh oh!

Artoria2e5 commented Jun 27, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Artoria2e5 commented Jun 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Non-NCBI Alternate Genetic Codes

Using a Prodigal-style string directly

Uh oh!

Uh oh!

Artoria2e5 commented Jun 17, 2025

Uh oh!

Artoria2e5 commented Jun 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Artoria2e5 commented Jun 27, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Artoria2e5 commented Jun 14, 2025 •

edited

Loading

Artoria2e5 commented Jun 17, 2025 •

edited

Loading

Artoria2e5 commented Jun 27, 2025 •

edited

Loading