Skip to content

Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa), make prodigal run 2x faster#118

Open
Artoria2e5 wants to merge 23 commits intohyattpd:GoogleImportfrom
Artoria2e5:gencode-parser
Open

Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa), make prodigal run 2x faster#118
Artoria2e5 wants to merge 23 commits intohyattpd:GoogleImportfrom
Artoria2e5:gencode-parser

Conversation

@Artoria2e5
Copy link
Copy Markdown

@Artoria2e5 Artoria2e5 commented Jun 14, 2025

In #62 someone discussed the idea of allowing Prodigal to use any genetic code string in the NCBI format. This pull request aims to do that.

The basic logic of this change follows the draft in #62 (comment). Briefly:

  • a new function trinuc() is added to bitmap.c to retrieve six bits (3 nucleotides) at the same time
  • amino-acid translation and start/stop decision logic now instead works through looking up a table using these six bits
  • new code is added to convert ncbi strings into the six-bit lookup table and back; interpretation of numeric numeric translation table IDs now simply looks up a built-in table of ncbi strings
  • code that checks for the numeric translation table IDs now directly checks for entries in the six-bit table

Current extent of testing:

  • It compiles.
  • (c5ff17c) On anthus_aco.fas, the translation file is the same as the translation from the release binary save for CRLF/LF.
  • (c5ff17c) On CP146056.1 (-g 4), same translation save for CRLF/LF. As expected the new code is faster, finishing in ~310ms on my machine compared to ~950ms on the old binary.
  • (c5ff17c) On GCF_000195955.2_ASM19595v2_genomic.fna, new code takes 9017 ms while old takes 14752 ms. Very slight difference in translated files (translation start site diffs) likely due to changes already on GoogleImport.

Not tested:

  • Everything else!

Things to do:

  • Compare results on other sequences, especially metagenomes (considering what I did to training.c) and non-11 codes
    • I should write a script for this…
  • Get a bigger input file and compare speed (should be better with all the inlining and elimination of branches on the hot path, right?)
    • The speed up factor is about 2x to 3x, which is a little too consistent. I should recompile the GoogleImport branch and use that as the baseline instead.
  • Add a nuc() to bitmap.h and rewrite is_{a,c,g,t} in terms of it. Only because the current version tests for the opposite thing and then inverts it; this confused the hell out of me.
  • Add a script for converting any new version of the NCBI ASN.1 table into the format in table.c.
    • Directly use the generated files through #include without manual copy paste
  • Replace hardcoded 33/34 with macros.
  • Add a command-line interface for converting the x_prodigal_transl string into usual NCBI format. Maybe a whole additional program for exercising the table.{c,h} features.
  • Put a draft of wiki documentation changes in this PR description.
wiki doc draft

(In: Advice-by-Input-Type)

Non-NCBI Alternate Genetic Codes

Prodigal includes these additional genetic codes in gaps left by NCBI's definition:

  1. Standard genetic code with pyrrolysine on TAG (Kivenson et al. 2021)
  2. Standard genetic code with selenocystine on TGA (biologically meaningless toy example subject to replacement)
  3. Standard genetic code with selenocystine on TGA and pyrrolysine on TAG (biologically meaningless toy example subject to replacement)
  4. Enterosoma/UBA4682 (Shulgina & Eddy 2021)
  5. Peptacetobacter (Shulgina & Eddy 2021)
  6. Anaerococcus, Onthovivens/UBA4855 (Shulgina & Eddy 2021)
  7. Absconditabacterales (Shulgina & Eddy 2021)

These codes can be specified by the -g like any regular NCBI codes.

Prodigal also supports specifying any genetic code in the format used by NCBI's gc.prt string. For example:

  • -g "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG" is the same as the standard genetic code
  • -g "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG,----------**--*--------------------M----------------------------" specifies a code with the same amino-acid translation as the standard genetic code, but initiation only at ATG.

When such a genetic code is used, the ordinary transl_table=...; metadata can no longer apply. Instead, Prodigal dumps a x_prodigal_transl= string describing the table in its own format. For details on this format, including how to convert it to the ordinary NCBI format, see Translation Table Internals.

Caveat: Prodigal only considers initiation at ATG, GTG, or TTG. Other initiation codons are automatically removed during parsing. This also applies to NCBI genetic code tables.

Caveat: Prodigal handles all conditional stop codons as basic "always-on" stop codons. They are a eukaryote thing anyways.

(In: New file Translation Table Internals)
Prodigal internally parses any genetic code into a sequence of 64 8-bit characters arranged in the internal codon order used by Prodigal. The characters can be one of:

  • An uppercase letter, which indicates translation to this amino acid with no possibility of initiation
  • A lowercase letter, which indicates translation to this amino acid with possibility of initiation as Met

For conversion between this format and the universal NCBI format, prodigal comes with a tool called prodigal-table.

Using a Prodigal-style string directly

-g comes with a "backdoor" for passing a 64-byte internal table directly. When the parameter passed is exactly 64-bytes long and contains any lowercase letter or byte with the 8th bit set, Prodigal assumes that the parameter is already in the internal format and perform no parsing at all. This can be used to circumvent the masking of unsupported initiation codons, but we do not recommend its use beyond debugging and experimentation.

Example: KEQ*RGR*TAPSiVLLKEQ*RGRWTAPSmvllNDHYSGRCTAPSiVLFNDHYSGRCTAPSiVLF is the NCBI table 11 with all its included start codons. However, this does not work because all cases of is_start() is also guarded by a check for one of the three modeled starts!

What is needed for other stop codons to work? First we need to add a new node type in sequence.h. Then un-guard the cases of is_start and assign the other possibilities to a the new type. After that we need to expand type_wt[] in training.h by one element. The last part will break the loading of old training files unless we write a converter, which is too much work.


  • Check whether we can in fact import old training files (yes, this should be the job of the nonexistent testing script).
  • Instead of warning about mismatched transl_table (or the sixbit version) in training imports, warn about mismatched start/stop (sncbieaa) only.
    • It's not hard to write (see folded block), but it just feels pointless.
    • Or is it? Eh what the heck it's not that many lines.
warn about mismatch sncbieaa only
      if(user_tt >= 0 && user_tt != tinf.trans_table) { 
        fprintf(stderr, "\n\nWarning: user-specified translation table %d "
          "does", user_tt);
        fprintf(stderr, "not match the one in the specified training file "
          "%d!\n", tinf.trans_table);
        if(strcmp(predefined_tables[user_tt][1],
           predefined_tables[tinf.trans_table][1]) == 0) {
          fprintf(stderr, "But the start and stop codons match! Using user-"
            "specified table.\n\n");
          memcpy(tinf.table, orig_table, 64);
        } else
          fprintf(stderr, "The table from the training data will be used.\n\n");
      }
      if((user_tt == -1 || tinf.trans_table == -1) &&
        memcmp(orig_table, tinf.table, 64) != 0) {
        fprintf(stderr, "\n\nWarning: user-specified translation table %.64s "
          "does\n", orig_table);
        fprintf(stderr, "not match the one in the specified training file "
          "%.64s!\n\n", tinf.table);
        char junk[65], s_user[65], s_train[65];
        table_to_eaa(orig_table, junk, s_user);
        table_to_eaa(tinf.table, junk, s_train);
        if(strcmp(predefined_tables[user_tt][1],
           predefined_tables[tinf.trans_table][1]) == 0) {
          fprintf(stderr, "But the start and stop codons match! Using user-"
            "specified table.\n\n");
          memcpy(tinf.table, orig_table, 64);
        } else
          fprintf(stderr, "The table from the training data will be used.\n\n");
      }

Things I find fishy but am too afraid to touch:

  • In sequence.c, the shine_dalgarno functions have double match[6], cur_ctr, dis_flag. But by reading the code I don't see how they can ever be not-integers in a small range.

The very short functions in bitmap.c are called extremely frequently. Having them  inlined would likely allow further optimizations.

Now this change would not be needed if the program was built with -flto, but (1) that is not what the Makefile says (2) doing this change is such a very-low-hanging fruit that bringing in the big gun of LTO (with the compile time increases) is not exactly justified.
Because the routines in sequence sometimes return 0x80...
Not quite an ASN.1 parser, but you can tell I've had fun writing it
Comment thread training.h
@Artoria2e5 Artoria2e5 marked this pull request as ready for review June 16, 2025 18:14
Also add a tokenize pass to make the parser more robust to whitespace
and comments
@Artoria2e5
Copy link
Copy Markdown
Author

Okay. -p meta gives the same result on the mini test file, but loading a training file generated by the old version on our mini test file somehow does not.

@Artoria2e5
Copy link
Copy Markdown
Author

Artoria2e5 commented Jun 17, 2025

The results for importing an old train file are close enough (!) now, using CP146056.1 (-g 4 -t train). The differences seem to be due to later changes.

@Artoria2e5 Artoria2e5 changed the title Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa) Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa) and accidentally double speed Jun 26, 2025
@Artoria2e5 Artoria2e5 changed the title Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa) and accidentally double speed Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa) and make prodigal 2x faster Jun 26, 2025
@Artoria2e5 Artoria2e5 changed the title Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa) and make prodigal 2x faster Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa), make prodigal run 2x faster, make build a bit faster too Jun 26, 2025
@Artoria2e5
Copy link
Copy Markdown
Author

Artoria2e5 commented Jun 27, 2025

Two regressions from later changes. 514d277 caused results from -p meta to go whack with wildly different model choices. 068c972 caused rbs_motif print to go wrong. I think there's a lesson here about not messing with what works here. Anyways, rolling them back on the PR branch while I figure out what's going on.

I really like these changes: they are not required for the 2x speedup but they address some other parts that stick out to me (two separate orders for multi-base bitstrings, wasted entries in motif[4][4][4096] due to 3/4/5-mers not filling the 4096, the build time, apparently unjustified double in shine_dalgarno_* which turned out to be two pretty hot functions). I hope I can get them back in a non-breaking form soon, but they can be in a different PR anyways.

@Artoria2e5 Artoria2e5 changed the title Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa), make prodigal run 2x faster, make build a bit faster too Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa), make prodigal run 2x faster Jun 27, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant