Add ability to use arbitrary genetic code string (ncbieaa, sncbieaa), make prodigal run 2x faster #118
Artoria2e5 wants to merge 23 commits into hyattpd:GoogleImport
The very short functions in bitmap.c are called extremely frequently. Having them inlined would likely allow further optimizations. This change would not be needed if the program were built with -flto, but (1) that is not what the Makefile says, and (2) the change is such low-hanging fruit that bringing in the big gun of LTO (with its compile-time increase) is not exactly justified.
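A minimal sketch of the inlining idea, not the actual contents of bitmap.c: tiny accessors declared `static inline` in a header are visible at every call site, so the compiler can inline them without `-flto`. The names `bm_test`, `bm_set`, and `bm_clear`, the header name, and the one-bit-per-index layout are assumptions made for illustration.

```c
/* bitmap_inline.h (hypothetical name): accessors small enough to inline. */
#include <stddef.h>

/* Return 1 if bit `ndx` of the bitmap is set, else 0. */
static inline unsigned char bm_test(const unsigned char *bm, size_t ndx) {
  return (unsigned char)((bm[ndx >> 3] >> (ndx & 7u)) & 1u);
}

/* Set bit `ndx` of the bitmap. */
static inline void bm_set(unsigned char *bm, size_t ndx) {
  bm[ndx >> 3] |= (unsigned char)(1u << (ndx & 7u));
}

/* Clear bit `ndx` of the bitmap. */
static inline void bm_clear(unsigned char *bm, size_t ndx) {
  bm[ndx >> 3] &= (unsigned char)~(1u << (ndx & 7u));
}
```

Because the definitions live in the header, each translation unit gets its own copy to inline, which is the usual trade-off against a single out-of-line definition in a .c file.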
Because the routines in sequence sometimes return 0x80...
Not quite an ASN.1 parser, but you can tell I've had fun writing it
39ffbcc to a4aa8cc
Also add a tokenize pass to make the parser more robust to whitespace and comments
Okay.
1fa60a1 to 1f16d0f
The results for importing an old train file are close enough (!) now, using CP146056.1 (
1121060 to f095287
Two regressions from later changes. 514d277 caused results from

I really like these changes: they are not required for the 2x speedup, but they address some other parts that stick out to me (two separate orders for multi-base bitstrings, wasted entries in `motif[4][4][4096]` due to 3/4/5-mers not filling the 4096, the build time, and the apparently unjustified `double` in `shine_dalgarno_*`, which turned out to be two pretty hot functions). I hope I can get them back in a non-breaking form soon, but they can be in a different PR anyways.
f095287 to e9ed44a
In #62 someone discussed the idea of allowing Prodigal to use any genetic code string in the NCBI format. This pull request aims to do that.
The basic logic of this change follows the draft in #62 (comment). Briefly:
- `trinuc()` is added to bitmap.c to retrieve six bits (3 nucleotides) at the same time

Current extent of testing:

- `-g 4`), same translation save for CRLF/LF. As expected, the new code is faster, finishing in ~310 ms on my machine compared to ~950 ms on the old binary.
- `GCF_000195955.2_ASM19595v2_genomic.fna`: new code takes 9017 ms while old takes 14752 ms. Very slight difference in translated files (translation start site diffs), likely due to changes already on GoogleImport.

Not tested:
Things to do:
- `#include` without manual copy paste
- `x_prodigal_transl` string into usual NCBI format. Maybe a whole additional program for exercising the table.{c,h} features.

wiki doc draft
(In: Advice-by-Input-Type)
Non-NCBI Alternate Genetic Codes
Prodigal includes these additional genetic codes in gaps left by NCBI's definition:
These codes can be specified with `-g` like any regular NCBI code.

Prodigal also supports specifying any genetic code in the format used by NCBI's gc.prt string. For example:

- `-g "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"` is the same as the standard genetic code
- `-g "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG,----------**--*--------------------M----------------------------"` specifies a code with the same amino-acid translation as the standard genetic code, but initiation only at ATG.

When such a genetic code is used, the ordinary `transl_table=...;` metadata can no longer apply. Instead, Prodigal dumps an `x_prodigal_transl=` string describing the table in its own format. For details on this format, including how to convert it to the ordinary NCBI format, see Translation Table Internals.

Caveat: Prodigal only considers initiation at ATG, GTG, or TTG. Other initiation codons are automatically removed during parsing. This also applies to NCBI genetic code tables.
Caveat: Prodigal handles all conditional stop codons as basic "always-on" stop codons. They are a eukaryote thing anyways.
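The two caveats above can be illustrated with a small sketch. This is not the parser in this PR: `struct gcode_sketch`, `gcode_from_ncbi()`, and the helper names are hypothetical, and the sketch takes the ncbieaa and sncbieaa strings as two separate 64-character arguments rather than handling the comma-separated `-g` form. It assumes the usual NCBI conventions: codons are ordered with bases ranked T, C, A, G, `M` in sncbieaa marks an initiator, and `*` marks a terminator.

```c
#include <stddef.h>
#include <string.h>

/* Rank of a base in NCBI table order (T, C, A, G), or -1 if unknown. */
static int ncbi_base_rank(char b) {
  switch (b) {
    case 'T': return 0;
    case 'C': return 1;
    case 'A': return 2;
    case 'G': return 3;
    default:  return -1;
  }
}

/* Index of a codon such as "ATG" in a 64-character NCBI string, or -1. */
static int ncbi_codon_index(const char *codon) {
  int a = ncbi_base_rank(codon[0]);
  int b = ncbi_base_rank(codon[1]);
  int c = ncbi_base_rank(codon[2]);
  if (a < 0 || b < 0 || c < 0) return -1;
  return 16 * a + 4 * b + c;
}

/* Hypothetical parsed form: one amino acid and two flags per codon. */
struct gcode_sketch {
  char aa[64];
  unsigned char is_start[64];
  unsigned char is_stop[64];
};

/* Fill `out` from two 64-character strings. Returns 0, or -1 on bad length. */
static int gcode_from_ncbi(const char *ncbieaa, const char *sncbieaa,
                           struct gcode_sketch *out) {
  static const char *const modeled[3] = {"ATG", "GTG", "TTG"};
  int i, k;

  if (strlen(ncbieaa) != 64 || strlen(sncbieaa) != 64) return -1;

  for (i = 0; i < 64; i++) {
    out->aa[i] = ncbieaa[i];
    out->is_start[i] = 0;
    /* A '*' in either string becomes an unconditional stop. */
    out->is_stop[i] = (unsigned char)(ncbieaa[i] == '*' || sncbieaa[i] == '*');
  }
  /* Only the three modeled start codons can survive; others are dropped. */
  for (k = 0; k < 3; k++) {
    int idx = ncbi_codon_index(modeled[k]);
    if (sncbieaa[idx] == 'M') out->is_start[idx] = 1;
  }
  return 0;
}
```

With a start string that marks both ATG and CTG, only ATG ends up flagged, since CTG is not one of the three modeled starts.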
(In: New file Translation Table Internals)
Prodigal internally parses any genetic code into a sequence of 64 8-bit characters arranged in the internal codon order used by Prodigal. The characters can be one of:
For conversion between this format and the universal NCBI format, Prodigal comes with a tool called `prodigal-table`.

Using a Prodigal-style string directly

`-g` comes with a "backdoor" for passing a 64-byte internal table directly. When the parameter passed is exactly 64 bytes long and contains any lowercase letter or byte with the 8th bit set, Prodigal assumes that the parameter is already in the internal format and performs no parsing at all. This can be used to circumvent the masking of unsupported initiation codons, but we do not recommend its use beyond debugging and experimentation.

Example: `KEQ*RGR*TAPSiVLLKEQ*RGRWTAPSmvllNDHYSGRCTAPSiVLFNDHYSGRCTAPSiVLF` is the NCBI table 11 with all its included start codons. However, this does not work because every use of `is_start()` is also guarded by a check for one of the three modeled starts!

Instead of warning about mismatched transl_table (or the sixbit version) in training imports, warn about mismatched start/stop (sncbieaa) only.
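Going back to the `-g` backdoor rule in the wiki draft (exactly 64 bytes, with at least one lowercase letter or one byte with the 8th bit set), the check can be sketched as follows. The function name `looks_like_internal_table` is hypothetical, and this shows only the detection rule as worded in the draft, not the code in this PR.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if `arg` matches the backdoor rule from the wiki draft:
 * exactly 64 bytes long and containing at least one lowercase ASCII
 * letter or one byte with the 8th bit set. Otherwise return 0. */
static int looks_like_internal_table(const char *arg) {
  size_t i;
  if (strlen(arg) != 64) return 0;
  for (i = 0; i < 64; i++) {
    unsigned char c = (unsigned char)arg[i];
    if ((c & 0x80u) != 0 || (c >= 'a' && c <= 'z')) return 1;
  }
  return 0;
}
```

An all-uppercase ncbieaa string is also 64 bytes long but has no lowercase letters or high bytes, so it falls through to normal parsing, while the table 11 example above is taken as an internal table.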
Things I find fishy but am too afraid to touch:
- In sequence.c, the `shine_dalgarno` functions have `double match[6], cur_ctr, dis_flag`. But from reading the code I don't see how they can ever be anything other than integers in a small range.