Introduce a reentrant C library API for Prodigal gene prediction, enabling embedding without shelling out. Designed for GPL-boundary integration (Rust FFI, Arrow IPC) following DEVELOPER-GUIDANCE.md.

New files:
- prodigal.h: Public API with opaque context, SOA/AOS output structs, config with struct_size versioning, error codes, callbacks, allocator hooks, and extern "C" guards
- prodigal_internal.h: Internal context struct definition
- prodigal_api.c: Full implementation of config, context lifecycle, buffer-based sequence input, training pipeline, training serialization, single-genome and metagenomic gene finding, SOA/AOS extraction with 16-byte aligned single backing allocation, custom allocator support, log/progress callbacks with cancellation, training parameter setters
- test_api.c: 40-test suite covering all phases, validated against native Prodigal reference output (22 sequences, exact coordinate match)
- testdata/ground_truth/: Reference outputs from native Prodigal for both metagenomic and single-genome modes

Modified files:
- Makefile: Produces libprodigal.a, libprodigal.so, prodigal CLI, and test_api runner. Core objects separated from CLI.
- main.c: Added #ifndef PRODIGAL_NO_MAIN guard for library builds

No changes to algorithm code (node.c, dprog.c, gene.c, sequence.c, training.c, metagenomic.c, bitmap.c). All existing behavior preserved.

Remaining: Phase 11 (CLI adaptation to use library) and Phase 14 (final regression and cleanup).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
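To illustrate what "config with struct_size versioning" typically means, here is a minimal sketch of the pattern. Everything in it is hypothetical: `demo_config_t`, `demo_config_v1_t`, `demo_config_defaults` and `demo_config_normalize` are illustration-only names, not the layout or functions of the real prodigal_config_t, and the default of 11 is only a placeholder value. The idea is that the caller records the size of the struct it was compiled against, so a newer library can accept an older, shorter struct and fill the missing tail with defaults.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "v2" config. NOT the real prodigal_config_t layout. */
typedef struct {
    uint32_t struct_size;  /* caller sets this to sizeof() of the struct it compiled against */
    int32_t  meta_mode;    /* field present since "v1" */
    int32_t  trans_table;  /* field added in "v2" */
} demo_config_t;

/* The same struct as an older "v1" caller would have compiled it. */
typedef struct {
    uint32_t struct_size;
    int32_t  meta_mode;
} demo_config_v1_t;

/* Fill every field with a library-side default (values are illustrative). */
static void demo_config_defaults(demo_config_t *cfg)
{
    assert(cfg != NULL);
    memset(cfg, 0, sizeof *cfg);
    cfg->struct_size = (uint32_t)sizeof *cfg;
    cfg->trans_table = 11;
}

/* Copy only the bytes the caller actually provided; fields beyond the
 * caller's struct_size keep their defaults. Returns 0 on success, -1 on
 * a NULL argument or an implausibly small struct_size. */
static int demo_config_normalize(const void *user_cfg, demo_config_t *out)
{
    uint32_t n;

    if (user_cfg == NULL || out == NULL)
        return -1;
    memcpy(&n, user_cfg, sizeof n);           /* struct_size is always the first field */
    if (n < sizeof(demo_config_v1_t))
        return -1;                            /* smaller than any known version */
    if (n > sizeof *out)
        n = (uint32_t)sizeof *out;            /* newer caller: ignore the unknown tail */
    demo_config_defaults(out);
    memcpy(out, user_cfg, n);
    out->struct_size = (uint32_t)sizeof *out; /* normalized to the library's own version */
    return 0;
}
```

The sketch uses fixed-width 32-bit fields so that the v1 and v2 layouts differ in size without padding ambiguity; a real header would have to keep that property for every field it appends.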
Rewrite main.c to use prodigal_ctx_t for all state management:
- Replace direct malloc/free of seq/rseq/useq/nodes/genes with prodigal_create()/prodigal_destroy()
- Replace inline training pipeline with prodigal_train()
- Use prodigal_config_t for CLI option passing
- Keep existing FILE* I/O for input parsing and output formatting via prodigal_internal.h access to context internals

Verified byte-identical output against native Prodigal binary for:
- Metagenomic mode: GFF, GBK, SCO, proteins, nucleotides
- Single-genome mode: GFF (with and without training file)
- Training file: binary round-trip identical

Also: update .gitignore for build artifacts (libprodigal.a/so, test_api)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document the full public API with Doxygen-style comments covering:
- Module overview with lifecycle diagram and two usage patterns
- Memory model (context=system malloc, buffers=custom, output=system)
- SOA/AOS output layout and ownership semantics
- Per-function docs: parameters, return values, error codes, thread safety
- Training modes: from sequences, binary blob, fine-grained setters
- Metagenomic mode: lazy initialization, built-in models
- Callback contracts: log_callback, progress_callback cancellation
- Error handling: codes, strerror, last_error, context reuse after errors
- GPL-boundary integration: struct_size, NO_MAIN, build patterns
- Three complete code examples (meta mode, training, custom allocator)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
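As a rough picture of an SOA output held in a single 16-byte aligned backing allocation, the sketch below carves three columns out of one malloc'd block. It is an assumption-laden illustration: `demo_genes_soa_t`, its column names, and the `demo_soa_*` functions are hypothetical and do not reflect the actual output structs in prodigal.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical struct-of-arrays result; NOT the real Prodigal output struct. */
typedef struct {
    void    *backing;  /* the one allocation that owns every column */
    size_t   count;    /* number of rows in each column */
    int32_t *begin;    /* column 1 */
    int32_t *end;      /* column 2 */
    int8_t  *strand;   /* column 3 */
} demo_genes_soa_t;

/* Round n up to the next multiple of 16. */
static size_t demo_align16(size_t n)
{
    return (n + 15u) & ~(size_t)15u;
}

/* Allocate all columns from one block, each starting on a 16-byte boundary.
 * Returns 0 on success, -1 on allocation failure. Overflow checks on
 * count * sizeof(...) are omitted to keep the sketch short. */
static int demo_soa_alloc(demo_genes_soa_t *g, size_t count)
{
    size_t off_begin  = 0;
    size_t off_end    = demo_align16(off_begin + count * sizeof(int32_t));
    size_t off_strand = demo_align16(off_end + count * sizeof(int32_t));
    size_t total      = demo_align16(off_strand + count * sizeof(int8_t));
    unsigned char *raw, *base;

    assert(g != NULL);
    raw = (unsigned char *)malloc(total + 15u);   /* slack for manual alignment */
    if (raw == NULL)
        return -1;
    base = raw + (size_t)((16u - ((uintptr_t)raw & 15u)) & 15u);

    g->backing = raw;
    g->count   = count;
    g->begin   = (int32_t *)(void *)(base + off_begin);
    g->end     = (int32_t *)(void *)(base + off_end);
    g->strand  = (int8_t *)(void *)(base + off_strand);
    return 0;
}

/* Releasing the backing block releases every column at once. */
static void demo_soa_free(demo_genes_soa_t *g)
{
    if (g == NULL)
        return;
    free(g->backing);
    g->backing = NULL;
    g->begin = g->end = NULL;
    g->strand = NULL;
    g->count = 0;
}
```

With this shape the ownership rule is simple: the column pointers are views, and only `backing` is ever freed.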
CI matrix:
- Linux: gcc + clang, full regression suite (all formats, training round-trip)
- macOS: default clang, tests + key regressions
- WASM: Emscripten build + test suite in Node.js
- Windows: MSYS2/MinGW-w64, full build + tests + metagenomic regression

Portability fix in prodigal_api.c:
- Replace bare posix_memalign() with platform-aware aligned allocation:
  - Windows: _aligned_malloc/_aligned_free
  - C11 (non-Apple): aligned_alloc
  - POSIX fallback: posix_memalign
- This was the only library-code blocker for Windows and WASM builds

Platform analysis:
- Library (libprodigal.a): fully portable to Linux, macOS, Windows, WASM. No POSIX-only APIs, no zlib dependency, no file I/O.
- CLI (prodigal binary): requires POSIX for stdin detection (fileno, fstat, S_ISFIFO) and /dev/stdin. Works on Linux, macOS, Windows/MSYS2. Native MSVC CLI would need additional #ifdef _WIN32 guards.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
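The platform-aware aligned allocation described in this commit could look roughly like the sketch below. It follows the three-way split named in the commit message (Windows, C11 non-Apple, POSIX fallback), but the function names `demo_aligned_alloc` and `demo_aligned_free` are hypothetical and the code is not copied from prodigal_api.c.

```c
/* Ask for posix_memalign's declaration under strict -std= modes (non-Windows). */
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200112L
#endif

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#if defined(_WIN32)
#include <malloc.h>   /* _aligned_malloc, _aligned_free */
#endif

/* Hypothetical helper: returns `size` bytes aligned to `alignment`
 * (a power of two), or NULL on failure. */
static void *demo_aligned_alloc(size_t alignment, size_t size)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
#if defined(_WIN32)
    return _aligned_malloc(size, alignment);
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L && !defined(__APPLE__)
    /* C11 aligned_alloc: keep size a multiple of alignment for portability. */
    size_t rounded = (size + alignment - 1) & ~(alignment - 1);
    return aligned_alloc(alignment, rounded);
#else
    /* posix_memalign requires alignment to be a multiple of sizeof(void *). */
    void *p = NULL;
    if (alignment < sizeof(void *))
        alignment = sizeof(void *);
    return posix_memalign(&p, alignment, size) == 0 ? p : NULL;
#endif
}

/* Memory from _aligned_malloc must go back through _aligned_free;
 * the other two branches are released with plain free(). */
static void demo_aligned_free(void *p)
{
#if defined(_WIN32)
    _aligned_free(p);
#else
    free(p);
#endif
}
```

Pairing the allocator with a matching free function matters mostly on Windows, where passing an `_aligned_malloc` pointer to `free()` is undefined.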
…bjects

GPL-boundary compliance: the library must never write to stderr.
- gene.c:54: guard the MAX_GENES fprintf(stderr) with #ifndef PRODIGAL_NO_MAIN
- Makefile: library objects (.lib.o) now compiled with -DPRODIGAL_NO_MAIN, CLI objects (.o) compiled without it. This ensures the static library (libprodigal.a) has the stderr guard active while the CLI binary retains its diagnostic output.
- prodigal_api.c: portable aligned allocation already handles _WIN32

The output-formatting functions (print_genes, write_translations, etc.) in gene.c still contain fprintf(fp, ...) calls that write to a FILE* parameter. These are never invoked through the library API (which returns structured SOA/AOS data). They exist in the compilation unit as dead code from the library's perspective and are stripped by the linker.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
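A guard of the kind described for gene.c might be shaped like the following sketch. Only the PRODIGAL_NO_MAIN macro comes from the commit message; `demo_add_gene`, `DEMO_MAX_GENES`, and the warning text are hypothetical stand-ins, not the code at gene.c:54.

```c
#include <assert.h>
#include <stdio.h>

#define DEMO_MAX_GENES 3   /* hypothetical cap, standing in for MAX_GENES */

/* Hypothetical helper: returns the new gene count, or -1 when the cap is hit.
 * The diagnostic is compiled in only for CLI objects; library objects built
 * with -DPRODIGAL_NO_MAIN stay silent and report the condition via the
 * return value alone. */
static int demo_add_gene(int ngenes)
{
    assert(ngenes >= 0);
    if (ngenes >= DEMO_MAX_GENES) {
#ifndef PRODIGAL_NO_MAIN
        fprintf(stderr, "warning: maximum number of genes (%d) exceeded\n",
                DEMO_MAX_GENES);
#endif
        return -1;
    }
    return ngenes + 1;
}
```

Compiling the same source twice, once with and once without the macro, is what lets one file serve both the library and the CLI.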
Windows fix:
- PRODIGAL_API macro defaulted to __declspec(dllimport) on Windows, causing linker errors (__imp_prodigal_*) when statically linking. Changed to only use dllimport when PRODIGAL_DLL is explicitly defined. Static linking (the default) now uses an empty PRODIGAL_API on Windows.

macOS fix:
- Training file (.bin) is a raw struct dump; struct _training has different padding on ARM64 vs x86_64. macOS CI now uses self-consistent round-trip tests instead of comparing to Linux-generated reference binaries. Text output (GFF, GBK, etc.) is architecture-independent and still compared to reference files.

Also: add -DPRODIGAL_NO_MAIN to WASM emcc builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
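One common way to arrange an export macro so that static linking is the default is sketched below. PRODIGAL_API and PRODIGAL_DLL are the names from the commit message; `PRODIGAL_BUILD_DLL` and `demo_api_probe` are hypothetical additions for illustration, and the actual header may be organized differently.

```c
/* Opt-in DLL decoration: with PRODIGAL_DLL undefined (the static default),
 * PRODIGAL_API expands to nothing, so no __imp_ symbols are referenced. */
#if defined(_WIN32) && defined(PRODIGAL_DLL)
#  if defined(PRODIGAL_BUILD_DLL)            /* hypothetical: set while building the DLL */
#    define PRODIGAL_API __declspec(dllexport)
#  else                                      /* consumer of the DLL */
#    define PRODIGAL_API __declspec(dllimport)
#  endif
#else
#  define PRODIGAL_API
#endif

/* Hypothetical function, used only to show the macro applied to a declaration. */
PRODIGAL_API int demo_api_probe(void);

int demo_api_probe(void)
{
    return 1;
}
```

A consumer of a DLL build would compile with -DPRODIGAL_DLL; everyone else needs no extra flags.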
diff command was not found in the MSYS2 base install. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
While working on the MIINT DuckDB extension, we wanted to add embedded support for Prodigal. The motivation for embedding is to avoid the overhead of going through the filesystem and of the serialization/deserialization steps.
This pull request adds the missing API component, which is a generally useful expansion of the project for anyone wishing to embed Prodigal. To help future-proof its use, we expanded CI to build for Windows and WASM, since DuckDB natively supports these targets.
Prodigal is GPL-licensed, so we cannot embed it directly in MIINT, which is BSD-licensed. To work around this, Prodigal has been added to a GPL-boundary, which will be hooked up to MIINT in the coming days.