Reentrant library API#123

Open
wasade wants to merge 10 commits into hyattpd:GoogleImport from the-miint:v2.6.4-miint

wasade commented Apr 2, 2026

While working on the MIINT DuckDB extension, we wanted to embed Prodigal directly. Embedding avoids the overhead of going through the filesystem and of the serialization/deserialization steps.

This pull request adds the missing library API, which should also be useful to others wishing to embed Prodigal. To help future-proof its use, we expanded CI to build against Windows and WASM, as DuckDB natively supports these targets.

Prodigal is GPL-licensed and MIINT is BSD-licensed, so we cannot embed Prodigal directly in MIINT. To work around this, Prodigal has been added to a GPL-boundary, which will be hooked up to MIINT in the coming days.

wasade and others added 10 commits March 31, 2026 13:22
Introduce a reentrant C library API for Prodigal gene prediction,
enabling embedding without shelling out. Designed for GPL-boundary
integration (Rust FFI, Arrow IPC) following DEVELOPER-GUIDANCE.md.

New files:
- prodigal.h: Public API with opaque context, SOA/AOS output structs,
  config with struct_size versioning, error codes, callbacks, allocator
  hooks, and extern "C" guards
- prodigal_internal.h: Internal context struct definition
- prodigal_api.c: Full implementation of config, context lifecycle,
  buffer-based sequence input, training pipeline, training serialization,
  single-genome and metagenomic gene finding, SOA/AOS extraction with
  16-byte aligned single backing allocation, custom allocator support,
  log/progress callbacks with cancellation, training parameter setters
- test_api.c: 40-test suite covering all phases — validated against
  native Prodigal reference output (22 sequences, exact coordinate match)
- testdata/ground_truth/: Reference outputs from native Prodigal for
  both metagenomic and single-genome modes
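
The `struct_size` versioning mentioned for the config can be illustrated with a minimal sketch. This is not the code in prodigal.h: the type `example_config_t`, its fields, and both functions are hypothetical names chosen for illustration. The idea is that the caller stamps the struct with the size it was compiled against, and the library copies only that many bytes, leaving defaults in any fields the caller does not know about.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical config struct; the real prodigal_config_t fields may differ. */
typedef struct {
    size_t struct_size;  /* sizeof() of the struct the caller was compiled against */
    int    meta_mode;    /* imagine this field existed in the first release */
    int    closed_ends;  /* imagine this field was added in a later release */
} example_config_t;

/* Fill in defaults and record the size known to this translation unit. */
static void example_config_init(example_config_t *cfg) {
    memset(cfg, 0, sizeof *cfg);
    cfg->struct_size = sizeof *cfg;
}

/* Library side: accept a config from an older or newer caller.
 * Returns 0 on success, -1 on invalid input. */
static int example_config_load(example_config_t *dst, const void *src) {
    size_t n;
    if (dst == NULL || src == NULL) return -1;
    memcpy(&n, src, sizeof n);            /* struct_size is always the first member */
    if (n < sizeof(size_t)) return -1;    /* too small to hold even the size field */
    example_config_init(dst);             /* start from defaults */
    if (n > sizeof *dst) n = sizeof *dst; /* newer caller: ignore the unknown tail */
    memcpy(dst, src, n);                  /* older caller: later fields keep defaults */
    dst->struct_size = sizeof *dst;
    return 0;
}
```

With this pattern, a caller built against an older header keeps working after new fields are appended, because the library never reads past the size the caller declared.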

Modified files:
- Makefile: Produces libprodigal.a, libprodigal.so, prodigal CLI, and
  test_api runner. Core objects separated from CLI.
- main.c: Added #ifndef PRODIGAL_NO_MAIN guard for library builds

No changes to algorithm code (node.c, dprog.c, gene.c, sequence.c,
training.c, metagenomic.c, bitmap.c). All existing behavior preserved.

Remaining: Phase 11 (CLI adaptation to use library) and Phase 14
(final regression and cleanup).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rewrite main.c to use prodigal_ctx_t for all state management:
- Replace direct malloc/free of seq/rseq/useq/nodes/genes with
  prodigal_create()/prodigal_destroy()
- Replace inline training pipeline with prodigal_train()
- Use prodigal_config_t for CLI option passing
- Keep existing FILE* I/O for input parsing and output formatting
  via prodigal_internal.h access to context internals

Verified byte-identical output against native Prodigal binary for:
- Metagenomic mode: GFF, GBK, SCO, proteins, nucleotides
- Single-genome mode: GFF (with and without training file)
- Training file: binary round-trip identical

Also: update .gitignore for build artifacts (libprodigal.a/so, test_api)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document the full public API with Doxygen-style comments covering:
- Module overview with lifecycle diagram and two usage patterns
- Memory model (context=system malloc, buffers=custom, output=system)
- SOA/AOS output layout and ownership semantics
- Per-function docs: parameters, return values, error codes, thread safety
- Training modes: from sequences, binary blob, fine-grained setters
- Metagenomic mode: lazy initialization, built-in models
- Callback contracts: log_callback, progress_callback cancellation
- Error handling: codes, strerror, last_error, context reuse after errors
- GPL-boundary integration: struct_size, NO_MAIN, build patterns
- Three complete code examples (meta mode, training, custom allocator)
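
As a rough illustration of the progress-callback cancellation contract listed above, the sketch below assumes a callback that returns nonzero to request cancellation. The callback type, the error codes, and `example_process` are all hypothetical; the actual signatures in prodigal.h may differ.

```c
#include <stddef.h>

/* Hypothetical callback type: return nonzero to request cancellation. */
typedef int (*example_progress_fn)(size_t done, size_t total, void *user_data);

enum { EXAMPLE_OK = 0, EXAMPLE_ERR_CANCELLED = 1 };

/* Stand-in for a long-running library call that polls the callback
 * once per work item and stops as soon as cancellation is requested. */
static int example_process(size_t total, example_progress_fn cb,
                           void *user_data, size_t *processed) {
    size_t i;
    *processed = 0;
    for (i = 0; i < total; i++) {
        if (cb != NULL && cb(i, total, user_data) != 0)
            return EXAMPLE_ERR_CANCELLED;
        (*processed)++;  /* the real per-item work would happen here */
    }
    return EXAMPLE_OK;
}

/* Example callback: cancel once `done` reaches the limit passed via user_data. */
static int cancel_after_limit(size_t done, size_t total, void *user_data) {
    const size_t *limit = (const size_t *)user_data;
    (void)total;
    return done >= *limit;
}
```

Passing state through `user_data` rather than a global keeps the callback compatible with a reentrant API, since each context can carry its own cancellation state.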

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CI matrix:
- Linux: gcc + clang, full regression suite (all formats, training round-trip)
- macOS: default clang, tests + key regressions
- WASM: Emscripten build + test suite in Node.js
- Windows: MSYS2/MinGW-w64, full build + tests + metagenomic regression

Portability fix in prodigal_api.c:
- Replace bare posix_memalign() with platform-aware aligned allocation:
  Windows: _aligned_malloc/_aligned_free
  C11 (non-Apple): aligned_alloc
  POSIX fallback: posix_memalign
- This was the only library-code blocker for Windows and WASM builds
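
The platform-aware selection described above can be sketched as follows. This is an illustration rather than the code in prodigal_api.c; the function names `example_aligned_alloc` and `example_aligned_free` are hypothetical, and the rounding of the size in the C11 branch is an added precaution because C11 originally required the size to be a multiple of the alignment.

```c
#if !defined(_WIN32) && !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200112L  /* expose posix_memalign under strict -std modes */
#endif
#include <stddef.h>
#include <stdlib.h>
#if defined(_WIN32)
#include <malloc.h>
#endif

/* Allocate `size` bytes aligned to `alignment` (a power of two).
 * Returns NULL on failure or on an invalid alignment. */
static void *example_aligned_alloc(size_t alignment, size_t size) {
    if (alignment == 0 || (alignment & (alignment - 1)) != 0) return NULL;
#if defined(_WIN32)
    return _aligned_malloc(size, alignment);
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L && !defined(__APPLE__)
    {
        /* Round up so the size is a multiple of the alignment. */
        size_t rounded = (size + alignment - 1) / alignment * alignment;
        return aligned_alloc(alignment, rounded);
    }
#else
    {
        void *p = NULL;
        if (alignment < sizeof(void *)) alignment = sizeof(void *);
        if (posix_memalign(&p, alignment, size) != 0) return NULL;
        return p;
    }
#endif
}

/* Memory from _aligned_malloc must be released with _aligned_free;
 * the other two branches are released with plain free(). */
static void example_aligned_free(void *p) {
#if defined(_WIN32)
    _aligned_free(p);
#else
    free(p);
#endif
}
```

Pairing the allocator with a matching free function matters on Windows, where passing an `_aligned_malloc` pointer to `free()` is undefined behavior.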

Platform analysis:
- Library (libprodigal.a): fully portable to Linux, macOS, Windows, WASM.
  No POSIX-only APIs, no zlib dependency, no file I/O.
- CLI (prodigal binary): requires POSIX for stdin detection (fileno, fstat,
  S_ISFIFO) and /dev/stdin. Works on Linux, macOS, Windows/MSYS2.
  Native MSVC CLI would need additional #ifdef _WIN32 guards.
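
For readers unfamiliar with the POSIX calls named above, here is a minimal sketch of detecting whether a stream is a pipe using `fileno`, `fstat`, and `S_ISFIFO`. It is not the code from main.c, and the helper name `example_stream_is_fifo` is hypothetical. It compiles only where these POSIX interfaces are available, which is the limitation noted for a native MSVC build.

```c
#if !defined(_POSIX_C_SOURCE)
#define _POSIX_C_SOURCE 200112L  /* expose fileno() under strict -std modes */
#endif
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Returns 1 if the stream is a FIFO/pipe, 0 if it is not,
 * and -1 if the stream is NULL or fstat() fails. */
static int example_stream_is_fifo(FILE *fp) {
    struct stat st;
    if (fp == NULL || fstat(fileno(fp), &st) != 0) return -1;
    return S_ISFIFO(st.st_mode) ? 1 : 0;
}
```

A CLI could call `example_stream_is_fifo(stdin)` to decide whether input is being piped in, while the library itself never needs this check because it takes sequence data from buffers.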

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…bjects

GPL-boundary compliance: the library must never write to stderr.

- gene.c:54: guard the MAX_GENES fprintf(stderr) with #ifndef PRODIGAL_NO_MAIN
- Makefile: library objects (.lib.o) now compiled with -DPRODIGAL_NO_MAIN,
  CLI objects (.o) compiled without it. This ensures the static library
  (libprodigal.a) has the stderr guard active while the CLI binary retains
  its diagnostic output.
- prodigal_api.c: portable aligned allocation already handles _WIN32
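
The guard pattern described above can be sketched like this. The function and the `EXAMPLE_MAX_ITEMS` limit are hypothetical stand-ins, not the actual code at gene.c:54; only the `PRODIGAL_NO_MAIN` macro name comes from this PR.

```c
#include <stdio.h>

#define EXAMPLE_MAX_ITEMS 4  /* hypothetical stand-in for a limit such as MAX_GENES */

/* Append one item to a counter-backed list.
 * Returns 0 on success, -1 if the limit has been reached. */
static int example_add_item(int *count) {
    if (*count >= EXAMPLE_MAX_ITEMS) {
#ifndef PRODIGAL_NO_MAIN
        /* CLI build only: library builds compile this diagnostic out. */
        fprintf(stderr, "warning: maximum number of items (%d) exceeded\n",
                EXAMPLE_MAX_ITEMS);
#endif
        return -1;  /* the caller still learns about the failure via the return code */
    }
    (*count)++;
    return 0;
}
```

Compiling the same source twice, once with `-DPRODIGAL_NO_MAIN` for the library objects and once without it for the CLI objects, gives both behaviors from a single file.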

The output-formatting functions (print_genes, write_translations, etc.) in
gene.c still contain fprintf(fp, ...) calls that write to a FILE* parameter.
These are never invoked through the library API (which returns structured
SOA/AOS data). They exist in the compilation unit as dead code from the
library's perspective and are stripped by the linker.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Windows fix:
- PRODIGAL_API macro defaulted to __declspec(dllimport) on Windows, causing
  linker errors (__imp_prodigal_*) when statically linking. Changed to only
  use dllimport when PRODIGAL_DLL is explicitly defined. Static linking
  (the default) now uses an empty PRODIGAL_API on Windows.
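
A sketch of an export macro with this behavior is shown below. `PRODIGAL_API` and `PRODIGAL_DLL` are named in this commit; `PRODIGAL_BUILD_DLL` and `example_version` are hypothetical additions for illustration, and the actual macro in prodigal.h may be structured differently.

```c
/* Only decorate symbols when a DLL build is explicitly requested. */
#if defined(_WIN32) && defined(PRODIGAL_DLL)
#  if defined(PRODIGAL_BUILD_DLL)  /* hypothetical: defined while building the DLL itself */
#    define PRODIGAL_API __declspec(dllexport)
#  else
#    define PRODIGAL_API __declspec(dllimport)
#  endif
#else
#  define PRODIGAL_API             /* static linking and non-Windows: no decoration */
#endif

/* Hypothetical public function declared through the macro. */
PRODIGAL_API int example_version(void);

/* The definition would normally live in the library's .c file. */
int example_version(void) { return 1; }
```

With an empty default, a consumer linking against the static library references `example_version` directly instead of the `__imp_`-prefixed import symbol that caused the linker errors.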

macOS fix:
- Training file (.bin) is a raw struct dump; struct _training has different
  padding on ARM64 vs x86_64. macOS CI now uses self-consistent round-trip
  tests instead of comparing to Linux-generated reference binaries.
  Text output (GFF, GBK, etc.) is architecture-independent and still
  compared to reference files.
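
To show why a raw struct dump is layout-dependent, the sketch below uses a hypothetical struct (not the real `struct _training`) and reports how many bytes of the dump are compiler-inserted padding. The helper names are also hypothetical.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical struct used only for illustration. */
struct example_training {
    char   flag;    /* 1 byte, typically followed by padding */
    double weight;  /* usually placed on an 8-byte boundary */
    int    count;   /* may be followed by tail padding */
};

/* Bytes in the struct that do not belong to any field. */
static size_t example_padding_bytes(void) {
    return sizeof(struct example_training)
         - (sizeof(char) + sizeof(double) + sizeof(int));
}

/* Raw dump: writes the fields AND the padding, exactly as laid out
 * by the compiler for the current ABI. Returns 1 on success. */
static int example_dump(const struct example_training *t, FILE *fp) {
    return fwrite(t, sizeof *t, 1, fp) == 1;
}
```

On common 64-bit ABIs this struct is typically 24 bytes, of which 11 are padding, but the C standard does not fix those numbers. Any file produced by `example_dump` is therefore only guaranteed to be readable by a binary built with the same layout, which is why a self-consistent round-trip test is the safer comparison.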

Also: add -DPRODIGAL_NO_MAIN to WASM emcc builds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
diff command was not found in the MSYS2 base install.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>