proChariot: An LLM-guided Post-Annotation Synthesizer for prokaryotic genomes

proChariot is a command-line tool and Python package that uses large language models (LLMs) to synthesize and analyze prokaryotic genome annotations generated by tools like Bakta. It aims to give researchers a high-level summary of key genomic features, functional clusters, and potential annotation errors, speeding up the review process.

To limit the risk of LLM hallucination, proChariot is designed to enforce a structured JSON output schema and to incorporate verification steps. Its output is also worded to avoid overstating confidence, so that users are not misled. Always double-check critical findings with traditional methods.

This project is in active, early-stage development (v.0.1.0).

The code is not stable and many features are incomplete. It is not ready for production use. This README outlines the vision and development roadmap.

The Problem

Tools like Bakta and Prokka provide excellent, comprehensive annotations for bacterial genomes. However, their output is often a massive file, making it difficult for a researcher to quickly see the "big picture."

Identifying functional clusters, like pathogenicity islands, resistance operons, or prophages, requires a time-consuming manual review.

The Solution

proChariot is not an annotator. It is a Post-Annotation Synthesizer: an LLM-guided analysis tool for prokaryotic genomes.

It works by:

1. Reading the entire output folder from Bakta (Prokka support is on the roadmap).
2. Resolving the data into functional gene clusters and key features by querying a large language model with a thoroughly engineered system prompt.
3. Returning two files:
   - synthesis_summary.txt: A human-readable high-level summary of the genome's key features.
   - genome_features.json: A structured, machine-readable JSON report suitable for downstream analysis.

The goal is to automate the synthesis step and turn a 2-hour manual review into a 30-second "first-pass" analysis.
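
As an illustration of the first step, the sketch below turns tab-separated annotation rows into JSON-ready dictionaries. It is not proChariot's actual parser (that item is still open on the roadmap). The column names in `BAKTA_COLUMNS` and the function name `parse_bakta_tsv` are assumptions made for this example, and the sample row is invented; check the column order against the header line of your own Bakta .tsv file.

```python
import csv
import io
import json

# Assumed column layout; verify it against the header of your own Bakta .tsv.
BAKTA_COLUMNS = [
    "sequence_id", "type", "start", "stop", "strand",
    "locus_tag", "gene", "product", "db_xrefs",
]


def parse_bakta_tsv(handle):
    """Convert tab-separated annotation rows into a list of dictionaries."""
    records = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines and '#'-prefixed comment or header lines.
        if not row or row[0].startswith("#"):
            continue
        record = dict(zip(BAKTA_COLUMNS, row))
        # Coordinates are more useful downstream as integers.
        record["start"] = int(record["start"])
        record["stop"] = int(record["stop"])
        records.append(record)
    return records


# Invented in-memory example standing in for a real file.
example = io.StringIO(
    "#Sequence Id\tType\tStart\tStop\tStrand\tLocus Tag\tGene\tProduct\tDbXrefs\n"
    "contig_1\tcds\t10\t400\t+\tDEMO_00010\t\thypothetical protein\t\n"
)
features = parse_bakta_tsv(example)
print(json.dumps(features, indent=2))
```

With a real file, the same function could be called as `parse_bakta_tsv(open(path, newline=""))`.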

Pipeline-Ready Output

A core design principle of proChariot is its dual-output system, which separates human-readable insights from machine-readable data.

synthesis_summary.txt (For Humans): A clean, high-level summary designed for a researcher to read quickly.
genome_features.json (For Machines): A stable, hierarchical JSON report. This structured data is interoperable and pipeline-ready, allowing you to:
- Parse the results easily in downstream Python or R scripts.
- Filter the data programmatically (e.g., jq '.contigs[0].potential_features' genome_features.json).
- Enable automated verification (for the planned "Hallucination Detection" feature).

This design means proChariot can function as a synthesis step in a larger, automated bioinformatics pipeline.
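
For a downstream Python script, the same lookup as the jq filter above might look like the sketch below. Only the path `contigs[0].potential_features` comes from this README; every other key (`id`, `name`, `locus_tags`) and the helper `features_for_contig` are placeholders invented for the example, since the final schema is not yet fixed.

```python
import json

# Invented stand-in for genome_features.json. Only the path
# contigs[0].potential_features is taken from the README.
report_text = """
{
  "contigs": [
    {
      "id": "contig_1",
      "potential_features": [
        {"name": "example feature", "locus_tags": ["DEMO_00010", "DEMO_00011"]}
      ]
    }
  ]
}
"""

report = json.loads(report_text)


def features_for_contig(report, index=0):
    """Return the potential features of one contig, or [] if it is absent."""
    contigs = report.get("contigs", [])
    if index >= len(contigs):
        return []
    return contigs[index].get("potential_features", [])


for feature in features_for_contig(report):
    print(feature["name"], feature["locus_tags"])
```

Returning an empty list for a missing contig keeps a pipeline step from crashing on genomes with fewer contigs than expected.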

Why proChariot? The "Superbug" Case Study

proChariot was born from a real-world analysis where standard automated pipelines failed.

In a recent project analyzing a vancomycin-resistant Enterococcus (VRE) E. faecium "superbug," standard AMR screening tools missed a critical, high-copy multidrug-resistant (MDR) plasmid. The threat was only identified through a time-consuming, manual, LLM-guided analysis of the full genome.

proChariot is being built to automate and scale that expert-level analysis. Its role as a pipeline tool is to guide and prioritize the manual review. It turns a multi-hour search into a 30-second automated synthesis that generates both a machine-readable genome_features.json file and a human-readable synthesis_summary.txt, allowing researchers to focus their efforts on verifying the key features proChariot has identified.

Roadmap to first release (v.1.0.0)

[x] Core CLI Structure: Set up pyproject.toml and the cli.py and core.py skeletons.
[ ] TSV Parser: Implement a basic parser for Bakta .tsv files that converts them into JSON.
[ ] Core Engine: Implement the main analyze() function to call the Groq API.
[ ] Prompt Engineering: Develop the initial system prompt to guide the LLM towards accurate, relevant synthesis.
[ ] Structured JSON Output: Enforce a reliable, hierarchical JSON output schema via API-level control.
[ ] Dual-File Output: Implement logic to extract synthesis_summary.txt from the main JSON output.
[ ] Basic Testing: Create simple pytest checks for the CLI.
[ ] Documentation: Write the initial user guide.
[ ] Release: Publish the first version to PyPI (pip install prochariot).
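
The Dual-File Output item could be approached roughly as sketched below: one parsed JSON result is written to both files. This is an assumption-laden illustration, not the planned implementation. The `summary` key and the `write_outputs` helper are hypothetical; only the two file names come from this README.

```python
import json
import tempfile
from pathlib import Path


def write_outputs(result, out_dir):
    """Write the full JSON report and the extracted plain-text summary."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    json_path = out_dir / "genome_features.json"
    summary_path = out_dir / "synthesis_summary.txt"

    json_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
    # "summary" is a hypothetical key; the real schema may name it differently.
    summary_path.write_text(result.get("summary", "").strip() + "\n", encoding="utf-8")
    return json_path, summary_path


# Placeholder result standing in for a validated LLM response.
demo_result = {"summary": "Placeholder summary text.", "contigs": []}

with tempfile.TemporaryDirectory() as tmp:
    json_path, summary_path = write_outputs(demo_result, tmp)
    written_summary = summary_path.read_text(encoding="utf-8")
    written_json = json.loads(json_path.read_text(encoding="utf-8"))
    print(written_summary)
```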

Future Enhancements

For a developing list, see the issues section on GitHub.

[ ] Modular Frameworks: Implement a --framework flag to select different analytical prompts (clinical, metabolic, ecology) from a prompts.yaml file.
[ ] Ensemble Analysis: Add an option to perform "light agentic" consensus reporting.
[ ] Error Detection: Implement checks in the system prompt to identify common annotation errors.
[ ] Hallucination Detection: Add a verification step to cross-check the LLM's claims against the input data.
[ ] Prokka Support: Make the tool compatible with Prokka.
[ ] .gff Parser: Implement a parser for .gff files for richer analysis.
[ ] Distribution: Package for Conda-Forge.
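
One simple form the Hallucination Detection step could take is checking that every locus tag the LLM cites actually exists in the input annotation. The sketch below shows that idea only; the record layout, the `locus_tags` key, and the function name `find_unsupported_locus_tags` are assumptions for illustration, and the data is invented.

```python
def find_unsupported_locus_tags(claimed_features, annotation_records):
    """Map each claimed feature name to the locus tags missing from the input."""
    known = {rec["locus_tag"] for rec in annotation_records if rec.get("locus_tag")}
    unsupported = {}
    for feature in claimed_features:
        missing = [tag for tag in feature.get("locus_tags", []) if tag not in known]
        if missing:
            unsupported[feature.get("name", "<unnamed>")] = missing
    return unsupported


# Invented input annotation and invented LLM claims.
annotation_records = [{"locus_tag": "DEMO_00010"}, {"locus_tag": "DEMO_00011"}]
claimed_features = [
    {"name": "feature A", "locus_tags": ["DEMO_00010", "DEMO_00011"]},
    {"name": "feature B", "locus_tags": ["DEMO_00011", "DEMO_99999"]},
]

flagged = find_unsupported_locus_tags(claimed_features, annotation_records)
print(flagged)  # {'feature B': ['DEMO_99999']}
```

A check like this only catches references to genes that do not exist; it says nothing about whether the interpretation of real genes is correct, so manual review is still needed.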

Example Usage (v.1.0.0)

This shows the intended command for the v.1.0.0 release.

# Run a full analysis on a Bakta output folder, specifying the species
# and providing an optional note for context.
# The tool will save 'genome_features.json' and 'synthesis_summary.txt'
# into the 'my_results' folder.

prochariot -i /path/to/bakta_output_folder/ \
           -o /path/to/my_results/ \
           -s "Enterococcus faecium" \
           -n "Clinical isolate, ST177"
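
A command-line interface accepting these four flags could be declared with argparse along the following lines. This is a sketch, not the contents of cli.py: the short flags come from the command above, while the long option names, help texts, and the choice of which flags are required are assumptions.

```python
import argparse


def build_parser():
    """Build an argument parser mirroring the intended v.1.0.0 command."""
    parser = argparse.ArgumentParser(
        prog="prochariot",
        description="Synthesize a Bakta annotation into a summary and a JSON report.",
    )
    # Long option names below are assumed; only -i, -o, -s, -n appear in the README.
    parser.add_argument("-i", "--input", required=True, help="Bakta output folder")
    parser.add_argument("-o", "--output", required=True, help="Folder for the result files")
    parser.add_argument("-s", "--species", default=None, help="Species name")
    parser.add_argument("-n", "--note", default=None, help="Optional free-text context")
    return parser


# Parse the example command from the README instead of sys.argv.
args = build_parser().parse_args([
    "-i", "/path/to/bakta_output_folder/",
    "-o", "/path/to/my_results/",
    "-s", "Enterococcus faecium",
    "-n", "Clinical isolate, ST177",
])
print(args.species, "|", args.note)
```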

Development Setup

To contribute or run the tool locally in development:

Clone the repository:

git clone https://github.com/DelusionalSimon/prochariot.git
cd prochariot

Create and activate the Conda environment:

conda create --name prochariot_env python=3.10
conda activate prochariot_env

Install the package in editable mode with development dependencies:
```
pip install -e .[dev]
```

Set your API Key:

export GROQ_API_KEY="your-api-key-here"

Test the installation:
```
prochariot --help
```
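
If a run fails right after setup, a missing API key is a likely cause. The sketch below shows one way code could check for the GROQ_API_KEY variable exported in the step above and fail early with a clear message. The helper name `get_api_key` is hypothetical and is not part of the package.

```python
import os


def get_api_key(env=os.environ):
    """Return the Groq API key from the environment or raise a clear error."""
    key = env.get("GROQ_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GROQ_API_KEY is not set; export it before running the analysis."
        )
    return key


# Demonstration with an explicit mapping instead of the real environment.
print(get_api_key({"GROQ_API_KEY": "your-api-key-here"}))
```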

License

Distributed under the MIT License. See LICENSE for more information.
