proChariot is a command-line tool and Python package that leverages large language models (LLMs) to synthesize and analyze prokaryotic genome annotations generated by tools like Bakta. It aims to provide researchers with a high-level summary of key genomic features, functional clusters, and potential errors in the annotation, significantly speeding up the review process.
It is designed to limit LLM hallucination by enforcing a structured JSON output schema and incorporating verification steps, and to phrase its findings cautiously rather than with unwarranted confidence, so that users are not misled. Always double-check critical findings with traditional methods.
This project is in active, early-stage development (v.0.1.0).
The code is not stable and many features are incomplete. It is not ready for production use. This README outlines the vision and development roadmap.
Tools like Bakta and Prokka provide excellent, comprehensive annotations for bacterial genomes. However, their output is often a massive file, making it difficult for a researcher to quickly see the "big picture."
Identifying functional clusters, like pathogenicity islands, resistance operons, or prophages, requires a time-consuming manual review.
proChariot is not an annotator. It is a post-annotation synthesizer: an LLM-guided analysis tool for prokaryotic genomes.
It works by:
- Reading the entire output folder from Bakta (Prokka support is on the roadmap).
- Resolving the data into functional gene clusters and key features by querying a large language model with a thoroughly engineered system prompt.
- Returning two files:
  - synthesis_summary.txt: A human-readable high-level summary of the genome's key features.
  - genome_features.json: A structured, machine-readable JSON report suitable for downstream analysis.
The goal is to automate the synthesis step and turn a 2-hour manual review into a 30-second "first-pass" analysis.
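The first of the steps above, reading Bakta's tabular output, could look roughly like the sketch below. It is not proChariot's actual parser: the function names are hypothetical, the column names are an assumption about the Bakta `.tsv` layout that should be checked against your Bakta version, and the sample rows are invented for illustration.

```python
import csv
import io
import json

# Assumed column order for a Bakta-style TSV; verify against your Bakta version.
COLUMNS = ["sequence_id", "type", "start", "stop", "strand",
           "locus_tag", "gene", "product", "dbxrefs"]


def parse_bakta_tsv(handle):
    """Convert Bakta-style TSV rows into dicts, skipping '#' comment lines."""
    rows = (line for line in handle if line.strip() and not line.startswith("#"))
    records = []
    for fields in csv.reader(rows, delimiter="\t", quoting=csv.QUOTE_NONE):
        # Pad short rows so every record carries the same keys.
        fields = fields + [""] * (len(COLUMNS) - len(fields))
        record = dict(zip(COLUMNS, fields))
        record["start"] = int(record["start"])
        record["stop"] = int(record["stop"])
        records.append(record)
    return records


def group_by_contig(records):
    """Group feature records by the contig (sequence) they sit on."""
    contigs = {}
    for record in records:
        contigs.setdefault(record["sequence_id"], []).append(record)
    return contigs


# Invented sample rows, for illustration only.
SAMPLE = (
    "#Sequence Id\tType\tStart\tStop\tStrand\tLocus Tag\tGene\tProduct\tDbXrefs\n"
    "contig_1\tcds\t100\t1200\t+\tABC_00001\tvanA\tD-alanine--(R)-lactate ligase VanA\t\n"
    "contig_1\tcds\t1300\t2200\t+\tABC_00002\tvanH\tD-lactate dehydrogenase VanH\t\n"
    "contig_2\tcds\t50\t950\t-\tABC_00003\t\thypothetical protein\t\n"
)

records = parse_bakta_tsv(io.StringIO(SAMPLE))
contigs = group_by_contig(records)
print(json.dumps({name: len(features) for name, features in contigs.items()}))
# prints {"contig_1": 2, "contig_2": 1}
```

Grouping by contig before prompting keeps each feature tied to its sequence, which matters when the interesting signal is a cluster of neighbouring genes on a plasmid.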
A core design principle of proChariot is its dual-output system, which separates human-readable insights from machine-readable data.
- synthesis_summary.txt (For Humans): A clean, high-level summary designed for a researcher to read quickly.
- genome_features.json (For Machines): A stable, hierarchical JSON report. This structured data is interoperable and pipeline-ready, allowing you to:
  - Parse the results easily in downstream Python or R scripts.
  - Filter the data programmatically (e.g., jq '.contigs[0].potential_features' genome_features.json).
  - Enable automated verification (for the planned "Hallucination Detection" feature).
This design means proChariot can function as a synthesis step in a larger, automated bioinformatics pipeline.
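As a rough illustration of the dual-output idea, the sketch below writes both files from one report and then filters the JSON side in Python. Only the `contigs[...].potential_features` path appears in this README; the `summary`, `id`, `name`, and `category` keys, the helper names, and the example values are assumptions made for the example, not the final schema.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical report. Only `contigs[...].potential_features` is taken from the
# README; every other key and value is an illustrative assumption.
EXAMPLE_REPORT = {
    "summary": "Illustrative summary text.",
    "contigs": [
        {
            "id": "contig_1",
            "potential_features": [
                {"name": "example resistance operon", "category": "amr"},
                {"name": "example prophage region", "category": "prophage"},
            ],
        },
    ],
}


def write_outputs(report, out_dir):
    """Write the machine-readable JSON and the human-readable summary side by side."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "genome_features.json").write_text(
        json.dumps(report, indent=2), encoding="utf-8")
    (out_dir / "synthesis_summary.txt").write_text(
        report.get("summary", "") + "\n", encoding="utf-8")


def features_by_category(report, category):
    """Return (contig id, feature) pairs whose 'category' matches."""
    hits = []
    for contig in report.get("contigs", []):
        for feature in contig.get("potential_features", []):
            if feature.get("category") == category:
                hits.append((contig.get("id"), feature))
    return hits


with tempfile.TemporaryDirectory() as tmp:
    write_outputs(EXAMPLE_REPORT, tmp)
    reloaded = json.loads(
        (Path(tmp) / "genome_features.json").read_text(encoding="utf-8"))
    summary_text = (Path(tmp) / "synthesis_summary.txt").read_text(encoding="utf-8")

amr_hits = features_by_category(reloaded, "amr")
print(amr_hits)
```

Using `.get()` with defaults keeps the filter tolerant of contigs that report no features, which is likely while the schema is still changing.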
proChariot was born from a real-world analysis where standard automated pipelines failed.
In a recent project analyzing a VRE E. faecium "superbug," standard AMR screening tools missed a critical, high-copy Multi-Drug Resistant (MDR) plasmid. The threat was only identified through a time-consuming, manual, LLM-guided analysis of the full genome.
proChariot is being built to automate and scale that expert-level analysis. Its role as a pipeline tool is to guide and prioritize the manual review. It turns a multi-hour search into a 30-second automated synthesis that generates both a machine-readable genome_features.json file and a human-readable synthesis_summary.txt, allowing researchers to focus their efforts on verifying the key features proChariot has identified.
- [x] Core CLI Structure: Set up pyproject.toml, cli.py CLI, and core.py skeletons.
- [ ] TSV Parser: Implement a basic parser for Bakta .tsv files that converts them into JSON.
- [ ] Core Engine: Implement the main analyze() function to call the Groq API.
- [ ] Prompt Engineering: Develop the initial system prompt to guide the LLM towards accurate, relevant synthesis.
- [ ] Structured JSON Output: Enforce a reliable, hierarchical JSON output schema via API-level control.
- [ ] Dual-File Output: Implement logic to extract synthesis_summary.txt from the main JSON output.
- [ ] Basic Testing: Create simple pytest checks for the CLI.
- [ ] Documentation: Write the initial user guide.
- [ ] Release: Publish the first version to PyPI (pip install prochariot).
For a developing list, see the issues section on GitHub.
- [ ] Modular Frameworks: Implement a --framework flag to select different analytical prompts (clinical, metabolic, ecology) from a prompts.yaml file.
- [ ] Ensemble Analysis: Add an option to perform "light agentic" consensus reporting.
- [ ] Error Detection: Implement checks in the system prompt to identify common annotation errors.
- [ ] Hallucination Detection: Add a verification step to cross-check the LLM's claims against the input data.
- [ ] Prokka Support: Make the tool compatible with Prokka.
- [ ] .gff Parser: Implement a parser for .gff files for richer analysis.
- [ ] Distribution: Package for Conda-Forge.
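One possible shape for the planned hallucination check is sketched below: every locus tag the LLM cites must exist in the annotation that was fed in. This is one idea under stated assumptions, not the planned implementation. The `locus_tags` and `name` keys, the function name, and the example report are all hypothetical.

```python
def find_unsupported_claims(report, known_locus_tags):
    """Flag features that cite no locus tags, or cite tags absent from the input."""
    unsupported = []
    for contig in report.get("contigs", []):
        for feature in contig.get("potential_features", []):
            cited = set(feature.get("locus_tags", []))
            missing = sorted(cited - known_locus_tags)
            # A claim with no evidence at all is as suspect as one citing unknown tags.
            if missing or not cited:
                unsupported.append(
                    {"feature": feature.get("name"), "missing": missing})
    return unsupported


# Locus tags that actually occur in the (invented) annotation input.
known = {"ABC_00001", "ABC_00002", "ABC_00003"}

# Hypothetical LLM report: the second feature cites a tag that was never in the input.
report = {
    "contigs": [
        {
            "id": "contig_1",
            "potential_features": [
                {"name": "example operon", "locus_tags": ["ABC_00001", "ABC_00002"]},
                {"name": "invented island", "locus_tags": ["ABC_09999"]},
            ],
        },
    ],
}

flagged = find_unsupported_claims(report, known)
print(flagged)
# prints [{'feature': 'invented island', 'missing': ['ABC_09999']}]
```

A check like this only confirms that the cited genes exist. It says nothing about whether the functional interpretation is right, so manual review is still needed.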
This shows the intended command for the v.1.0.0 release.
# Run a full analysis on a Bakta output folder, specifying the species
# and providing an optional note for context.
# The tool will save 'genome_features.json' and 'synthesis_summary.txt'
# into the 'my_results' folder.
prochariot -i /path/to/bakta_output_folder/ \
-o /path/to/my_results/ \
-s "Enterococcus faecium" \
-n "Clinical isolate, ST177"
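For contributors, the command above could be backed by an argument parser along the lines of this sketch. The short flags come from the example; the long option names, help texts, and which arguments are required are assumptions, and the real cli.py may differ.

```python
import argparse


def build_parser():
    """Build a parser for the short flags shown in the usage example."""
    parser = argparse.ArgumentParser(
        prog="prochariot",
        description="LLM-guided synthesis of prokaryotic genome annotations.",
    )
    # Long option names below are assumptions; only -i/-o/-s/-n appear in the README.
    parser.add_argument("-i", "--input", required=True,
                        help="Path to the Bakta output folder.")
    parser.add_argument("-o", "--output", required=True,
                        help="Folder for genome_features.json and synthesis_summary.txt.")
    parser.add_argument("-s", "--species", default=None,
                        help="Species name, e.g. 'Enterococcus faecium'.")
    parser.add_argument("-n", "--note", default=None,
                        help="Optional free-text context for the analysis.")
    return parser


# Parse the same arguments as the shell example, without touching sys.argv.
args = build_parser().parse_args([
    "-i", "/path/to/bakta_output_folder/",
    "-o", "/path/to/my_results/",
    "-s", "Enterococcus faecium",
    "-n", "Clinical isolate, ST177",
])
print(args.species, "|", args.note)
# prints Enterococcus faecium | Clinical isolate, ST177
```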
To contribute or run the tool locally in development:
- Clone the repository:
    git clone https://github.com/DelusionalSimon/prochariot.git
    cd prochariot
- Create and activate the Conda environment:
    conda create --name prochariot_env python=3.10
    conda activate prochariot_env
- Install the package in editable mode with development dependencies:
    pip install -e ".[dev]"
- Set your API Key:
    export GROQ_API_KEY="your-api-key-here"
- Test the installation:
    prochariot --help
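Since the tool depends on the GROQ_API_KEY environment variable, a small guard like the sketch below could fail early with a readable message instead of a late API error. The function name and error text are hypothetical, and the sketch makes no call to any API.

```python
import os


def get_api_key(env=None, name="GROQ_API_KEY"):
    """Return the API key from the environment, or raise a clear error if unset."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running prochariot.")
    return key


# Passing a dict instead of os.environ keeps the example self-contained.
print(get_api_key({"GROQ_API_KEY": "your-api-key-here"}))
# prints your-api-key-here
```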
Distributed under the MIT License. See LICENSE for more information.