A multi-stage pipeline for extracting Metal-Organic Polyhedra (MOPs) information from scientific papers using MCP-enhanced LLM agents, producing structured knowledge graphs (TTL).
- Python 3.11+
- (Recommended) WSL on Windows for a smoother Linux-like environment
- Docker (only if you use MCP tools that require it; some tools are stdio-only)
```bash
# venv
python -m venv .venv
source .venv/bin/activate  # Windows PowerShell: .venv\Scripts\Activate.ps1

# or conda
conda create -n mcp_layer python=3.11
conda activate mcp_layer

pip install -r requirements.txt
```

This repo git-ignores many runtime folders (caches, logs, and generated prompts/scripts).
Some modules (notably `models/locations.py`) require these directories to exist at import time. Run:

```bash
python scripts/bootstrap_repo.py
```

If you plan to run grounding/lookup agents, also create the grounding-cache folders:

```bash
python scripts/bootstrap_repo.py --with-grounding-cache ontospecies
```

Next, copy the MCP config template:

```bash
cp configs/mcp_configs.json.example configs/mcp_configs.json
```

Then edit `configs/mcp_configs.json` to reflect your local environment (paths, server commands).
This repo does not ship a committed `.env.example`. Create `.env` in the repo root with whatever your environment expects. At minimum, many agents expect something like:

```bash
API_KEY=...
BASE_URL=...
```

The exact keys depend on your `models/ModelConfig.py` / `models/LLMCreator.py` configuration.
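For reference, a minimal sketch of how an agent might read these keys, assuming `python-dotenv` is installed (only `API_KEY` and `BASE_URL` come from above; the rest is illustrative):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv

load_dotenv()  # pull variables from .env in the repo root into os.environ

api_key = os.environ["API_KEY"]        # fail fast if missing
base_url = os.environ.get("BASE_URL")  # optional; None if unset

print(f"Using endpoint: {base_url!r}")
```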
After `python scripts/bootstrap_repo.py`, you should have (among others):

- `data/` (runtime data, cached results; gitignored)
- `data/log/` (required; some modules error if missing)
- `data/ontologies/` (place ontology T-Box TTLs here)
- `data/grounding_cache/<ontology>/labels` (optional; for Script C fuzzy lookup)
- `raw_data/` (PDF inputs; gitignored)
- `sandbox/` (scratch scripts; gitignored)
- `ai_generated_contents*/` (LLM-generated artifacts; gitignored)
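For orientation, creating these folders idempotently is all a bootstrap step needs to do; a minimal sketch (do not treat this as the actual contents of `scripts/bootstrap_repo.py`):

```python
from pathlib import Path

# Directories the runtime expects (see the list above); the optional
# grounding-cache folders are only created with --with-grounding-cache.
REQUIRED_DIRS = ["data", "data/log", "data/ontologies", "raw_data", "sandbox"]

def bootstrap(root: Path = Path(".")) -> None:
    """Create required runtime directories; safe to run repeatedly."""
    for rel in REQUIRED_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)

if __name__ == "__main__":
    bootstrap()
```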
There are two “layers”:
- Ontology-specific MCP lookup server (generated for a given ontology)
- Grounding consumer agent that applies mappings to TTLs
This repo includes `configs/grounding.json` to run the OntoSpecies lookup server via stdio. The grounding consumer agent lives at `src/agents/grounding/grounding_agent.py`.
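To illustrate what the generated lookup server amounts to, here is a minimal stdio MCP server sketch using the official Python SDK's `FastMCP` (illustrative only; the generated servers are ontology-specific and richer, and the label data shown is made up):

```python
from difflib import get_close_matches

from mcp.server.fastmcp import FastMCP  # official MCP Python SDK

mcp = FastMCP("ontology-lookup")

# Hypothetical label index; a real server would load it from the
# grounding cache (e.g. data/grounding_cache/<ontology>/labels).
LABELS = {"benzene": "https://example.org/ontospecies#Benzene"}

@mcp.tool()
def lookup(label: str) -> str:
    """Return the IRI whose label best matches the query, or '' if none."""
    match = get_close_matches(label.lower(), list(LABELS), n=1)
    return LABELS[match[0]] if match else ""

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```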
- Single file:

```bash
python -m src.agents.grounding.grounding_agent --ttl path/to/file.ttl --write-grounded-ttl
```

- Batch folder (recursively processes `*.ttl`, skipping `*_grounded.ttl` and `*link.ttl`):

```bash
python -m src.agents.grounding.grounding_agent --batch-dir evaluation/data/merged_tll --write-grounded-ttl
```

Notes:

- Internal merge (deduplicating identical nodes across TTLs) runs by default in batch mode; disable it with `--no-internal-merge`.
- The default grounding materialization mode is `replace` (replaces `source_iri` with `grounded_iri`). You can switch to `sameas` with `--grounding-mode sameas`; see the sketch below for what the two modes mean.
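To make the two modes concrete, here is a minimal `rdflib` sketch of the difference (illustrative only; the actual agent implementation may differ):

```python
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

def materialize(g: Graph, source_iri: URIRef, grounded_iri: URIRef, mode: str = "replace") -> None:
    """Apply one grounding mapping to a graph.

    replace: rewrite every subject/object occurrence of source_iri to grounded_iri.
    sameas:  keep source_iri and assert owl:sameAs to grounded_iri.
    """
    if mode == "replace":
        for s, p, o in list(g):  # snapshot, since we mutate while iterating
            s2 = grounded_iri if s == source_iri else s
            o2 = grounded_iri if o == source_iri else o
            if (s2, o2) != (s, o):
                g.remove((s, p, o))
                g.add((s2, p, o2))
    elif mode == "sameas":
        g.add((source_iri, OWL.sameAs, grounded_iri))
```

In short, `replace` leaves a graph that mentions only the grounded IRIs, while `sameas` preserves the source IRIs (and their provenance) at the cost of extra triples.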
The main pipeline entrypoint is `mop_main.py` (see its CLI help):

```bash
python mop_main.py --help
```

Use the following canonical Python entrypoints to generate plans, prompts, and MCP scripts.
```bash
python -m src.agents.scripts_and_prompts_generation.task_division_agent \
  --tbox data/ontologies/ontosynthesis.ttl \
  --output configs/task_division_plan.json \
  --model gpt-5
```

```bash
python -m src.agents.scripts_and_prompts_generation.task_prompt_creation_agent \
  --version 1 \
  --plan configs/task_division_plan.json \
  --tbox data/ontologies/ontosynthesis.ttl \
  --model gpt-4.1 \
  --parallel 3
```
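An aside on `--parallel 3`: it bounds how many prompt-generation LLM calls run concurrently. A minimal sketch of the usual pattern, assuming `asyncio` and a hypothetical `generate_prompt` coroutine (not the repo's actual code):

```python
import asyncio

async def generate_prompt(task_id: int) -> str:
    """Hypothetical stand-in for one LLM call."""
    await asyncio.sleep(0.1)
    return f"prompt-{task_id}"

async def run_all(task_ids: list[int], parallel: int = 3) -> list[str]:
    sem = asyncio.Semaphore(parallel)  # caps in-flight LLM calls

    async def bounded(tid: int) -> str:
        async with sem:
            return await generate_prompt(tid)

    return await asyncio.gather(*(bounded(t) for t in task_ids))

if __name__ == "__main__":
    print(asyncio.run(run_all(list(range(10)))))
```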
Legacy plan-driven mode (matches the intent of the old `run_extraction_prompt_creation.sh`):

```bash
python -m src.agents.scripts_and_prompts_generation.task_extraction_prompt_creation_agent \
  --version 1 \
  --plan configs/task_division_plan.json \
  --tbox data/ontologies/ontosynthesis.ttl \
  --model gpt-5 \
  --parallel 3
```

Iterations-driven mode (uses ontology flags plus `ai_generated_contents_candidate/iterations/**/iterations.json`):
```bash
python -m src.agents.scripts_and_prompts_generation.task_extraction_prompt_creation_agent \
  --ontosynthesis \
  --version 1 \
  --model gpt-5 \
  --parallel 3
```

4) Generate MCP underlying scripts from T-Box (writes into `ai_generated_contents_candidate/scripts/…`)
All ontologies from `ai_generated_contents/meta_task_config.json`:

```bash
python -m src.agents.scripts_and_prompts_generation.mcp_underlying_script_creation_agent --all
```

Single ontology (by short name or by TTL path):

```bash
python -m src.agents.scripts_and_prompts_generation.mcp_underlying_script_creation_agent \
  --ontology ontosynthesis \
  --model gpt-5 \
  --split
```

These convenience wrappers help you (a) regenerate the full “pipeline artefacts” and (b) reset the workspace back to a clean state.
- Generates candidate artefacts via `generation_main` (iterations, prompts, MCP scripts, generated MCP config)
- Generates top-entity parsing SPARQL (writes into `ai_generated_contents/`)
- Promotes candidate prompts + iterations into `ai_generated_contents/` (what the runtime pipeline reads by default)
- Rewires runtime MCP configs to use the newly generated MCP servers
```bash
bash scripts/rebuild_pipeline_artifacts.sh
```

Optional flags:

```bash
bash scripts/rebuild_pipeline_artifacts.sh --model gpt-5
bash scripts/rebuild_pipeline_artifacts.sh --direct --model gpt-4o
bash scripts/rebuild_pipeline_artifacts.sh --model gpt-5.2 --test
bash scripts/rebuild_pipeline_artifacts.sh --no-promote
bash scripts/rebuild_pipeline_artifacts.sh --no-rewire-mcp
```

Main-only (reuse existing candidate scripts, regenerate only `main.py`):

```bash
bash scripts/rebuild_pipeline_artifacts.sh --test --ontology ontosynthesis --model gpt-4.1 --main-only
```

Notes:
- Script generation is direct by default (no MCP/Docker required for code output). To force agent/MCP script generation (which requires Docker), run the Python entrypoint with `--agent-scripts`.
If you already have a generated MCP server and want the KG construction pipeline to use it without rerunning any LLM generation, use:
```bash
# Use the already-generated *candidate* MCP server for ontosynthesis
python scripts/rewire_pipeline_mcp.py \
  --ontology ontosynthesis \
  --tree candidate \
  --mcp-set run_created_mcp.json \
  --update-meta-task
```

To switch back to the production tree (`ai_generated_contents/`):
```bash
python scripts/rewire_pipeline_mcp.py \
  --ontology ontosynthesis \
  --tree production \
  --mcp-set run_created_mcp.json \
  --update-meta-task
```

Notes:

- This updates `configs/run_created_mcp.json` (and optionally `configs/meta_task/meta_task_config.json`) and writes timestamped `.bak.*` backups.
- This does not generate or modify any MCP code; it only changes which module is launched for `llm_created_mcp`.
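For intuition, the backup-then-rewrite pattern such a rewire step follows, sketched in Python (illustrative, not the actual `scripts/rewire_pipeline_mcp.py`; the config shape is hypothetical, only the `llm_created_mcp` key name comes from the notes above):

```python
import json
import shutil
import time
from pathlib import Path

def rewire(config_path: Path, module: str) -> None:
    """Point the llm_created_mcp entry at a new module, keeping a backup."""
    # Timestamped backup, e.g. run_created_mcp.json.bak.1712345678
    backup = config_path.parent / f"{config_path.name}.bak.{int(time.time())}"
    shutil.copy2(config_path, backup)

    config = json.loads(config_path.read_text())
    # Hypothetical shape: {"llm_created_mcp": {"module": "..."}}
    config.setdefault("llm_created_mcp", {})["module"] = module
    config_path.write_text(json.dumps(config, indent=2))

if __name__ == "__main__":
    rewire(Path("configs/run_created_mcp.json"),
           "ai_generated_contents.scripts.ontosynthesis.main")  # hypothetical module path
```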
- Dry-run first (prints what would be deleted):

```bash
bash scripts/cleanup_results_and_raw_data.sh
```

- Actually delete (irreversible):

```bash
bash scripts/cleanup_results_and_raw_data.sh --real
```

By default it keeps only the DOI mapped to hash `0c57bac8` in `raw_data/`. You can override:

```bash
bash scripts/cleanup_results_and_raw_data.sh --keep-hash 0c57bac8 --real
```
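For intuition, the dry-run-by-default pattern the cleanup script follows, sketched in Python (illustrative only; `scripts/cleanup_results_and_raw_data.sh` is the real tool and its selection logic may differ):

```python
import argparse
import shutil
from pathlib import Path

def cleanup(raw_data: Path, keep_hash: str, real: bool) -> None:
    """Delete everything under raw_data except entries matching keep_hash."""
    for entry in sorted(raw_data.iterdir()):
        if keep_hash in entry.name:
            continue  # keep the mapped DOI
        if real:
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()
            print(f"deleted {entry}")
        else:
            print(f"would delete {entry}")  # dry run: report only

if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("--keep-hash", default="0c57bac8")
    ap.add_argument("--real", action="store_true")  # default is dry run
    args = ap.parse_args()
    cleanup(Path("raw_data"), args.keep_hash, args.real)
```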
The recommended pre-flight check before running any expensive generation:

```bash
bash scripts/test_generation_pipeline.sh
```

This makes one small LLM call to generate a tiny Python file and verifies that it compiles:

```bash
bash scripts/test_llm_smoke.sh gpt-5.2
```
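Under the hood, this kind of smoke test is just one round-trip plus a `compile()` check. A minimal sketch, assuming an OpenAI-compatible endpoint configured via the `API_KEY`/`BASE_URL` from `.env` (model name and prompt are placeholders):

```python
import os

from openai import OpenAI  # assumes the openai>=1.0 client

client = OpenAI(api_key=os.environ["API_KEY"], base_url=os.environ.get("BASE_URL"))

# One tiny call: ask for a trivial Python snippet...
resp = client.chat.completions.create(
    model="gpt-5.2",  # placeholder model name
    messages=[{"role": "user", "content": "Write a one-line Python print statement. Code only, no Markdown."}],
)
code = resp.choices[0].message.content or ""

# ...and verify it at least compiles (real scripts usually strip
# Markdown fences first; compile() raises SyntaxError on failure).
compile(code, "<llm-smoke>", "exec")
print("smoke test passed")
```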