A multi-stage pipeline for extracting Metal-Organic Polyhedra (MOPs) information from scientific papers using MCP-enhanced LLM agents, producing structured knowledge graphs (TTL).
- Python 3.11+
- (Recommended) WSL on Windows for a smoother Linux-like environment
- Docker (only if you use MCP tools that require it; some tools are stdio-only)
```bash
# venv
python -m venv .venv
source .venv/bin/activate  # Windows PowerShell: .venv\Scripts\Activate.ps1

# or conda
conda create -n mcp_layer python=3.11
conda activate mcp_layer

pip install -r requirements.txt
```

This repo git-ignores many runtime folders (caches, logs, and generated prompts/scripts).
Some modules (notably `models/locations.py`) require these directories to exist at import time. Run:

```bash
python scripts/bootstrap_repo.py
```

If you plan to run grounding/lookup agents, also create the grounding-cache folders:

```bash
python scripts/bootstrap_repo.py --with-grounding-cache ontospecies
```

Next, copy the MCP config template:

```bash
cp configs/mcp_configs.json.example configs/mcp_configs.json
```

Then edit `configs/mcp_configs.json` to reflect your local environment (paths, server commands).
This repo does not ship a committed `.env.example`. Create `.env` in the repo root with whatever your environment expects. At minimum, many agents expect something like:

```bash
API_KEY=...
BASE_URL=...
```

The exact keys depend on your `models/ModelConfig.py` / `models/LLMCreator.py` configuration.
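For reference, a minimal sketch of how an agent might read these keys, assuming `python-dotenv` is installed (only `API_KEY` and `BASE_URL` come from above; the rest is illustrative):

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv

load_dotenv()  # pull variables from .env in the repo root into os.environ

api_key = os.environ["API_KEY"]        # fail fast if missing
base_url = os.environ.get("BASE_URL")  # optional; None if unset

print(f"Using endpoint: {base_url!r}")
```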
After `python scripts/bootstrap_repo.py`, you should have (among others):

- `data/` (runtime data, cached results; gitignored)
- `data/log/` (required; some modules error if missing)
- `data/ontologies/` (place ontology T-Box TTLs here)
- `data/grounding_cache/<ontology>/labels` (optional; for Script C fuzzy lookup)
- `raw_data/` (PDF inputs; gitignored)
- `sandbox/` (scratch scripts; gitignored)
- `ai_generated_contents*/` (LLM-generated artifacts; gitignored)
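For orientation, creating these folders idempotently is all a bootstrap step needs to do; a minimal sketch (do not treat this as the actual contents of `scripts/bootstrap_repo.py`):

```python
from pathlib import Path

# Directories the runtime expects (see the list above); the optional
# grounding-cache folders are only created with --with-grounding-cache.
REQUIRED_DIRS = ["data", "data/log", "data/ontologies", "raw_data", "sandbox"]

def bootstrap(root: Path = Path(".")) -> None:
    """Create required runtime directories; safe to run repeatedly."""
    for rel in REQUIRED_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)

if __name__ == "__main__":
    bootstrap()
```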
There are two “layers”:
- Ontology-specific MCP lookup server (generated for a given ontology)
- Grounding consumer agent that applies mappings to TTLs
This repo includes `configs/grounding.json` to run the OntoSpecies lookup server via stdio. The grounding consumer agent lives at `src/agents/grounding/grounding_agent.py`.
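To illustrate what the generated lookup server amounts to, here is a minimal stdio MCP server sketch using the official Python SDK's `FastMCP` (illustrative only; the generated servers are ontology-specific and richer, and the label data shown is made up):

```python
from difflib import get_close_matches

from mcp.server.fastmcp import FastMCP  # official MCP Python SDK

mcp = FastMCP("ontology-lookup")

# Hypothetical label index; a real server would load it from the
# grounding cache (e.g. data/grounding_cache/<ontology>/labels).
LABELS = {"benzene": "https://example.org/ontospecies#Benzene"}

@mcp.tool()
def lookup(label: str) -> str:
    """Return the IRI whose label best matches the query, or '' if none."""
    match = get_close_matches(label.lower(), list(LABELS), n=1)
    return LABELS[match[0]] if match else ""

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```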
- Single file:

```bash
python -m src.agents.grounding.grounding_agent --ttl path/to/file.ttl --write-grounded-ttl
```

- Batch folder (recursively processes `*.ttl`, skipping `*_grounded.ttl` and `*link.ttl`):

```bash
python -m src.agents.grounding.grounding_agent --batch-dir evaluation/data/merged_tll --write-grounded-ttl
```

Notes:

- Internal merge (deduplicating identical nodes across TTLs) runs by default in batch mode; disable it with `--no-internal-merge`.
- The default grounding materialization mode is `replace` (replaces `source_iri` with `grounded_iri`). You can switch to `sameas` with `--grounding-mode sameas`; see the sketch below for what the two modes mean.
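To make the two modes concrete, here is a minimal `rdflib` sketch of the difference (illustrative only; the actual agent implementation may differ):

```python
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

def materialize(g: Graph, source_iri: URIRef, grounded_iri: URIRef, mode: str = "replace") -> None:
    """Apply one grounding mapping to a graph.

    replace: rewrite every subject/object occurrence of source_iri to grounded_iri.
    sameas:  keep source_iri and assert owl:sameAs to grounded_iri.
    """
    if mode == "replace":
        for s, p, o in list(g):  # snapshot, since we mutate while iterating
            s2 = grounded_iri if s == source_iri else s
            o2 = grounded_iri if o == source_iri else o
            if (s2, o2) != (s, o):
                g.remove((s, p, o))
                g.add((s2, p, o2))
    elif mode == "sameas":
        g.add((source_iri, OWL.sameAs, grounded_iri))
```

In short, `replace` leaves a graph that mentions only the grounded IRIs, while `sameas` preserves the source IRIs (and their provenance) at the cost of extra triples.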
The main pipeline entrypoint is `mop_main.py` (see its CLI help):

```bash
python mop_main.py --help
```

Use the following canonical Python entrypoints to generate plans, prompts, and MCP scripts.
```bash
python -m src.agents.scripts_and_prompts_generation.task_division_agent \
  --tbox data/ontologies/ontosynthesis.ttl \
  --output configs/task_division_plan.json \
  --model gpt-5
```

```bash
python -m src.agents.scripts_and_prompts_generation.task_prompt_creation_agent \
  --version 1 \
  --plan configs/task_division_plan.json \
  --tbox data/ontologies/ontosynthesis.ttl \
  --model gpt-4.1 \
  --parallel 3
```
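An aside on `--parallel 3`: it bounds how many prompt-generation LLM calls run concurrently. A minimal sketch of the usual pattern, assuming `asyncio` and a hypothetical `generate_prompt` coroutine (not the repo's actual code):

```python
import asyncio

async def generate_prompt(task_id: int) -> str:
    """Hypothetical stand-in for one LLM call."""
    await asyncio.sleep(0.1)
    return f"prompt-{task_id}"

async def run_all(task_ids: list[int], parallel: int = 3) -> list[str]:
    sem = asyncio.Semaphore(parallel)  # caps in-flight LLM calls

    async def bounded(tid: int) -> str:
        async with sem:
            return await generate_prompt(tid)

    return await asyncio.gather(*(bounded(t) for t in task_ids))

if __name__ == "__main__":
    print(asyncio.run(run_all(list(range(10)))))
```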
Legacy plan-driven mode (matches the intent of the old `run_extraction_prompt_creation.sh`):

```bash
python -m src.agents.scripts_and_prompts_generation.task_extraction_prompt_creation_agent \
  --version 1 \
  --plan configs/task_division_plan.json \
  --tbox data/ontologies/ontosynthesis.ttl \
  --model gpt-5 \
  --parallel 3
```

Iterations-driven mode (uses ontology flags plus `ai_generated_contents_candidate/iterations/**/iterations.json`):
```bash
python -m src.agents.scripts_and_prompts_generation.task_extraction_prompt_creation_agent \
  --ontosynthesis \
  --version 1 \
  --model gpt-5 \
  --parallel 3
```

4) Generate MCP underlying scripts from T-Box (writes into `ai_generated_contents_candidate/scripts/…`)
All ontologies from `ai_generated_contents/meta_task_config.json`:

```bash
python -m src.agents.scripts_and_prompts_generation.mcp_underlying_script_creation_agent --all
```

Single ontology (by short name or by TTL path):

```bash
python -m src.agents.scripts_and_prompts_generation.mcp_underlying_script_creation_agent \
  --ontology ontosynthesis \
  --model gpt-5 \
  --split
```

These convenience wrappers help you (a) regenerate the full “pipeline artefacts” and (b) reset the workspace back to a clean state.
- Generates candidate artefacts via `generation_main` (iterations, prompts, MCP scripts, generated MCP config)
- Generates top-entity parsing SPARQL (writes into `ai_generated_contents/`)
- Promotes candidate prompts + iterations into `ai_generated_contents/` (what the runtime pipeline reads by default)
- Rewires runtime MCP configs to use the newly generated MCP servers
```bash
bash scripts/rebuild_pipeline_artifacts.sh
```

Optional flags:

```bash
bash scripts/rebuild_pipeline_artifacts.sh --model gpt-5
bash scripts/rebuild_pipeline_artifacts.sh --direct --model gpt-4o
bash scripts/rebuild_pipeline_artifacts.sh --model gpt-5.2 --test
bash scripts/rebuild_pipeline_artifacts.sh --no-promote
bash scripts/rebuild_pipeline_artifacts.sh --no-rewire-mcp
```

Main-only (reuse existing candidate scripts, regenerate only `main.py`):

```bash
bash scripts/rebuild_pipeline_artifacts.sh --test --ontology ontosynthesis --model gpt-4.1 --main-only
```

Notes:
- Script generation is direct by default (no MCP/Docker required for code output). To force agent/MCP script generation (which requires Docker), run the Python entrypoint with `--agent-scripts`.
If you already have a generated MCP server and want the KG construction pipeline to use it without rerunning any LLM generation, use:
```bash
# Use the already-generated *candidate* MCP server for ontosynthesis
python scripts/rewire_pipeline_mcp.py \
  --ontology ontosynthesis \
  --tree candidate \
  --mcp-set run_created_mcp.json \
  --update-meta-task
```

To switch back to the production tree (`ai_generated_contents/`):
```bash
python scripts/rewire_pipeline_mcp.py \
  --ontology ontosynthesis \
  --tree production \
  --mcp-set run_created_mcp.json \
  --update-meta-task
```

Notes:

- This updates `configs/run_created_mcp.json` (and optionally `configs/meta_task/meta_task_config.json`) and writes timestamped `.bak.*` backups.
- This does not generate or modify any MCP code; it only changes which module is launched for `llm_created_mcp`.
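For intuition, the backup-then-rewrite pattern such a rewire step follows, sketched in Python (illustrative, not the actual `scripts/rewire_pipeline_mcp.py`; the config shape is hypothetical, only the `llm_created_mcp` key name comes from the notes above):

```python
import json
import shutil
import time
from pathlib import Path

def rewire(config_path: Path, module: str) -> None:
    """Point the llm_created_mcp entry at a new module, keeping a backup."""
    # Timestamped backup, e.g. run_created_mcp.json.bak.1712345678
    backup = config_path.parent / f"{config_path.name}.bak.{int(time.time())}"
    shutil.copy2(config_path, backup)

    config = json.loads(config_path.read_text())
    # Hypothetical shape: {"llm_created_mcp": {"module": "..."}}
    config.setdefault("llm_created_mcp", {})["module"] = module
    config_path.write_text(json.dumps(config, indent=2))

if __name__ == "__main__":
    rewire(Path("configs/run_created_mcp.json"),
           "ai_generated_contents.scripts.ontosynthesis.main")  # hypothetical module path
```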
- Dry-run first (prints what would be deleted):

```bash
bash scripts/cleanup_results_and_raw_data.sh
```

- Actually delete (irreversible):

```bash
bash scripts/cleanup_results_and_raw_data.sh --real
```

By default it keeps only the DOI mapped to hash `0c57bac8` in `raw_data/`. You can override:

```bash
bash scripts/cleanup_results_and_raw_data.sh --keep-hash 0c57bac8 --real
```
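For intuition, the dry-run-by-default pattern the cleanup script follows, sketched in Python (illustrative only; `scripts/cleanup_results_and_raw_data.sh` is the real tool and its selection logic may differ):

```python
import argparse
import shutil
from pathlib import Path

def cleanup(raw_data: Path, keep_hash: str, real: bool) -> None:
    """Delete everything under raw_data except entries matching keep_hash."""
    for entry in sorted(raw_data.iterdir()):
        if keep_hash in entry.name:
            continue  # keep the mapped DOI
        if real:
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()
            print(f"deleted {entry}")
        else:
            print(f"would delete {entry}")  # dry run: report only

if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("--keep-hash", default="0c57bac8")
    ap.add_argument("--real", action="store_true")  # default is dry run
    args = ap.parse_args()
    cleanup(Path("raw_data"), args.keep_hash, args.real)
```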
The recommended pre-flight check before running any expensive generation:

```bash
bash scripts/test_generation_pipeline.sh
```

This makes one small LLM call to generate a tiny Python file and verifies that it compiles:

```bash
bash scripts/test_llm_smoke.sh gpt-5.2
```
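Under the hood, this kind of smoke test is just one round-trip plus a `compile()` check. A minimal sketch, assuming an OpenAI-compatible endpoint configured via the `API_KEY`/`BASE_URL` from `.env` (model name and prompt are placeholders):

```python
import os

from openai import OpenAI  # assumes the openai>=1.0 client

client = OpenAI(api_key=os.environ["API_KEY"], base_url=os.environ.get("BASE_URL"))

# One tiny call: ask for a trivial Python snippet...
resp = client.chat.completions.create(
    model="gpt-5.2",  # placeholder model name
    messages=[{"role": "user", "content": "Write a one-line Python print statement. Code only, no Markdown."}],
)
code = resp.choices[0].message.content or ""

# ...and verify it at least compiles (real scripts usually strip
# Markdown fences first; compile() raises SyntaxError on failure).
compile(code, "<llm-smoke>", "exec")
print("smoke test passed")
```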