diff --git a/AGENTS.md b/AGENTS.md index b5eecb4..ebbd975 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -56,6 +56,21 @@ DOC2X_API_KEY=sk-xxx npm run build && npm start - Zod URL validation should use `z.url()` (for example via `.pipe(z.url())`) instead of deprecated `z.string().url()`. +## Release Checklist + +When bumping the package version, always update all three of the following together: + +1. **`package.json`** — update `"version"` field. +2. **`CHANGELOG.md`** — add a new section `## [x.y.z] - YYYY-MM-DD` with a summary of changes. Move items from `Unreleased` if applicable. +3. **`README.md` / `README_EN.md`** — if any tool names, parameters, env vars, or workflows changed, sync the relevant sections. + +After releasing a new version, remind users to re-run the Skill install command to pick up the latest tool descriptions: + +```bash +# One-command update (no clone needed) +curl -fsSL https://raw.githubusercontent.com/NoEdgeAI/doc2x-mcp/main/scripts/install-skill.sh | sh +``` + ## Commit & Pull Request Guidelines - Use Conventional Commits style (e.g., `feat: ...`, `fix: ...`, `docs: ...`). diff --git a/CHANGELOG.md b/CHANGELOG.md new file mode 100644 index 0000000..dc68416 --- /dev/null +++ b/CHANGELOG.md @@ -0,0 +1,39 @@ +# Changelog + +All notable changes to this project will be documented in this file. 
+ +## [0.1.4] - Unreleased + +- feat: add project icon (`icon.png`) +- chore: upgrade `@modelcontextprotocol/sdk` to fix vulnerabilities +- feat: add display page support + +## [0.1.3] - 2026-02-28 + +- feat: add v3-2026 parse model support (`doc2x_parse_pdf_submit`, `doc2x_parse_pdf_wait_text`) +- feat: add `doc2x_materialize_pdf_layout_json` tool for v3 layout JSON materialization +- feat: restructure source packages for better maintainability +- fix: support explicit `v2` parse model parameter + +## [0.1.2] - 2026-01-19 + +- feat: add Skill installation scripts (Bash, PowerShell 7+, Windows PowerShell 5.1) +- fix: install skill shell script issues +- fix: update skill installation category from `local` to `public` +- fix: restrict `doc2x_parse_pdf_status` response to status fields only +- chore: streamline CI workflow + +## [0.1.1] - 2026-01-17 + +- feat: cap parse output via `DOC2X_PARSE_PDF_MAX_OUTPUT_CHARS` and `DOC2X_PARSE_PDF_MAX_OUTPUT_PAGES` +- feat: improve developer ergonomics for MCP tools +- ci: set up GitHub Actions publish and build workflows + +## [0.1.0] - Initial release + +- feat: initial Doc2x MCP server implementation +- feat: PDF parse tools (`submit` / `status` / `wait_text`) +- feat: export tools (`submit` / `result` / `wait`) +- feat: image layout parse tools (sync / async) +- feat: download tools (`download_url_to_file`, `materialize_convert_zip`) +- feat: `doc2x_debug_config` diagnostics tool \ No newline at end of file diff --git a/README.md b/README.md index 3427399..2df9833 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,16 @@ # Doc2x MCP Server +


+ [![CI](https://github.com/NoEdgeAI/doc2x-mcp/actions/workflows/ci.yml/badge.svg)](https://github.com/NoEdgeAI/doc2x-mcp/actions/workflows/ci.yml) [![Publish](https://github.com/NoEdgeAI/doc2x-mcp/actions/workflows/publish.yml/badge.svg)](https://github.com/NoEdgeAI/doc2x-mcp/actions/workflows/publish.yml) [![npm version](https://img.shields.io/npm/v/%40noedgeai-org%2Fdoc2x-mcp)](https://www.npmjs.com/package/@noedgeai-org/doc2x-mcp) @@ -22,6 +33,7 @@ - [安装本仓库 Skill(可选)](#安装本仓库-skill可选) - [安全与排错](#安全与排错) - [问题反馈](#问题反馈) +- [Changelog](./CHANGELOG.md) - [License](#license) ## 项目定位 @@ -93,7 +105,7 @@ MCP client 指向本地构建产物: | 阶段 | Tools | 说明 | | --- | --- | --- | -| PDF 解析 | `doc2x_parse_pdf_submit` / `doc2x_parse_pdf_status` / `doc2x_parse_pdf_wait_text` | 提交任务、查询状态、等待并取文本 | +| PDF 解析 | `doc2x_parse_pdf_submit` / `doc2x_parse_pdf_status` / `doc2x_parse_pdf_wait_text` / `doc2x_materialize_pdf_layout_json` | 提交任务、查询状态、等待并取文本,或将 v3 layout 结果落盘为本地 JSON | | 结果导出 | `doc2x_convert_export_submit` / `doc2x_convert_export_result` / `doc2x_convert_export_wait` | 发起导出、查结果、等待导出完成 | | 下载落盘 | `doc2x_download_url_to_file` / `doc2x_materialize_convert_zip` | 下载 URL 到本地、解包 convert zip | | 图片版面解析 | `doc2x_parse_image_layout_sync` / `doc2x_parse_image_layout_submit` / `doc2x_parse_image_layout_status` / `doc2x_parse_image_layout_wait_text` | 同步/异步图片 OCR 与版面解析 | @@ -102,7 +114,7 @@ MCP client 指向本地构建产物: ### PDF 解析模型(`doc2x_parse_pdf_submit` / `doc2x_parse_pdf_wait_text`) - 可选参数:`model` -- 可选值:`v3-2026`(最新模型) +- 可选值:`v2`(默认) / `v3-2026`(最新模型) - 不传时默认 `v2` ```json @@ -111,6 +123,23 @@ MCP client 指向本地构建产物: } ``` +### PDF Layout JSON 落盘(`doc2x_materialize_pdf_layout_json`) + +- 必选参数:`output_path` +- `uid` 与 `pdf_path` 二选一 +- `v2` 不支持 `layout`;需要 `pages[].layout` 时请使用 `v3-2026` +- 若传 `pdf_path` 但不传 `model`,该工具默认使用 `v3-2026` +- 成功时将原始 `result` JSON 写到本地 + +`layout` 是页面块结构和坐标信息,适合 figure/table 裁剪、区域高亮、结构化抽取和版面分析;如果只想看正文内容,优先使用 Markdown / DOCX 导出。 + +```json +{ + "pdf_path": 
"/absolute/path/to/input.pdf", + "output_path": "/absolute/path/to/input_v3.layout.json" +} +``` + ### 导出公式参数(`doc2x_convert_export_submit` / `doc2x_convert_export_wait`) - 必选参数:`formula_mode`(`normal` / `dollar`) @@ -134,6 +163,12 @@ MCP client 指向本地构建产物: 1. `doc2x_parse_image_layout_sync` 直接同步解析。 2. 若需要稳态轮询,改用 submit/status/wait 组合。 +### 工作流 3:PDF -> v3 layout JSON 本地文件 + +1. 调用 `doc2x_materialize_pdf_layout_json`,传入 `pdf_path` 和 `output_path`。 +2. 工具会等待 parse 成功,并将原始 `result` JSON 落到本地。 +3. 该 JSON 可直接提供给后续 figure/table 裁剪脚本使用。 + ## 本地开发 ### 环境要求 @@ -191,7 +226,9 @@ pnpm audit --prod --audit-level high ## 安装本仓库 Skill(可选) -用于给 Codex CLI / Claude Code 增加一个“教大模型如何使用 doc2x-mcp tools 的 Skill”。 +用于给 Codex CLI / Claude Code 增加一个"教大模型如何使用 doc2x-mcp tools 的 Skill"。 + +> **提示:** 每次升级 `doc2x-mcp` 版本后,建议重新运行安装命令以更新 Skill,确保大模型使用最新的 tool 描述与工作流。 不需要 clone 仓库的一键安装(推荐): diff --git a/README_EN.md b/README_EN.md index 6000910..4333289 100644 --- a/README_EN.md +++ b/README_EN.md @@ -1,5 +1,16 @@ # Doc2x MCP Server +


+ [![CI](https://github.com/NoEdgeAI/doc2x-mcp/actions/workflows/ci.yml/badge.svg)](https://github.com/NoEdgeAI/doc2x-mcp/actions/workflows/ci.yml) [![Publish](https://github.com/NoEdgeAI/doc2x-mcp/actions/workflows/publish.yml/badge.svg)](https://github.com/NoEdgeAI/doc2x-mcp/actions/workflows/publish.yml) [![npm version](https://img.shields.io/npm/v/%40noedgeai-org%2Fdoc2x-mcp)](https://www.npmjs.com/package/@noedgeai-org/doc2x-mcp) @@ -22,6 +33,7 @@ A stdio-based MCP Server that wraps Doc2x v2 PDF/image capabilities into stable, - [Install Repo Skill (Optional)](#install-repo-skill-optional) - [Security and Troubleshooting](#security-and-troubleshooting) - [Getting Help](#getting-help) +- [Changelog](./CHANGELOG.md) - [License](#license) ## Project Scope @@ -93,7 +105,7 @@ Point MCP client to your local build output: | Stage | Tools | Purpose | | --- | --- | --- | -| PDF parse | `doc2x_parse_pdf_submit` / `doc2x_parse_pdf_status` / `doc2x_parse_pdf_wait_text` | Submit parse tasks, check status, wait and fetch text | +| PDF parse | `doc2x_parse_pdf_submit` / `doc2x_parse_pdf_status` / `doc2x_parse_pdf_wait_text` / `doc2x_materialize_pdf_layout_json` | Submit parse tasks, check status, wait and fetch text, or materialize v3 layout JSON locally | | Export | `doc2x_convert_export_submit` / `doc2x_convert_export_result` / `doc2x_convert_export_wait` | Start export, read export result, wait for completion | | Download | `doc2x_download_url_to_file` / `doc2x_materialize_convert_zip` | Download export URL to local path, materialize convert zip | | Image layout parse | `doc2x_parse_image_layout_sync` / `doc2x_parse_image_layout_submit` / `doc2x_parse_image_layout_status` / `doc2x_parse_image_layout_wait_text` | Sync/async OCR and layout parse for images | @@ -102,7 +114,7 @@ Point MCP client to your local build output: ### PDF Parse Model (`doc2x_parse_pdf_submit` / `doc2x_parse_pdf_wait_text`) - Optional parameter: `model` -- Supported value: `v3-2026` (latest model) +- 
Supported values: `v2` (default) / `v3-2026` (latest model) - Default (when omitted): `v2` ```json @@ -111,6 +123,23 @@ Point MCP client to your local build output: } ``` +### PDF Layout JSON Materialization (`doc2x_materialize_pdf_layout_json`) + +- Required: `output_path` +- Provide either `uid` or `pdf_path` +- `v2` does not support `layout`; use `v3-2026` when `pages[].layout` is required +- When `pdf_path` is used and `model` is omitted, this tool defaults to `v3-2026` +- On success it writes the raw parse `result` JSON locally + +`layout` contains page block structure and coordinates, which is useful for figure/table crops, region highlighting, structured extraction, and layout analysis. If the goal is readable full text, prefer Markdown / DOCX export. + +```json +{ + "pdf_path": "/absolute/path/to/input.pdf", + "output_path": "/absolute/path/to/input_v3.layout.json" +} +``` + ### Export Formula Parameters (`doc2x_convert_export_submit` / `doc2x_convert_export_wait`) - Required: `formula_mode` (`normal` / `dollar`) @@ -134,6 +163,12 @@ Point MCP client to your local build output: 1. Use `doc2x_parse_image_layout_sync` for direct parse. 2. For robust polling behavior, switch to submit/status/wait flow. +### Workflow 3: PDF -> local v3 layout JSON + +1. Call `doc2x_materialize_pdf_layout_json` with `pdf_path` and `output_path`. +2. The tool waits for parse success and writes the raw `result` JSON locally. +3. The saved JSON can be consumed directly by downstream figure/table crop scripts. + ## Local Development ### Requirements @@ -193,6 +228,8 @@ pnpm audit --prod --audit-level high Installs a reusable skill for Codex CLI / Claude Code to guide tool usage with the standard `submit/status/wait/export/download` workflow. +> **Note:** After upgrading `doc2x-mcp` to a new version, re-run the install command to update the Skill and ensure the model uses the latest tool descriptions and workflows. 
+ One-command install without cloning (recommended): ```bash diff --git a/icon.png b/icon.png new file mode 100644 index 0000000..3ce82ca Binary files /dev/null and b/icon.png differ diff --git a/package.json b/package.json index cfd7dd1..20fc246 100644 --- a/package.json +++ b/package.json @@ -1,6 +1,6 @@ { "name": "@noedgeai-org/doc2x-mcp", - "version": "0.1.3", + "version": "0.1.4", "description": "Doc2x MCP server (stdio, MCP SDK).", "license": "MIT", "engines": { @@ -21,7 +21,8 @@ "./scripts/install-skill.ps1", "./scripts/install-skill-winps.ps1", "./skills/doc2x-mcp/SKILL.md", - "./package.json" + "./package.json", + "./icon.png" ], "scripts": { "build": "node ./node_modules/typescript/bin/tsc -p tsconfig.json", @@ -31,13 +32,13 @@ "skill:install:ps": "pwsh -NoProfile -ExecutionPolicy Bypass -File scripts/install-skill.ps1", "skill:install:winps": "powershell -NoProfile -ExecutionPolicy Bypass -File scripts/install-skill-winps.ps1", "start": "node dist/index.js", - "test:unit": "npm run build && node --test test/unit/registerToolsShared.test.js", + "test:unit": "npm run build && node --test test/unit/registerToolsShared.test.js test/unit/materialize.test.js", "test:e2e": "npm run build && node --test test/e2e/mcpServer.e2e.test.js", "test": "npm run test:unit && npm run test:e2e", "prepublishOnly": "pnpm run build" }, "dependencies": { - "@modelcontextprotocol/sdk": "1.26.0", + "@modelcontextprotocol/sdk": "1.27.1", "@types/lodash": "^4.17.23", "lodash": "4.17.23", "lru-cache": "^11.2.6", diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml index 9cf45d5..5ece15d 100644 --- a/pnpm-lock.yaml +++ b/pnpm-lock.yaml @@ -9,8 +9,8 @@ importers: .: dependencies: '@modelcontextprotocol/sdk': - specifier: 1.26.0 - version: 1.26.0(zod@4.3.6) + specifier: 1.27.1 + version: 1.27.1(zod@4.3.6) '@types/lodash': specifier: ^4.17.23 version: 4.17.23 @@ -36,14 +36,14 @@ importers: packages: - '@hono/node-server@1.19.9': - resolution: {integrity: 
sha512-vHL6w3ecZsky+8P5MD+eFfaGTyCeOHUIFYMGpQGbrBTSmNNoxv0if69rEZ5giu36weC5saFuznL411gRX7bJDw==} + '@hono/node-server@1.19.11': + resolution: {integrity: sha512-dr8/3zEaB+p0D2n/IUrlPF1HZm586qgJNXK1a9fhg/PzdtkK7Ksd5l312tJX2yBuALqDYBlG20QEbayqPyxn+g==} engines: {node: '>=18.14.1'} peerDependencies: hono: ^4 - '@modelcontextprotocol/sdk@1.26.0': - resolution: {integrity: sha512-Y5RmPncpiDtTXDbLKswIJzTqu2hyBKxTNsgKqKclDbhIgg1wgtf1fRuvxgTnRfcnxtvvgbIEcqUOzZrJ6iSReg==} + '@modelcontextprotocol/sdk@1.27.1': + resolution: {integrity: sha512-sr6GbP+4edBwFndLbM60gf07z0FQ79gaExpnsjMGePXqFcSSb7t6iscpjk9DhFhwd+mTEQrzNafGP8/iGGFYaA==} engines: {node: '>=18'} peerDependencies: '@cfworker/json-schema': ^4.1.1 @@ -70,8 +70,8 @@ packages: ajv: optional: true - ajv@8.17.1: - resolution: {integrity: sha512-B/gBuNg5SiMTrPkC+A2+cW0RszwxYmn6VYxB/inlBStS5nx6xHIt/ehKRhIMhqusl7a8LjQoZnjCs5vhwxOQ1g==} + ajv@8.18.0: + resolution: {integrity: sha512-PlXPeEWMXMZ7sPYOHqmDyCJzcfNrUr3fGNKtezX14ykXOEIvyK81d+qydx89KY5O71FKMPaQ2vBfBFI5NHR63A==} body-parser@2.2.2: resolution: {integrity: sha512-oP5VkATKlNwcgvxi0vM0p/D3n2C3EReYVX+DNYs5TjZFn/oQt2j+4sVJtSMr18pdRr8wjTcBl6LoV+FUwzPmNA==} @@ -164,8 +164,8 @@ packages: resolution: {integrity: sha512-CRT1WTyuQoD771GW56XEZFQ/ZoSfWid1alKGDYMmkt2yl8UXrVR4pspqWNEcqKvVIzg6PAltWjxcSSPrboA4iA==} engines: {node: '>=18.0.0'} - express-rate-limit@8.2.1: - resolution: {integrity: sha512-PCZEIEIxqwhzw4KF0n7QF4QqruVTcF73O5kFKUnGOyjbCCgizBBiFaYpd/fnBLUMPw/BWw9OsiN7GgrNYr7j6g==} + express-rate-limit@8.3.1: + resolution: {integrity: sha512-D1dKN+cmyPWuvB+G2SREQDzPY1agpBIcTa9sJxOPMCNeH3gwzhqJRDWCXW3gg0y//+LQ/8j52JbMROWyrKdMdw==} engines: {node: '>= 16'} peerDependencies: express: '>= 4.11' @@ -215,8 +215,8 @@ packages: resolution: {integrity: sha512-0hJU9SCPvmMzIBdZFqNPXWa6dqh7WdH0cII9y+CyS8rG3nL48Bclra9HmKhVVUHyPWNH5Y7xDwAB7bfgSjkUMQ==} engines: {node: '>= 0.4'} - hono@4.11.9: - resolution: {integrity: 
sha512-Eaw2YTGM6WOxA6CXbckaEvslr2Ne4NFsKrvc0v97JD5awbmeBLO5w9Ho9L9kmKonrwF9RJlW6BxT1PVv/agBHQ==} + hono@4.12.7: + resolution: {integrity: sha512-jq9l1DM0zVIvsm3lv9Nw9nlJnMNPOcAtsbsgiUhWcFzPE99Gvo6yRTlszSLLYacMeQ6quHD6hMfId8crVHvexw==} engines: {node: '>=16.9.0'} http-errors@2.0.1: @@ -230,8 +230,8 @@ packages: inherits@2.0.4: resolution: {integrity: sha512-k/vGaX4/Yla3WzyMCvTQOXYeIHvqOKtnqBduzTHpzpQZzAskKMhZ2K+EnBiSM9zGSoIFeMpXKxa4dYeZIQqewQ==} - ip-address@10.0.1: - resolution: {integrity: sha512-NWv9YLW4PoW2B7xtzaS3NCot75m6nK7Icdv0o3lfMceJVRfSoQwqD4wEH5rLwoKJwUiZ/rfpiVBhnaF0FK4HoA==} + ip-address@10.1.0: + resolution: {integrity: sha512-XXADHxXmvT9+CRxhXg56LJovE+bmWnEWB78LB83VZTprKTmaC5QfruXocxzTZ2Kl0DNwKuBdlIhjL8LeY8Sf8Q==} engines: {node: '>= 12'} ipaddr.js@1.9.1: @@ -244,8 +244,8 @@ packages: isexe@2.0.0: resolution: {integrity: sha512-RHxMLp9lnKHGHRng9QFhRCMbYAcVpn69smSGcq3f36xjgVVWThj4qqLbTLlq7Ssj8B+fIQ1EuCEGI2lKsyQeIw==} - jose@6.1.3: - resolution: {integrity: sha512-0TpaTfihd4QMNwrz/ob2Bp7X04yuxJkjRGi4aKmOqwhov54i6u79oCv7T+C7lo70MKH6BesI3vscD1yb/yzKXQ==} + jose@6.2.1: + resolution: {integrity: sha512-jUaKr1yrbfaImV7R2TN/b3IcZzsw38/chqMpo2XJ7i2F8AfM/lA4G1goC3JVEwg0H7UldTmSt3P68nt31W7/mw==} json-schema-traverse@1.0.0: resolution: {integrity: sha512-NM8/P9n3XjXhIZn1lLhkFaACTOURQXjWhV4BA/RnOv8xvgqtqpAX9IO4mRQxSx1Rlo4tqzeqb0sOlruaOy3dug==} @@ -326,8 +326,8 @@ packages: resolution: {integrity: sha512-llQsMLSUDUPT44jdrU/O37qlnifitDP+ZwrmmZcoSKyLKvtZxpyV0n2/bD/N4tBAAZ/gJEdZU7KMraoK1+XYAg==} engines: {node: '>= 0.10'} - qs@6.14.2: - resolution: {integrity: sha512-V/yCWTTF7VJ9hIh18Ugr2zhJMP01MY7c5kh4J870L7imm6/DIzBsNLTXzMwUA3yZ5b/KBqLx8Kp3uRvd7xSe3Q==} + qs@6.15.0: + resolution: {integrity: sha512-mAZTtNCeetKMH+pSjrb76NAM8V9a05I9aBZOHztWy/UqcJdQYNsf59vrRKWnojAT9Y+GbIvoTBC++CPHqpDBhQ==} engines: {node: '>=0.6'} range-parser@1.2.1: @@ -430,24 +430,24 @@ packages: snapshots: - '@hono/node-server@1.19.9(hono@4.11.9)': + '@hono/node-server@1.19.11(hono@4.12.7)': 
dependencies: - hono: 4.11.9 + hono: 4.12.7 - '@modelcontextprotocol/sdk@1.26.0(zod@4.3.6)': + '@modelcontextprotocol/sdk@1.27.1(zod@4.3.6)': dependencies: - '@hono/node-server': 1.19.9(hono@4.11.9) - ajv: 8.17.1 - ajv-formats: 3.0.1(ajv@8.17.1) + '@hono/node-server': 1.19.11(hono@4.12.7) + ajv: 8.18.0 + ajv-formats: 3.0.1(ajv@8.18.0) content-type: 1.0.5 cors: 2.8.6 cross-spawn: 7.0.6 eventsource: 3.0.7 eventsource-parser: 3.0.6 express: 5.2.1 - express-rate-limit: 8.2.1(express@5.2.1) - hono: 4.11.9 - jose: 6.1.3 + express-rate-limit: 8.3.1(express@5.2.1) + hono: 4.12.7 + jose: 6.2.1 json-schema-typed: 8.0.2 pkce-challenge: 5.0.1 raw-body: 3.0.2 @@ -467,11 +467,11 @@ snapshots: mime-types: 3.0.2 negotiator: 1.0.0 - ajv-formats@3.0.1(ajv@8.17.1): + ajv-formats@3.0.1(ajv@8.18.0): optionalDependencies: - ajv: 8.17.1 + ajv: 8.18.0 - ajv@8.17.1: + ajv@8.18.0: dependencies: fast-deep-equal: 3.1.3 fast-uri: 3.1.0 @@ -486,7 +486,7 @@ snapshots: http-errors: 2.0.1 iconv-lite: 0.7.2 on-finished: 2.4.1 - qs: 6.14.2 + qs: 6.15.0 raw-body: 3.0.2 type-is: 2.0.1 transitivePeerDependencies: @@ -557,10 +557,10 @@ snapshots: dependencies: eventsource-parser: 3.0.6 - express-rate-limit@8.2.1(express@5.2.1): + express-rate-limit@8.3.1(express@5.2.1): dependencies: express: 5.2.1 - ip-address: 10.0.1 + ip-address: 10.1.0 express@5.2.1: dependencies: @@ -584,7 +584,7 @@ snapshots: once: 1.4.0 parseurl: 1.3.3 proxy-addr: 2.0.7 - qs: 6.14.2 + qs: 6.15.0 range-parser: 1.2.1 router: 2.2.0 send: 1.2.1 @@ -642,7 +642,7 @@ snapshots: dependencies: function-bind: 1.1.2 - hono@4.11.9: {} + hono@4.12.7: {} http-errors@2.0.1: dependencies: @@ -658,7 +658,7 @@ snapshots: inherits@2.0.4: {} - ip-address@10.0.1: {} + ip-address@10.1.0: {} ipaddr.js@1.9.1: {} @@ -666,7 +666,7 @@ snapshots: isexe@2.0.0: {} - jose@6.1.3: {} + jose@6.2.1: {} json-schema-traverse@1.0.0: {} @@ -719,7 +719,7 @@ snapshots: forwarded: 0.2.0 ipaddr.js: 1.9.1 - qs@6.14.2: + qs@6.15.0: dependencies: side-channel: 1.1.0 diff 
--git a/skills/doc2x-mcp/SKILL.md b/skills/doc2x-mcp/SKILL.md index 34c667b..2ae3e32 100644 --- a/skills/doc2x-mcp/SKILL.md +++ b/skills/doc2x-mcp/SKILL.md @@ -1,173 +1,116 @@ --- name: doc2x-mcp -description: 使用 Doc2x MCP 工具完成文档解析与转换:对 PDF/扫描件/图片做 OCR 与版面解析,抽取文本/表格,导出为 Markdown/LaTeX(TeX)/DOCX 并下载落盘(submit/status/wait/export/download)。当用户提到 PDF/pdfs、scanned PDF、OCR、image-to-text、extract text/tables、表格抽取、文档转换/convert、导出/export、Markdown、LaTeX/TeX、DOCX、doc2x、doc2x-mcp、MCP 时使用。 +description: 使用 Doc2x MCP 工具处理 PDF、扫描件和图片:提交解析、查询状态、等待文本、导出 Markdown/LaTeX/DOCX、下载落盘,以及将 PDF v3 layout 结果写为本地 JSON。用户提到 PDF、OCR、scan/scanned PDF、image-to-text、extract text/tables、表格抽取、layout、Markdown、LaTeX/TeX、DOCX、doc2x、doc2x-mcp、MCP、figure/table crop、v3 JSON 时使用。 --- -# Doc2x MCP Tool-Use Skill (for LLM) +# Doc2x MCP -## 你要做什么 +## 目的 -你是一个会调用 MCP tools 的助手。凡是涉及 PDF/图片的“解析/抽取/导出/下载”,必须通过 `doc2x-mcp` tools 执行真实操作: +凡是“解析 PDF/图片、抽取文本/表格、导出文档、下载结果、获取 v3 layout JSON”的请求,都应通过 `doc2x-mcp` tools 执行真实操作,不要臆造 `uid`、`url`、文件内容或导出结果。 -- 不要臆测/伪造 `uid`、`url`、文件内容或导出结果 -- 不要跳过工具步骤直接输出“看起来合理”的内容 +## 必须遵守 -## 全局约束(必须遵守) +1. 所有文件路径都用绝对路径:`pdf_path`、`image_path`、`output_path`、`output_dir`。 +2. 不要伪造下载 URL;只能使用 `doc2x_convert_export_*` 返回的 `url`。 +3. 同一个 `uid` 的同一组导出参数不要并发重复提交。 +4. 同一个 `uid` 做多档导出对比时,必须按“导出成功 -> 立即下载 -> 再导出下一档”执行,避免结果覆盖。 +5. 不要回显 `DOC2X_API_KEY`;排错只用 `doc2x_debug_config` 的摘要信息。 +6. `model` 只用于 PDF 解析提交;`formula_level` 只用于导出,且仅在源解析为 `v3-2026` 时有效。 +7. `doc2x_parse_pdf_wait_text` 只适合预览或摘要;需要完整结果时优先导出文件。 +8. 需要 PDF v3 block/layout 坐标时,不要从文本结果推断,直接使用 `doc2x_materialize_pdf_layout_json`。 -1. 路径必须是绝对路径 - `pdf_path` / `image_path` / `output_path` / `output_dir` 都应使用绝对路径;相对路径可能会被 server 以意外的 cwd 解析导致失败。 +## 参数边界 -2. 
扩展名约束 - `doc2x_parse_pdf_submit.pdf_path` 必须以 `.pdf` 结尾;图片解析使用 `png/jpg`。 +- PDF 解析:`doc2x_parse_pdf_submit` 和 `doc2x_parse_pdf_wait_text(pdf_path 分支)` 可传 `model: "v2" | "v3-2026"`;不传默认 `v2`。 +- PDF layout JSON:`doc2x_materialize_pdf_layout_json` 在 `pdf_path` 分支默认使用 `v3-2026`,并要求返回结果包含 `pages[].layout`。 +- 导出:`formula_mode` 建议总是显式传入。 +- `formula_level` 必须传数字 `0 | 1 | 2`,不要传字符串。 +- 图片解析路径只接受 `png/jpg/jpeg`;PDF 路径必须以 `.pdf` 结尾;layout JSON 输出路径应以 `.json` 结尾。 -3. 不要并发重复提交导出 - 同一个 `uid` 对同一种导出配置(`to + formula_mode + formula_level (+ filename + filename_mode + merge_cross_page_forms...)`)不要并行重复 submit。 - 补充:同一 `uid + to` 的导出结果可能会被后一次覆盖;做“多档对比”(如 `formula_level=0/1/2`)时,必须按 **导出成功 → 立即下载落盘 → 再导出下一档** 的顺序执行。 +## 按目标选 Tool -4. 不要泄露密钥 - 永远不要回显/记录 `DOC2X_API_KEY`。排错只用 `doc2x_debug_config` 的 `apiKeyLen/apiKeyPrefix/apiKeySource`。 +- 提交 PDF 解析:`doc2x_parse_pdf_submit` +- 查看 PDF 状态:`doc2x_parse_pdf_status` +- 取 PDF 文本预览:`doc2x_parse_pdf_wait_text` +- 导出 PDF 为 `md/tex/docx`:`doc2x_convert_export_wait` +- 下载导出文件:`doc2x_download_url_to_file` +- 落盘 PDF v3 layout JSON:`doc2x_materialize_pdf_layout_json` +- 图片版面解析原始结果:`doc2x_parse_image_layout_sync` +- 图片版面解析并等待首屏 Markdown:`doc2x_parse_image_layout_submit` -> `doc2x_parse_image_layout_wait_text` +- 落盘 `convert_zip`:`doc2x_materialize_convert_zip` +- 配置排错:`doc2x_debug_config` -5. 不要伪造下载 URL - 下载必须使用 `doc2x_convert_export_*` 返回的 `url`;不要自己拼接。 +## 标准流程 -6. 参数生效边界 - `model` 仅用于 PDF 解析提交(默认 `v2`,可选 `v3-2026`);`formula_level` 仅用于导出(`doc2x_convert_export_*`),并且只在源解析任务使用 `v3-2026` 时生效(`v2` 下无效)。 +### 1. 
PDF -> 完整文件 -## 关键参数语义(避免误用) +当用户要完整 Markdown / TeX / DOCX,本流程优先: -- `doc2x_parse_pdf_submit` / `doc2x_parse_pdf_wait_text(pdf_path 提交分支)` - - 可选 `model: "v3-2026"`;不传则默认 `v2`。 -- `doc2x_convert_export_submit` / `doc2x_convert_export_wait` - - `formula_mode`:`"normal"` 或 `"dollar"`(关键参数,建议总是显式传入)。 - - `formula_level`:`0 | 1 | 2`(可选,**数字类型**,不要传字符串 `"0"|"1"|"2"`) - - `0`:不退化公式(保留原始 Markdown) - - `1`:行内公式退化为普通文本(`\(...\)`、`$...$`) - - `2`:行内 + 块级公式全部退化为普通文本(`\(...\)`、`$...$`、`\[...\]`、`$$...$$`) +1. `doc2x_parse_pdf_submit({ pdf_path, model? })` +2. 轮询 `doc2x_parse_pdf_status({ uid })` 直到成功 +3. `doc2x_convert_export_wait({ uid, to, formula_mode, formula_level?, filename?, filename_mode? })` +4. `doc2x_download_url_to_file({ url, output_path })` -## Tool 选择(按用户目标) +说明: -- **PDF 解析任务**:`doc2x_parse_pdf_submit` → `doc2x_parse_pdf_status` -- **少量预览/摘要**:`doc2x_parse_pdf_wait_text`(可能截断;要完整内容请导出文件) -- **导出文件(md/tex/docx)**:`doc2x_convert_export_submit` → `doc2x_convert_export_wait`(或直接 `doc2x_convert_export_wait` 走兼容模式一键导出) -- **下载落盘**:`doc2x_download_url_to_file` -- **图片版面解析**:`doc2x_parse_image_layout_sync` 或 `doc2x_parse_image_layout_submit` → `doc2x_parse_image_layout_wait_text` -- **解包资源 zip**:`doc2x_materialize_convert_zip` -- **配置排错**:`doc2x_debug_config` +- `md/docx` 常用 `formula_mode: "normal"` +- `tex` 常用 `formula_mode: "dollar"` +- 需要完整内容时,不要用 `doc2x_parse_pdf_wait_text` 代替导出 -## 标准工作流(照做) +### 2. PDF -> 文本预览 -### 工作流 A:批量 PDF → 导出文件(MD/TEX/DOCX,高效并行版) +仅在用户要快速预览、摘要、少量文本时使用: -适用于“多个 PDF 批量导出并落盘(.md / .tex / .docx)”。核心原则: +- `doc2x_parse_pdf_wait_text({ pdf_path | uid, max_output_chars?, max_output_pages?, model? 
})` -- `doc2x_parse_pdf_submit` 可并行(批量提交) -- `doc2x_parse_pdf_status` 可并行(批量轮询) -- **流水线式并行**:某个 `uid` 一旦解析成功,立刻开始该 `uid` 的导出+下载(不必等所有 PDF 都解析完) -- 不同 `uid` 的导出与下载可并行 -- **同一个 `uid` 的同一种导出配置(`to + formula_mode + formula_level (+ filename + filename_mode + merge_cross_page_forms...)`)不要并行重复提交** -- 同一个 `uid` 若要导出多种格式(例如 md + docx + tex),建议**按格式串行**,但不同 `uid` 仍可并行 +若出现截断提示,应切回“PDF -> 完整文件”流程。 -**批量提交解析任务(并行)** +### 3. PDF -> v3 layout JSON -- 对每个 `pdf_path` 调用:`doc2x_parse_pdf_submit({ pdf_path, model? })` → `{ uid }` +当用户要 figure/table 坐标、block bbox、layout blocks、后续裁剪脚本输入时使用: -**等待解析完成(并行)** +- 优先:`doc2x_materialize_pdf_layout_json({ uid | pdf_path, output_path, model? })` -- 对每个 `uid` 轮询:`doc2x_parse_pdf_status({ uid })` 直到 `status="success"` -- 若 `status="failed"`:汇报 `detail`,该文件停止后续步骤 +要向用户说明 `layout` 的用途: -**导出目标格式(并行,按 uid)** +- `Markdown/text` 适合阅读正文;`layout` 适合程序继续处理页面结构 +- `layout.blocks[].bbox` 可用于 figure/table 裁剪、区域截图、框选高亮、可视化调试 +- `layout.blocks[].type` 可用于区分标题、正文、表格、图片等块,做结构化抽取 +- `layout` 适合作为后续脚本输入,例如 figure/table crop、block 对齐、版面分析 +- 如果用户只想“看内容”,优先给 Markdown / DOCX;如果用户要“知道内容在页面哪里”,就用 `layout` -推荐用 `doc2x_convert_export_wait` 走“兼容模式一键导出”(当你提供 `formula_mode` 且本进程未提交过该导出时,会自动 submit 一次,然后 wait),避免你手动拆成 submit+wait: +行为要求: -- DOCX:`doc2x_convert_export_wait({ uid, to: "docx", formula_mode: "normal", formula_level? })` → `{ status: "success", url }` -- Markdown:`doc2x_convert_export_wait({ uid, to: "md", formula_mode: "normal", formula_level?, filename?, filename_mode? })` → `{ status: "success", url }` -- LaTeX:`doc2x_convert_export_wait({ uid, to: "tex", formula_mode: "dollar", formula_level? })` → `{ status: "success", url }` +- 走 `pdf_path` 分支时,默认使用 `v3-2026` +- 输出的是原始 parse `result` JSON,而不是精简文本 +- 若返回结果缺少 `pages[].layout`,应视为失败而不是静默降级 -(或显式两步:`doc2x_convert_export_submit(...)` → `doc2x_convert_export_wait({ uid, to })`) +### 4. 
图片 -> 版面结果 -**补充建议** +- 直接拿原始结果:`doc2x_parse_image_layout_sync({ image_path })` +- 等待并取首屏 Markdown:`doc2x_parse_image_layout_submit({ image_path })` -> `doc2x_parse_image_layout_wait_text({ uid })` +- 结果包含 `convert_zip` 且用户要资源落盘时:`doc2x_materialize_convert_zip({ convert_zip_base64, output_dir })` -- `formula_mode` 是关键参数:建议总是显式传入(`"normal"` / `"dollar"`,按用户偏好选择;常见:`md/docx` 用 `"normal"`、`tex` 用 `"dollar"`) -- 需要做公式退化时显式传 `formula_level`(`0/1/2`);若不需要退化,建议显式传 `0`,避免调用端默认值歧义 -- `filename`/`filename_mode` 主要用于 `md/tex`:传不带扩展名的 basename,并配合 `filename_mode: "auto"`(避免 `name.md.md` / `name.tex.tex`) -- 对同一个 `uid` 做多格式导出时,先确定顺序(例如先 md 再 docx),逐个完成再进行下一个格式 -- 对同一个 `uid` 的同一格式做“多档参数对比”(如 `formula_level`),每一档都要先下载再进行下一档,避免覆盖导致误判 +### 5. 批量 PDF -**批量下载(并行)** +批量场景采用流水线,不要全串行: -- `doc2x_download_url_to_file({ url, output_path })` → `{ output_path, bytes_written }` -- `output_path` 必须为绝对路径,且每个文件应唯一(建议用原文件名 + 对应扩展名:`.md` / `.tex` / `.docx`) +1. 多个 `pdf_path` 可并行 `doc2x_parse_pdf_submit` +2. 多个 `uid` 可并行 `doc2x_parse_pdf_status` +3. 某个 `uid` 一旦 parse 成功,立即开始它自己的导出和下载 +4. 不同 `uid` 可并行导出 +5. 同一个 `uid` 的同一种导出配置不要并发 -**并发建议** +## 向用户回报 -- 10 个 PDF 以内通常可以直接并行;更多文件建议分批/限流(避免触发超时/限流) +- 成功时报告:输入文件、`uid`、输出路径、必要时 `bytes_written` +- 失败时报告:错误码、错误消息、相关 `uid`,并指出哪些文件未受影响 +- 当用户目标是“本地文件”时,优先回报落盘结果,不要只贴长文本 -**向用户回报(按文件汇总)** +## 常见错误处理 -- 成功:列出每个输入文件对应的 `output_path` 与 `bytes_written` -- 失败:列出失败文件与错误原因(包含 `uid` 与 `detail`/错误码),并说明其余文件不受影响 - -### 工作流 B:PDF → Markdown 文件(推荐) - -当用户目标是“拿到完整 Markdown / 落盘”,主链路应当是导出与下载,不要依赖 `doc2x_parse_pdf_wait_text`。 - -**提交解析任务** - -- `doc2x_parse_pdf_submit({ pdf_path, model? })` → `{ uid }` - -**等待解析完成** - -- 轮询 `doc2x_parse_pdf_status({ uid })` 直到 `status="success"`(失败则带 `detail` 汇报) - -**导出 Markdown** - -- `doc2x_convert_export_wait({ uid, to: "md", formula_mode: "normal", formula_level?, filename?, filename_mode? 
})` → `{ status: "success", url }` - -**下载落盘** - -- `doc2x_download_url_to_file({ url, output_path })` → `{ output_path, bytes_written }` - -**向用户回报** - -- 回复用户:保存路径、文件大小、`uid`(必要时附上 `url`) - -### 工作流 C:PDF → 文本预览(可控长度) - -当用户只需要“摘要/少量预览”时才用: - -- `doc2x_parse_pdf_wait_text({ pdf_path | uid, max_output_chars?, max_output_pages? })` - -如果返回包含截断提示(`[doc2x-mcp] Output truncated ...`),应切换到“工作流 B”导出 md 获取完整内容。 - -### 工作流 D:PDF 导出格式(MD / TEX / DOCX) - -- Markdown:`to="md"`(完整 Markdown 导出优先参考“工作流 B”) -- LaTeX:`to="tex"` -- Word:`to="docx"` -- 调用链同“工作流 A / B”(先解析 → 再导出 → 再下载),按目标格式调整 `to`(并按需设置 `formula_mode/formula_level/filename`) -- 注意:`doc2x_convert_export_submit.formula_mode` 必填(`"normal"` 或 `"dollar"`);`formula_level` 可选(`0/1/2`) -- 若需要对比不同 `formula_level`,请按顺序执行并在每次导出成功后立即下载,再进行下一档,避免后一次结果覆盖前一次。 - -### 工作流 E:图片 → Markdown(版面解析) - -- 只要结果(同步):`doc2x_parse_image_layout_sync({ image_path })`(返回原始 JSON,可能包含 `convert_zip`) -- 要首屏 markdown(异步):`doc2x_parse_image_layout_submit({ image_path })` → `doc2x_parse_image_layout_wait_text({ uid })` - -如果结果里有 `convert_zip`(base64)且用户希望落盘资源文件: - -- `doc2x_materialize_convert_zip({ convert_zip_base64, output_dir })` → `{ output_dir, zip_path, extracted }` - -## 失败与排错(你应当这样处理) - -1. 鉴权/配置异常 - 先 `doc2x_debug_config()`,确认 `apiKeyLen > 0` 且 `baseUrl/httpTimeoutMs/pollIntervalMs/maxWaitMs` 合理。 - -2. 等待超时 - 建议用户调大 `DOC2X_MAX_WAIT_MS` 或按需调 `DOC2X_POLL_INTERVAL_MS`(不要过于频繁)。 - -3. 下载被阻止(安全策略) - `doc2x_download_url_to_file` 只允许 `https` 且要求 host 在 `DOC2X_DOWNLOAD_URL_ALLOWLIST` 内;被拦截时解释原因,并让用户选择“加 allowlist”或“保持默认安全策略”。 - -4. 用户给的是相对路径/不确定路径 - 要求用户提供绝对路径;不要猜。 +1. 缺参数或路径不合法:提示用户提供绝对路径,不要猜测相对路径。 +2. 等待超时:说明可调大 `DOC2X_MAX_WAIT_MS` 或适度调整轮询间隔。 +3. 下载被策略拦截:解释是 `DOC2X_DOWNLOAD_URL_ALLOWLIST` 限制,不要绕过。 +4. 
认证或配置问题:调用 `doc2x_debug_config`,只汇报 `apiKeySource/apiKeyPrefix/apiKeyLen` 等摘要。 diff --git a/src/doc2x/materialize.ts b/src/doc2x/materialize.ts index 58c6fdb..3737cff 100644 --- a/src/doc2x/materialize.ts +++ b/src/doc2x/materialize.ts @@ -2,6 +2,9 @@ import fsp from 'node:fs/promises'; import path from 'node:path'; import { spawn } from 'node:child_process'; +import { ToolError } from '#errors'; +import { TOOL_ERROR_CODE_INVALID_JSON } from '#errorCodes'; + function spawnUnzip(zipPath: string, outputDir: string): Promise<boolean> { return new Promise((resolve) => { const child = spawn('unzip', ['-o', zipPath, '-d', outputDir], { stdio: 'ignore' }); }); } @@ -10,6 +13,49 @@ +function isRecord(value: unknown): value is Record<string, unknown> { + return value !== null && typeof value === 'object' && !Array.isArray(value); +} + +export function validatePdfLayoutResult(result: unknown, uid?: string) { + if (!isRecord(result)) + throw new ToolError({ + code: TOOL_ERROR_CODE_INVALID_JSON, + message: 'parse result must be a JSON object', + retryable: false, + uid, + }); + + const pages = result.pages; + if (!Array.isArray(pages) || pages.length === 0) + throw new ToolError({ + code: TOOL_ERROR_CODE_INVALID_JSON, + message: 'parse result must contain a non-empty pages array', + retryable: false, + uid, + }); + + for (let i = 0; i < pages.length; i++) { + const page = pages[i]; + if (!isRecord(page)) + throw new ToolError({ + code: TOOL_ERROR_CODE_INVALID_JSON, + message: `pages[${i}] must be an object`, + retryable: false, + uid, + }); + if (!isRecord(page.layout)) + throw new ToolError({ + code: TOOL_ERROR_CODE_INVALID_JSON, + message: `pages[${i}].layout must be an object`, + retryable: false, + uid, + }); + } + + return { result, pageCount: pages.length, hasLayout: true as const }; +} + export async function materializeConvertZip(args: { convert_zip_base64: string; output_dir: string; @@ -22,3 +68,20 @@ export async function
materializeConvertZip(args: { const extracted = await spawnUnzip(zipPath, outDir); return { output_dir: outDir, zip_path: zipPath, extracted }; } + +export async function materializePdfLayoutJson(args: { + result: unknown; + output_path: string; + uid?: string; +}) { + const validated = validatePdfLayoutResult(args.result, args.uid); + const outputPath = path.resolve(args.output_path); + await fsp.mkdir(path.dirname(outputPath), { recursive: true }); + await fsp.writeFile(outputPath, `${JSON.stringify(validated.result, null, 2)}\n`, 'utf8'); + return { + uid: args.uid ?? '', + output_path: outputPath, + page_count: validated.pageCount, + has_layout: validated.hasLayout, + }; +} diff --git a/src/doc2x/pdf.ts b/src/doc2x/pdf.ts index 5fe6c95..1a602ff 100644 --- a/src/doc2x/pdf.ts +++ b/src/doc2x/pdf.ts @@ -16,7 +16,9 @@ import { DOC2X_TASK_STATUS_FAILED, DOC2X_TASK_STATUS_SUCCESS } from '#doc2x/cons import { HTTP_METHOD_GET, HTTP_METHOD_POST } from '#doc2x/http'; import { v2 } from '#doc2x/paths'; -export const PARSE_PDF_MODELS = ['v3-2026'] as const; +export const PARSE_PDF_MODEL_V2 = 'v2' as const; +export const PARSE_PDF_MODEL_V3 = 'v3-2026' as const; +export const PARSE_PDF_MODELS = [PARSE_PDF_MODEL_V2, PARSE_PDF_MODEL_V3] as const; export type ParsePdfModel = (typeof PARSE_PDF_MODELS)[number]; type Doc2xPageResult = { page_idx?: unknown; md?: unknown }; type Doc2xParseResult = { pages?: Doc2xPageResult[] }; @@ -141,17 +143,13 @@ export async function parsePdfStatus(uid: string) { }; } -export async function parsePdfWaitTextByUid(args: { +async function waitForParsePdfSuccessByUid(args: { uid: string; poll_interval_ms?: number; max_wait_ms?: number; - join_with?: string; - max_output_chars?: number; - max_output_pages?: number; }) { const pollInterval = args.poll_interval_ms ?? CONFIG.pollIntervalMs; const maxWait = args.max_wait_ms ?? CONFIG.maxWaitMs; - const joinWith = args.join_with ?? 
'\n\n---\n\n'; const uid = String(args.uid || '').trim(); if (!uid) @@ -182,13 +180,7 @@ export async function parsePdfWaitTextByUid(args: { } throw e; } - if (st.status === DOC2X_TASK_STATUS_SUCCESS) { - const merged = mergePagesToTextWithLimit(st.result, joinWith, { - maxOutputChars: args.max_output_chars, - maxOutputPages: args.max_output_pages, - }); - return { uid, status: DOC2X_TASK_STATUS_SUCCESS, ...merged }; - } + if (st.status === DOC2X_TASK_STATUS_SUCCESS) return st; if (st.status === DOC2X_TASK_STATUS_FAILED) throw new ToolError({ code: TOOL_ERROR_CODE_PARSE_FAILED, @@ -199,3 +191,29 @@ export async function parsePdfWaitTextByUid(args: { await sleep(pollInterval); } } + +export async function parsePdfWaitResultByUid(args: { + uid: string; + poll_interval_ms?: number; + max_wait_ms?: number; +}) { + const st = await waitForParsePdfSuccessByUid(args); + return { uid: st.uid, status: DOC2X_TASK_STATUS_SUCCESS, result: st.result }; +} + +export async function parsePdfWaitTextByUid(args: { + uid: string; + poll_interval_ms?: number; + max_wait_ms?: number; + join_with?: string; + max_output_chars?: number; + max_output_pages?: number; +}) { + const joinWith = args.join_with ?? 
'\n\n---\n\n'; + const st = await waitForParsePdfSuccessByUid(args); + const merged = mergePagesToTextWithLimit(st.result, joinWith, { + maxOutputChars: args.max_output_chars, + maxOutputPages: args.max_output_pages, + }); + return { uid: st.uid, status: DOC2X_TASK_STATUS_SUCCESS, ...merged }; +} diff --git a/src/mcp/registerPdfTools.ts b/src/mcp/registerPdfTools.ts index 71ea002..0608e36 100644 --- a/src/mcp/registerPdfTools.ts +++ b/src/mcp/registerPdfTools.ts @@ -3,16 +3,21 @@ import type { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js'; import { CONFIG } from '#config'; import { isRetryableError } from '#errors'; import { + PARSE_PDF_MODEL_V2, + PARSE_PDF_MODEL_V3, type ParsePdfModel, parsePdfStatus, parsePdfSubmit, + parsePdfWaitResultByUid, parsePdfWaitTextByUid, } from '#doc2x/pdf'; +import { materializePdfLayoutJson } from '#doc2x/materialize'; import { asJsonResult, asTextResult } from '#mcp/results'; import { deleteUidCache, fileSig, getSubmittedUidFromCache, + jsonOutputPathSchema, joinWithSchema, makePdfUidCacheKey, missingEitherFieldError, @@ -37,7 +42,7 @@ export function registerPdfTools(server: McpServer, ctx: RegisterToolsContext) { inputSchema: { pdf_path: pdfPathSchema, model: parsePdfModelSchema.describe( - "Optional parse model. Use 'v3-2026' to try the latest model. Omit this field to use default v2.", + `Optional parse model. Supported values: '${PARSE_PDF_MODEL_V2}' and '${PARSE_PDF_MODEL_V3}'. Omit this field to use default ${PARSE_PDF_MODEL_V2}.`, ), }, }, @@ -92,7 +97,7 @@ export function registerPdfTools(server: McpServer, ctx: RegisterToolsContext) { ), model: parsePdfModelSchema .describe( - "Optional parse model used only when submitting from pdf_path. Use 'v3-2026' to try latest model. Omit this field to use default v2.", + `Optional parse model used only when submitting from pdf_path. Supported values: '${PARSE_PDF_MODEL_V2}' and '${PARSE_PDF_MODEL_V3}'. 
Omit this field to use default ${PARSE_PDF_MODEL_V2}.`, ), }, }, @@ -186,4 +191,86 @@ export function registerPdfTools(server: McpServer, ctx: RegisterToolsContext) { }, ), ); + + server.registerTool( + 'doc2x_materialize_pdf_layout_json', + { + description: + `Wait for a PDF parse task and write the raw Doc2x result JSON (with page layout) to output_path. Prefer passing uid. If only pdf_path is provided, this tool reuses a cached uid or submits a new parse with model='${PARSE_PDF_MODEL_V3}' by default.`, + inputSchema: { + uid: parsePdfUidSchema.optional(), + pdf_path: pdfPathForWaitSchema.optional(), + output_path: jsonOutputPathSchema, + poll_interval_ms: positiveIntMsSchema.optional(), + max_wait_ms: positiveIntMsSchema.optional(), + model: parsePdfModelSchema + .describe( + `Optional parse model used only when submitting from pdf_path. Supported values: '${PARSE_PDF_MODEL_V2}' and '${PARSE_PDF_MODEL_V3}'. Defaults to '${PARSE_PDF_MODEL_V3}' for this tool because ${PARSE_PDF_MODEL_V2} does not return layout.`, + ), + }, + }, + withToolErrorHandling( + async (args: { + uid?: string; + pdf_path?: string; + output_path: string; + poll_interval_ms?: number; + max_wait_ms?: number; + model?: ParsePdfModel; + }) => { + const materializeByUid = async (uid: string) => { + const out = await parsePdfWaitResultByUid({ + uid, + poll_interval_ms: args.poll_interval_ms, + max_wait_ms: args.max_wait_ms, + }); + return await materializePdfLayoutJson({ + uid: out.uid, + result: out.result, + output_path: args.output_path, + }); + }; + + const uid = String(args.uid || '').trim(); + if (uid) return asJsonResult(await materializeByUid(uid)); + + const pdfPath = String(args.pdf_path || '').trim(); + if (!pdfPath) throw missingEitherFieldError('uid', 'pdf_path'); + + const sig = await fileSig(pdfPath); + const model: ParsePdfModel = args.model ?? 
PARSE_PDF_MODEL_V3; + const cacheKey = makePdfUidCacheKey(sig.absPath, model); + const resolvedUid = getSubmittedUidFromCache(ctx, { kind: 'pdf', key: cacheKey, sig }); + const finalUid = resolvedUid || (await parsePdfSubmit(pdfPath, { model })).uid; + setSubmittedUidCache(ctx, { kind: 'pdf', key: cacheKey, sig, uid: finalUid }); + + const markFailed = (failedUid: string) => + setFailedUidCache(ctx, { kind: 'pdf', key: cacheKey, sig, uid: failedUid }); + + try { + return asJsonResult(await materializeByUid(finalUid)); + } catch (e) { + if (!resolvedUid) { + markFailed(finalUid); + throw e; + } + + deleteUidCache(ctx, { kind: 'pdf', key: cacheKey }); + if (!isRetryableError(e)) { + markFailed(finalUid); + throw e; + } + + const retryUid = (await parsePdfSubmit(pdfPath, { model })).uid; + setSubmittedUidCache(ctx, { kind: 'pdf', key: cacheKey, sig, uid: retryUid }); + try { + return asJsonResult(await materializeByUid(retryUid)); + } catch (retryErr) { + markFailed(retryUid); + throw retryErr; + } + } + }, + ), + ); } diff --git a/src/mcp/registerToolsShared.ts b/src/mcp/registerToolsShared.ts index 06c0852..6806835 100644 --- a/src/mcp/registerToolsShared.ts +++ b/src/mcp/registerToolsShared.ts @@ -6,7 +6,7 @@ import { LRUCache } from 'lru-cache'; import { z } from 'zod'; import { CONVERT_FORMULA_LEVELS, type ConvertFormulaLevel } from '#doc2x/convert'; -import { PARSE_PDF_MODELS, type ParsePdfModel } from '#doc2x/pdf'; +import { PARSE_PDF_MODEL_V2, PARSE_PDF_MODELS, type ParsePdfModel } from '#doc2x/pdf'; import { ToolError } from '#errors'; import { TOOL_ERROR_CODE_INVALID_ARGUMENT } from '#errorCodes'; import { asErrorResult } from '#mcp/results'; @@ -160,8 +160,8 @@ export function sameSig(a: FileSig, b: FileSig): boolean { return a.md5 === b.md5; } -function normalizeParsePdfModel(model?: ParsePdfModel): ParsePdfModel | 'v2' { - return model ?? 'v2'; +function normalizeParsePdfModel(model?: ParsePdfModel): ParsePdfModel { + return model ?? 
PARSE_PDF_MODEL_V2; } export function makePdfUidCacheKey(absPath: string, model?: ParsePdfModel): string { @@ -263,6 +263,12 @@ export const outputPathSchema = absolutePathSchema.describe( 'Absolute path for the output file. The file will be overwritten if it exists.', ); +export const jsonOutputPathSchema = absolutePathSchema + .refine((v) => v.toLowerCase().endsWith('.json'), { + message: "Path must end with '.json'.", + }) + .describe('Absolute path for the output JSON file. The file will be overwritten if it exists.'); + export const doc2xDownloadUrlSchema = z .string() .trim() diff --git a/test/e2e/mcpServer.e2e.test.js b/test/e2e/mcpServer.e2e.test.js index 0d23971..242ffdf 100644 --- a/test/e2e/mcpServer.e2e.test.js +++ b/test/e2e/mcpServer.e2e.test.js @@ -50,6 +50,7 @@ test('stdio e2e: list tools and basic error/result paths', async (t) => { const toolNames = new Set(tools.tools.map((x) => x.name)); assert.ok(toolNames.has('doc2x_debug_config')); assert.ok(toolNames.has('doc2x_parse_pdf_wait_text')); + assert.ok(toolNames.has('doc2x_materialize_pdf_layout_json')); assert.ok(toolNames.has('doc2x_parse_image_layout_wait_text')); const debug = await client.callTool({ name: 'doc2x_debug_config', arguments: {} }); @@ -64,6 +65,14 @@ test('stdio e2e: list tools and basic error/result paths', async (t) => { const pdfWaitPayload = JSON.parse(firstText(pdfWait)); assert.equal(pdfWaitPayload.error.code, TOOL_ERROR_CODE_INVALID_ARGUMENT); + const pdfLayout = await client.callTool({ + name: 'doc2x_materialize_pdf_layout_json', + arguments: { output_path: path.resolve(cwd, 'test/out/layout.json') }, + }); + assert.equal(pdfLayout.isError, true); + const pdfLayoutPayload = JSON.parse(firstText(pdfLayout)); + assert.equal(pdfLayoutPayload.error.code, TOOL_ERROR_CODE_INVALID_ARGUMENT); + const imageWait = await client.callTool({ name: 'doc2x_parse_image_layout_wait_text', arguments: {} }); assert.equal(imageWait.isError, true); const imageWaitPayload = 
JSON.parse(firstText(imageWait)); diff --git a/test/unit/materialize.test.js b/test/unit/materialize.test.js new file mode 100644 index 0000000..6a0fcf4 --- /dev/null +++ b/test/unit/materialize.test.js @@ -0,0 +1,60 @@ +import assert from 'node:assert/strict'; +import fsp from 'node:fs/promises'; +import os from 'node:os'; +import path from 'node:path'; +import test from 'node:test'; + +import { TOOL_ERROR_CODE_INVALID_JSON } from '../../dist/errors/errorCodes.js'; +import { + materializePdfLayoutJson, + validatePdfLayoutResult, +} from '../../dist/doc2x/materialize.js'; + +test('validatePdfLayoutResult accepts result with per-page layout objects', () => { + const out = validatePdfLayoutResult({ + pages: [ + { page_idx: 0, layout: { blocks: [] } }, + { page_idx: 1, layout: { blocks: [{ id: 'b1', type: 'Text' }] } }, + ], + }); + + assert.equal(out.pageCount, 2); + assert.equal(out.hasLayout, true); +}); + +test('validatePdfLayoutResult rejects pages without layout', () => { + assert.throws( + () => + validatePdfLayoutResult({ + pages: [{ page_idx: 0, md: '# no layout' }], + }), + (err) => { + assert.equal(err.code, TOOL_ERROR_CODE_INVALID_JSON); + assert.match(err.message, /pages\[0\]\.layout must be an object/); + return true; + }, + ); +}); + +test('materializePdfLayoutJson writes raw result JSON to output_path', async () => { + const tempDir = await fsp.mkdtemp(path.join(os.tmpdir(), 'doc2x-mcp-layout-')); + const outputPath = path.join(tempDir, 'result.layout.json'); + const result = { + version: 'v1', + pages: [{ page_idx: 0, layout: { blocks: [{ id: 'block-1', type: 'Figure' }] } }], + }; + + const out = await materializePdfLayoutJson({ + uid: 'uid-123', + result, + output_path: outputPath, + }); + + assert.equal(out.uid, 'uid-123'); + assert.equal(out.output_path, outputPath); + assert.equal(out.page_count, 1); + assert.equal(out.has_layout, true); + + const saved = JSON.parse(await fsp.readFile(outputPath, 'utf8')); + assert.deepEqual(saved, result); +}); 
diff --git a/test/unit/registerToolsShared.test.js b/test/unit/registerToolsShared.test.js
index a5a1750..cfb0dd9 100644
--- a/test/unit/registerToolsShared.test.js
+++ b/test/unit/registerToolsShared.test.js
@@ -8,11 +8,13 @@ import {
   doc2xDownloadUrlSchema,
   fileSig,
   getSubmittedUidFromCache,
+  jsonOutputPathSchema,
   makePdfUidCacheKey,
   imagePathSchema,
   makeConvertSubmitKey,
   missingEitherFieldError,
   outputPathSchema,
+  parsePdfModelSchema,
   pdfPathSchema,
   setFailedUidCache,
   setSubmittedUidCache,
@@ -60,6 +62,7 @@ test('path schemas enforce absolute and extension constraints', () => {
   const badImage = path.resolve('/tmp/a.gif');
   const relativeOut = 'tmp/out.md';
   const absoluteOut = path.resolve('/tmp/out.md');
+  const absoluteJsonOut = path.resolve('/tmp/out.json');
 
   assert.equal(pdfPathSchema.safeParse(goodPdf).success, true);
   assert.equal(pdfPathSchema.safeParse(badPdf).success, false);
@@ -71,6 +74,8 @@
   assert.equal(outputPathSchema.safeParse(absoluteOut).success, true);
   assert.equal(outputPathSchema.safeParse(relativeOut).success, false);
+  assert.equal(jsonOutputPathSchema.safeParse(absoluteJsonOut).success, true);
+  assert.equal(jsonOutputPathSchema.safeParse(absoluteOut).success, false);
 });
 
 test('download URL schema only allows http/https', () => {
@@ -79,11 +84,20 @@
   assert.equal(doc2xDownloadUrlSchema.safeParse('ftp://example.com/file').success, false);
 });
 
+test('parse pdf model schema allows explicit v2 and v3-2026', () => {
+  assert.equal(parsePdfModelSchema.safeParse('v2').success, true);
+  assert.equal(parsePdfModelSchema.safeParse('v3-2026').success, true);
+  assert.equal(parsePdfModelSchema.safeParse('v1').success, false);
+});
+
 test('pdf uid cache hits for same signature from test/pdf/test.pdf', async () => {
   const ctx = createRegisterToolsContext();
   const pdfPath = path.resolve(process.cwd(), 'test/pdf/test.pdf');
   const sig1 = await fileSig(pdfPath);
   const key = makePdfUidCacheKey(sig1.absPath);
+  const explicitV2Key = makePdfUidCacheKey(sig1.absPath, 'v2');
+
+  assert.equal(key, explicitV2Key);
 
   assert.equal(getSubmittedUidFromCache(ctx, { kind: 'pdf', key, sig: sig1 }), '');