Merged
51 commits
8254b38
[Doc2X] Add support to img APIs
Menghuan1918 Feb 10, 2025
68dd68c
[Doc2X] Finish the pic deal part
Menghuan1918 Feb 13, 2025
790d8a4
[Doc2X] Fix many bugs in images process
Menghuan1918 Feb 14, 2025
dee113c
[Doc2X] Finished the pic ocr part
Menghuan1918 Feb 19, 2025
83d100c
[Doc2X] Add support for merge_cross_page_forms
Menghuan1918 Feb 19, 2025
9c00c8e
Update to 1.0.3
Menghuan1918 Feb 19, 2025
21a1e85
Add default limit
Menghuan1918 Feb 19, 2025
018e09f
delete ocr
Jun 4, 2025
6c0112f
Revert
guixinW Jun 5, 2025
2ab4794
Fix parse_image_layout in V2: the function previously could not handle multiple pics with the same file name, because zip names must be distinct
guixinW Jun 5, 2025
e4974ed
Keep ocr in pdf2file; add a piclayout interface that wraps ImageProcessor for processing pictures
guixinW Jun 5, 2025
8d5ed90
Add _get_lock so the lock is created only when ImageProcessor starts its event loop
guixinW Jun 5, 2025
8783f3c
Revert
guixinW Jun 5, 2025
e7df1ed
Complete the tests using the new piclayout interface; revise the test logic
guixinW Jun 5, 2025
8e0c0e3
Change the img size limit to <=7MB
guixinW Jun 6, 2025
ad6d838
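Fix a bug where a rate-limited request re-entering the sliding window added both the old and the new timestamp; avoid sleeping inside the sliding window's critical section, which serialized subsequent requests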
guixinW Jun 6, 2025
0a57955
Change the highRPM test count to 31 to check that the endpoint's 30s/30req rate limit is handled correctly
guixinW Jun 6, 2025
0313f97
Fix file overwrites caused by checking for file existence from multiple threads
guixinW Jun 6, 2025
2f03d38
Merge branch 'dev' into main
guixinW Jun 6, 2025
4856662
Merge pull request #73 from guixinW/main
guixinW Jun 6, 2025
7f46251
Unify the interface form of pdf2file and pic2file
guixinW Jun 9, 2025
3167cdb
Add the save_subdir option to pdf2file
guixinW Jun 9, 2025
2bab32b
Fix code issues in doc2x_img.py; add tests for the subdir option
guixinW Jun 9, 2025
00f0eab
Add a json field to pdf2file to save the PDF information directly as a JSON file
guixinW Jun 9, 2025
c7d004a
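Add the export_history keyword to record download status: files already exported are not exported again, and files not yet exported are exported after polling their status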
guixinW Jun 10, 2025
19c1867
Merge pull request #74 from guixinW/main
Muxv Jun 11, 2025
34b7f34
Fix some errors in export_history; stop using asyncio.Lock(loop=self._loop)
guixinW Jun 11, 2025
16fa079
Merge pull request #75 from guixinW/main
Muxv Jun 11, 2025
89993bb
support save_subdir for json
Muxv Jun 11, 2025
d0fffc2
add file_type as suffix to md/tex zip save name
Muxv Jun 11, 2025
5b96ece
simplify piclayout parameters
Muxv Jun 13, 2025
a750abf
add image path to text output
Muxv Jun 13, 2025
c0d8547
add zip path to text output
Muxv Jun 13, 2025
ef25c58
add zip output_format for piclayout
Muxv Jun 14, 2025
6578ed4
add subdir to piclayout and use tmp path to pytest
Muxv Jun 14, 2025
4a02b87
Merge pull request #76 from NoEdgeAI/feat/record
Muxv Jun 14, 2025
bdff859
feat: add v3 preupload model support for v2 client and CLI
HSn0918 Feb 9, 2026
08b4d4c
feat(doc2x): add formula_level enum support for v2 export
HSn0918 Feb 11, 2026
b9d39a5
update: add formula level
HSn0918 Feb 11, 2026
d5fb1e6
fix: update Base_URL to the correct v2 API endpoint
HSn0918 Feb 23, 2026
bbb779a
feat(test): add formula level enum tests for v3 model support
HSn0918 Feb 24, 2026
f296010
feat(test): enhance pic2file tests with v3 model integration and erro…
HSn0918 Feb 24, 2026
b3a4a98
feat(model): add V2 support and enhance normalization for V3 model
HSn0918 Feb 28, 2026
0905810
feat: add v3 json sidecars and crop helpers
circlestarzero Mar 12, 2026
da07b3e
chore: ignore local history and tmp outputs
circlestarzero Mar 12, 2026
2761602
docs: add import examples for v3 crop helpers
circlestarzero Mar 12, 2026
96b7bca
refactor: move v3 crop helpers into package module
circlestarzero Mar 12, 2026
bb0dd88
fix(deps): bump vulnerable dependencies to patched versions
HSn0918 Mar 16, 2026
e1884a5
remove unfinished export_history from public pdf2file API
HSn0918 Mar 16, 2026
4695875
remove: remove uv lock
HSn0918 Mar 16, 2026
e75e99c
chore: drop Python 3.8 support, require >=3.9
HSn0918 Mar 17, 2026
2 changes: 2 additions & 0 deletions .gitignore
@@ -5,6 +5,8 @@ node_modules
my-docs
Output
self_use.py
.history/
tmp/
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
85 changes: 85 additions & 0 deletions README.md
@@ -104,6 +104,8 @@ success, failed, flag = client.pdf2file(
    pdf_file="tests/pdf",
    output_path="./Output",
    output_format="docx",
    model="v3-2026",  # optional, default is server-side v2
    formula_level=1,  # optional: 0 (default/recommended)=keep formulas; 1=inline formulas -> text; 2=all formulas (inline+block) -> text
)
print(success)
print(failed)
@@ -127,4 +129,87 @@ print(failed)
print(flag)
```

### V3 JSON updates

When `model="v3-2026"`:

- `output_format="json"` now saves the raw Doc2X v3 JSON (`result.pages...`) instead of the legacy simplified `[{text, location}]` structure.
- Raw v3 JSON is always saved as a sidecar `.json` file, even when `output_format` does not include `json` (for example `text`, `detailed`, `md`, `docx`).
- If `output_format` includes `json`, the sidecar JSON name follows the `json` slot in `output_names`.
- If `output_format` does not include `json`, the sidecar JSON name follows the first non-empty entry in `output_names`.
- If `output_names` is omitted, the sidecar JSON falls back to the original PDF basename.
- Deprecated direct upload is no longer used. `oss_choose="always"` and `oss_choose="auto"` both use the preupload API. `oss_choose="never"` / `oss_choose="none"` now raises an error.
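
The naming rules above can be sketched as follows. This is an illustration only, not the pdfdeal implementation: `sidecar_json_name` is a hypothetical helper, `output_names` is taken to be the per-file list of names, and the normalization of the suffix to `.json` is an assumption inferred from the `viz.data` -> `viz.json` example in this README.

```python
from pathlib import Path
from typing import Optional, Sequence


def sidecar_json_name(
    output_format: str,
    output_names: Optional[Sequence[str]],
    pdf_path: str,
) -> str:
    """Hypothetical helper: pick the sidecar JSON file name for one PDF."""
    formats = [fmt.strip() for fmt in output_format.split(",")]
    chosen = None
    if output_names:
        if "json" in formats:
            # Rule 1: follow the name in the slot that lines up with `json`.
            slot = formats.index("json")
            if slot < len(output_names):
                chosen = output_names[slot]
        else:
            # Rule 2: follow the first non-empty name.
            chosen = next((name for name in output_names if name), None)
    if not chosen:
        # Rule 3: fall back to the original PDF basename.
        chosen = Path(pdf_path).name
    # Assumption: whatever suffix was given is replaced by `.json`.
    return Path(chosen).stem + ".json"


print(sidecar_json_name("text,json", ["plain.txt", "viz.data"], "tests/pdf/sample.pdf"))
print(sidecar_json_name("md", None, "tests/pdf/sample.pdf"))
```

The first call picks the `json` slot, the second falls back to the PDF basename because no names were given.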

Example:

```python
from pdfdeal import Doc2X

client = Doc2X(apikey="Your API key", debug=True)
success, failed, flag = client.pdf2file(
    pdf_file="tests/pdf/sample.pdf",
    output_path="./Output/test/v3",
    output_format="text,json",
    output_names=[["plain.txt", "viz.data"]],
    model="v3-2026",
)
print(success) # ["page text...", "./Output/test/v3/viz.json"]
print(failed)
print(flag)
```

### Helper scripts for v3 figure/table crops

Two helper scripts were added under [`scripts/`](/Users/cc/work/NoEdgeAI/pdfdeal/scripts):

- [`extract_v3_figures.py`](/Users/cc/work/NoEdgeAI/pdfdeal/scripts/extract_v3_figures.py): extract figure crops from a PDF using Doc2X v3 JSON
- [`extract_v3_tables.py`](/Users/cc/work/NoEdgeAI/pdfdeal/scripts/extract_v3_tables.py): extract table crops from a PDF using Doc2X v3 JSON

Both scripts:

- validate that the v3 JSON matches the crop rules first
- render only pages containing target blocks with `fitz` at the requested `dpi`
- save full-page PNGs under `_pages/`
- crop target regions using the block `bbox/xyxy` and page coordinates from the v3 JSON
- write `manifest.json` with crop metadata
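
The crop step can be pictured as a coordinate scaling from page units to rendered pixels. The sketch below is an assumption about how such a mapping might look, not the code in `pdfdeal.v3_media`: `bbox_to_pixel_box` is a hypothetical helper, and it assumes the block `xyxy` values share the same coordinate system as the page width and height reported in the v3 JSON.

```python
import math
from typing import Sequence, Tuple


def bbox_to_pixel_box(
    xyxy: Sequence[float],
    page_width: float,
    page_height: float,
    image_width: int,
    image_height: int,
) -> Tuple[int, int, int, int]:
    """Hypothetical helper: map a page-space box onto a rendered page image."""
    scale_x = image_width / page_width
    scale_y = image_height / page_height
    x0, y0, x1, y1 = xyxy
    # Round outward so the crop never cuts into the block, then clamp
    # the result to the image bounds.
    left = max(0, math.floor(x0 * scale_x))
    top = max(0, math.floor(y0 * scale_y))
    right = min(image_width, math.ceil(x1 * scale_x))
    bottom = min(image_height, math.ceil(y1 * scale_y))
    return left, top, right, bottom


# A 1000 x 2000 page rendered at twice its size.
print(bbox_to_pixel_box((100, 200, 300, 400), 1000, 2000, 2000, 4000))
```

The resulting tuple has the `(left, top, right, bottom)` order commonly used by image cropping APIs.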

Examples:

```bash
python scripts/extract_v3_figures.py \
--pdf /path/to/input.pdf \
--v3-json /path/to/input_v3.json \
--dpi 200 \
--output-dir ./Output/figures
```

```bash
python scripts/extract_v3_tables.py \
--pdf /path/to/input.pdf \
--v3-json /path/to/input_v3.json \
--dpi 200 \
--output-dir ./Output/tables
```

You can also import the helpers directly:

```python
from pdfdeal import extract_v3_figure_images, extract_v3_table_images

figure_summary = extract_v3_figure_images(
    pdf_path="/path/to/input.pdf",
    v3_json_path="/path/to/input_v3.json",
    dpi=200,
    output_dir="./Output/figures",
)
table_summary = extract_v3_table_images(
    pdf_path="/path/to/input.pdf",
    v3_json_path="/path/to/input_v3.json",
    dpi=200,
    output_dir="./Output/tables",
)
print(figure_summary["crop_count"], figure_summary["manifest_path"])
print(table_summary["crop_count"], table_summary["manifest_path"])
```

See the online documentation for details.
87 changes: 86 additions & 1 deletion README_CN.md
@@ -102,6 +102,8 @@ success, failed, flag = client.pdf2file(
    pdf_file="tests/pdf",
    output_path="./Output",
    output_format="docx",
    model="v3-2026",  # optional; if omitted, the server-side default v2 is used
    formula_level=1,  # optional: 0 (default, recommended) no degradation; 1 degrades inline formulas only (\(...\), $...$); 2 degrades all formulas (including \[...\], $$...$$)
)
print(success)
print(failed)
@@ -125,4 +127,87 @@ print(failed)
print(flag)
```

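See the online documentation for more details.
### V3 JSON updates

When `model="v3-2026"`:

- `output_format="json"` now saves the raw Doc2X v3 JSON (`result.pages...`) and no longer saves the legacy simplified `[{text, location}]` structure.
- Even when `output_format` does not include `json` (for example `text`, `detailed`, `md`, `docx`), an additional sidecar `.json` file is saved.
- If `output_format` includes `json`, the sidecar JSON name follows the `json` slot in `output_names`.
- If `output_format` does not include `json`, the sidecar JSON name follows the first non-empty name in `output_names`.
- If `output_names` is not passed, the sidecar JSON falls back to the original PDF file name.
- The deprecated direct upload for small files is no longer used. `oss_choose="always"` and `oss_choose="auto"` both go through preupload; `oss_choose="never"` / `oss_choose="none"` raises an error immediately.

Example: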

```python
from pdfdeal import Doc2X

client = Doc2X(apikey="Your API key", debug=True)
success, failed, flag = client.pdf2file(
    pdf_file="tests/pdf/sample.pdf",
    output_path="./Output/test/v3",
    output_format="text,json",
    output_names=[["plain.txt", "viz.data"]],
    model="v3-2026",
)
print(success)  # ["page text...", "./Output/test/v3/viz.json"]
print(failed)
print(flag)
```

### V3 figure/table crop helper scripts

Two helper scripts were added under [`scripts/`](/Users/cc/work/NoEdgeAI/pdfdeal/scripts):

- [`extract_v3_figures.py`](/Users/cc/work/NoEdgeAI/pdfdeal/scripts/extract_v3_figures.py): crop figure images from a PDF based on Doc2X v3 JSON
- [`extract_v3_tables.py`](/Users/cc/work/NoEdgeAI/pdfdeal/scripts/extract_v3_tables.py): crop table images from a PDF based on Doc2X v3 JSON

Both scripts:

- first validate that the v3 JSON conforms to the crop rules
- use `fitz` to render only the pages containing target blocks at the specified `dpi`
- save full-page PNGs to `_pages/`
- crop the target regions according to the block `bbox/xyxy` and page coordinates in the v3 JSON
- write a `manifest.json` with crop metadata

Examples:

```bash
python scripts/extract_v3_figures.py \
--pdf /path/to/input.pdf \
--v3-json /path/to/input_v3.json \
--dpi 200 \
--output-dir ./Output/figures
```

```bash
python scripts/extract_v3_tables.py \
--pdf /path/to/input.pdf \
--v3-json /path/to/input_v3.json \
--dpi 200 \
--output-dir ./Output/tables
```

You can also import these helper functions directly:

```python
from pdfdeal import extract_v3_figure_images, extract_v3_table_images

figure_summary = extract_v3_figure_images(
    pdf_path="/path/to/input.pdf",
    v3_json_path="/path/to/input_v3.json",
    dpi=200,
    output_dir="./Output/figures",
)
table_summary = extract_v3_table_images(
    pdf_path="/path/to/input.pdf",
    v3_json_path="/path/to/input_v3.json",
    dpi=200,
    output_dir="./Output/tables",
)
print(figure_summary["crop_count"], figure_summary["manifest_path"])
print(table_summary["crop_count"], table_summary["manifest_path"])
```

See the online documentation for more details.
40 changes: 28 additions & 12 deletions pyproject.toml
@@ -1,22 +1,37 @@
[project]
name = "pdfdeal"
version = "1.0.2"
authors = [{ name = "Menghuan1918", email = "menghuan@menghuan1918.com" }]
description = "A python wrapper for the Doc2X API and comes with native texts processing (to improve texts recall in RAG)."
version = "1.0.4"
authors = [{ name = "noedgeai", email = "support@noedgeai.com" }]
description = "Python SDK for Doc2X API and some native texts processing (to improve texts recall in RAG)."
readme = "README.md"
requires-python = ">=3.8"
requires-python = ">=3.9"
classifiers = [
    "Programming Language :: Python :: 3",
    "License :: OSI Approved :: MIT License",
    "Operating System :: OS Independent",
]
dependencies = ["httpx[http2]>=0.23.1, <1", "pypdf"]
dependencies = [
    "aiofiles>=24.1.0",
    "cryptography>=46.0.5",
    "h2>=4.3.0",
    "httpx[http2]>=0.23.1, <1",
    "pypdf>=6.8.0",
    "pytest>=8.3.5",
    "urllib3>=2.6.3",
]

[project.optional-dependencies]
tools = ["emoji", "Pillow", "reportlab", "beautifulsoup4"]
tools = [
    "emoji",
    "Pillow>=12.1.1; python_version>='3.10'",
    "Pillow>=10.4.0,<12.0.0; python_version<'3.10'",
    "reportlab",
    "beautifulsoup4",
]
rag = [
    "emoji",
    "Pillow",
    "Pillow>=12.1.1; python_version>='3.10'",
    "Pillow>=10.4.0,<12.0.0; python_version<'3.10'",
    "reportlab",
    "oss2",
    "boto3",
@@ -26,7 +41,8 @@ rag = [
dev = [
    "pytest",
    "emoji",
    "Pillow",
    "Pillow>=12.1.1; python_version>='3.10'",
    "Pillow>=10.4.0,<12.0.0; python_version<'3.10'",
    "reportlab",
    "oss2",
    "boto3",
@@ -35,10 +51,10 @@ dev = [
]

[project.urls]
Issues = "https://github.com/Menghuan1918/pdfdeal/issues"
Documentation = "https://menghuan1918.github.io/pdfdeal-docs/"
Source = "https://github.com/Menghuan1918/pdfdeal"
Changelog = "https://menghuan1918.github.io/pdfdeal-docs/changes/"
Issues = "https://github.com/NoEdgeAI/pdfdeal/issues"
Documentation = "https://noedgeai.github.io/pdfdeal-docs"
Source = "https://github.com/NoEdgeAI/pdfdeal"
Changelog = "https://noedgeai.github.io/pdfdeal-docs/changes"

[project.scripts]
doc2x = "pdfdeal.CLI.doc2x:main"
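
Note on the paired `Pillow` entries in this file: each requirement carries an environment marker, so an installer is expected to apply only the entry whose `python_version` condition holds for the running interpreter. The sketch below mimics that selection by hand for illustration; `pillow_requirement` is a hypothetical function, and real installers evaluate the markers themselves.

```python
import sys
from typing import Tuple


def pillow_requirement(python_version: Tuple[int, int]) -> str:
    """Hypothetical helper: return the Pillow constraint that applies."""
    if python_version >= (3, 10):
        # Marker: python_version>='3.10'
        return "Pillow>=12.1.1"
    # Marker: python_version<'3.10'
    return "Pillow>=10.4.0,<12.0.0"


# Constraint that would apply to the interpreter running this snippet.
print(pillow_requirement(sys.version_info[:2]))
```

Splitting the constraint this way keeps Python 3.9 installable while newer interpreters get the higher minimum version.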
14 changes: 14 additions & 0 deletions scripts/extract_v3_figures.py
@@ -0,0 +1,14 @@
#!/usr/bin/env python3
from pathlib import Path
import sys

try:
    from pdfdeal.v3_media import run_cli
except ImportError:  # pragma: no cover - local repo execution fallback
    sys.modules.pop("pdfdeal", None)
    sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "src"))
    from pdfdeal.v3_media import run_cli


if __name__ == "__main__":
    raise SystemExit(run_cli("figure"))
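
Note on the import fallback used by this script: if the installed package cannot provide the module, the cached top-level package is dropped from `sys.modules`, the repository's `src/` directory is put first on `sys.path`, and the import is retried. The same pattern in generic form might look like the sketch below. `import_with_src_fallback` and the module name `demo_fallback_mod` are made up for the demonstration and are not part of pdfdeal.

```python
import importlib
import sys
import tempfile
from pathlib import Path
from types import ModuleType


def import_with_src_fallback(module_name: str, src_dir: Path) -> ModuleType:
    """Hypothetical helper: import a module, retrying from a source directory."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Forget any cached top-level package so the retry can resolve
        # it again from the new search path entry.
        sys.modules.pop(module_name.split(".")[0], None)
        sys.path.insert(0, str(src_dir))
        importlib.invalidate_caches()
        return importlib.import_module(module_name)


# Demonstration with a throwaway module written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "demo_fallback_mod.py").write_text("VALUE = 42\n")
    module = import_with_src_fallback("demo_fallback_mod", Path(tmp))
    print(module.VALUE)
```

Removing the cached package matters when an older installed copy has already been imported and lacks the submodule. Without that step the retry would reuse the cached package instead of the one under the fallback directory.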
14 changes: 14 additions & 0 deletions scripts/extract_v3_tables.py
@@ -0,0 +1,14 @@
#!/usr/bin/env python3
from pathlib import Path
import sys

try:
    from pdfdeal.v3_media import run_cli
except ImportError:  # pragma: no cover - local repo execution fallback
    sys.modules.pop("pdfdeal", None)
    sys.path.insert(0, str(Path(__file__).resolve().parents[1] / "src"))
    from pdfdeal.v3_media import run_cli


if __name__ == "__main__":
    raise SystemExit(run_cli("table"))
23 changes: 23 additions & 0 deletions src/pdfdeal/CLI/doc2x.py
@@ -1,6 +1,7 @@
import argparse
import os
from pdfdeal import Doc2X
from pdfdeal.Doc2X.Types import FormulaLevel, V2ParseModel


def main():
@@ -30,6 +31,26 @@ def main():
        help="The maximum number of pages to process at same time, default is 1000, DO NOT set if you don't know",
        required=False,
    )
    parser.add_argument(
        "--model",
        help='Upload model for v2 preupload API, e.g. "v3-2026". Leave empty to use server default v2.',
        required=False,
        choices=[model.value for model in V2ParseModel],
    )
    parser.add_argument(
        "--formula_level",
        help=(
            'Formula degradation level for v2 export body. '
            '0 (default, recommended)=keep original formulas; '
            '1=degrade inline formulas (\\(...\\), $...$); '
            '2=degrade all formulas including block formulas (\\[...\\], $$...$$). '
            'Only effective when --model is "v3-2026".'
        ),
        required=False,
        type=int,
        choices=[level.value for level in FormulaLevel],
        default=FormulaLevel.KEEP_MARKDOWN.value,
    )
    parser.add_argument(
        "-o",
        "--output",
@@ -99,6 +120,8 @@ def main():
        pdf_file=filename,
        output_path=output,
        output_format=format,
        model=args.model,
        formula_level=args.formula_level,
    )

    for file in success:
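
Note on the two new CLI flags: both derive their `choices` from enums, so argparse rejects unknown values before any request is made. The standalone sketch below reproduces that wiring with stand-in enums. Only `KEEP_MARKDOWN` and the `"v3-2026"` value appear in this diff; the other member names (`V3_2026`, `INLINE_TO_TEXT`, `ALL_TO_TEXT`) are hypothetical, and the real definitions live in `pdfdeal.Doc2X.Types`.

```python
import argparse
from enum import Enum, IntEnum


class V2ParseModel(str, Enum):
    """Stand-in enum; only the "v3-2026" value is taken from the diff."""
    V3_2026 = "v3-2026"  # member name is hypothetical


class FormulaLevel(IntEnum):
    """Stand-in enum; names other than KEEP_MARKDOWN are hypothetical."""
    KEEP_MARKDOWN = 0
    INLINE_TO_TEXT = 1
    ALL_TO_TEXT = 2


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="doc2x-sketch")
    parser.add_argument(
        "--model",
        required=False,
        choices=[model.value for model in V2ParseModel],
    )
    parser.add_argument(
        "--formula_level",
        required=False,
        type=int,
        choices=[level.value for level in FormulaLevel],
        default=FormulaLevel.KEEP_MARKDOWN.value,
    )
    return parser


args = build_parser().parse_args(["--model", "v3-2026", "--formula_level", "2"])
print(args.model, args.formula_level)
```

When `--model` is omitted, `args.model` is `None`, which matches the help text's description of leaving it empty to use the server default.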