From a513bffebf49c48085506665af292239aba3735a Mon Sep 17 00:00:00 2001
From: brown <1041206149@qq.com>
Date: Fri, 3 Apr 2026 14:26:10 +0800
Subject: [PATCH 1/4] docs: Update README

---
 README.md    | 24 ++++++++++++++++++++++++
 README_zh.md | 24 ++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/README.md b/README.md
index 56e0bf7..9aad153 100644
--- a/README.md
+++ b/README.md
@@ -92,6 +92,30 @@ All scores are in **[0, 1]**; higher is better.
 
 ### ROUGE-N F1 on Full Dataset (7,809 samples)
 
+**How to reproduce:** Use the evaluation scripts in the [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML) repository:
+
+```bash
+# Clone MinerU-HTML and prepare the full dataset (WebMainBench_7809.jsonl)
+git clone https://github.com/opendatalab/MinerU-HTML.git
+cd MinerU-HTML
+
+# Run evaluation (example for MinerU-HTML extractor)
+python eval_baselines.py \
+  --bench benchmark/WebMainBench_7809.jsonl \
+  --task_dir benchmark_results/mineru_html-html-md \
+  --extractor_name mineru_html-html-md \
+  --model_path YOUR_MODEL_PATH \
+  --default_config gpu
+
+# For CPU-based extractors (e.g. trafilatura, resiliparse, magic-html)
+python eval_baselines.py \
+  --bench benchmark/WebMainBench_7809.jsonl \
+  --task_dir benchmark_results/trafilatura-html-md \
+  --extractor_name trafilatura-html-md
+```
+
+Results are written to `benchmark_results//mean_eval_result.json`. See `run_eval.sh` for a complete multi-extractor example.
+
 Results from the [Dripper paper](https://arxiv.org/abs/2511.23119) (Table 2):
 
 | Extractor | Mode | All | Simple | Mid | Hard |
diff --git a/README_zh.md b/README_zh.md
index 19e225c..f83582e 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -92,6 +92,30 @@ WebMainBench 支持两套互补的评测协议:
 
 ### ROUGE-N F1 — 全量数据集(7,809 条)
 
+**复现方法:** 使用 [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML) 仓库中的评测脚本:
+
+```bash
+# 克隆 MinerU-HTML 并准备全量数据集(WebMainBench_7809.jsonl)
+git clone https://github.com/opendatalab/MinerU-HTML.git
+cd MinerU-HTML
+
+# 运行评测(以 MinerU-HTML 抽取器为例)
+python eval_baselines.py \
+  --bench benchmark/WebMainBench_7809.jsonl \
+  --task_dir benchmark_results/mineru_html-html-md \
+  --extractor_name mineru_html-html-md \
+  --model_path YOUR_MODEL_PATH \
+  --default_config gpu
+
+# 对于基于 CPU 的抽取器(如 trafilatura、resiliparse、magic-html)
+python eval_baselines.py \
+  --bench benchmark/WebMainBench_7809.jsonl \
+  --task_dir benchmark_results/trafilatura-html-md \
+  --extractor_name trafilatura-html-md
+```
+
+结果写入 `benchmark_results//mean_eval_result.json`。完整的多抽取器示例见 `run_eval.sh`。
+
 来自 [Dripper 论文](https://arxiv.org/abs/2511.23119)(表 2):
 
 | 抽取器 | 模式 | All | Simple | Mid | Hard |

From 07cf3dfffe125750555af11b11fba166263bd902 Mon Sep 17 00:00:00 2001
From: brown <1041206149@qq.com>
Date: Fri, 3 Apr 2026 17:44:22 +0800
Subject: [PATCH 2/4] docs: Update README

---
 README.md    | 2 +-
 README_zh.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 9aad153..1a8fe18 100644
--- a/README.md
+++ b/README.md
@@ -92,7 +92,7 @@ All scores are in **[0, 1]**; higher is better.
 
 ### ROUGE-N F1 on Full Dataset (7,809 samples)
 
-**How to reproduce:** Use the evaluation scripts in the [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML) repository:
+**Execution Method:** Use the evaluation scripts in the [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML) repository:
 
 ```bash
 # Clone MinerU-HTML and prepare the full dataset (WebMainBench_7809.jsonl)
diff --git a/README_zh.md b/README_zh.md
index f83582e..911c283 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -92,7 +92,7 @@ WebMainBench 支持两套互补的评测协议:
 
 ### ROUGE-N F1 — 全量数据集(7,809 条)
 
-**复现方法:** 使用 [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML) 仓库中的评测脚本:
+**执行方法:** 使用 [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML) 仓库中的评测脚本:
 
 ```bash
 # 克隆 MinerU-HTML 并准备全量数据集(WebMainBench_7809.jsonl)

From 64a81ee91eac1cd068699435c4388a3be726999c Mon Sep 17 00:00:00 2001
From: brown <1041206149@qq.com>
Date: Fri, 3 Apr 2026 19:14:48 +0800
Subject: [PATCH 3/4] docs: Update README

---
 README.md    | 67 +++++++++++++++++++++++++++++++---------------------
 README_zh.md | 67 +++++++++++++++++++++++++++++++---------------------
 2 files changed, 80 insertions(+), 54 deletions(-)

diff --git a/README.md b/README.md
index 1a8fe18..31a4540 100644
--- a/README.md
+++ b/README.md
@@ -92,30 +92,6 @@ All scores are in **[0, 1]**; higher is better.
 
 ### ROUGE-N F1 on Full Dataset (7,809 samples)
 
-**Execution Method:** Use the evaluation scripts in the [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML) repository:
-
-```bash
-# Clone MinerU-HTML and prepare the full dataset (WebMainBench_7809.jsonl)
-git clone https://github.com/opendatalab/MinerU-HTML.git
-cd MinerU-HTML
-
-# Run evaluation (example for MinerU-HTML extractor)
-python eval_baselines.py \
-  --bench benchmark/WebMainBench_7809.jsonl \
-  --task_dir benchmark_results/mineru_html-html-md \
-  --extractor_name mineru_html-html-md \
-  --model_path YOUR_MODEL_PATH \
-  --default_config gpu
-
-# For CPU-based extractors (e.g. trafilatura, resiliparse, magic-html)
-python eval_baselines.py \
-  --bench benchmark/WebMainBench_7809.jsonl \
-  --task_dir benchmark_results/trafilatura-html-md \
-  --extractor_name trafilatura-html-md
-```
-
-Results are written to `benchmark_results//mean_eval_result.json`. See `run_eval.sh` for a complete multi-extractor example.
-
 Results from the [Dripper paper](https://arxiv.org/abs/2511.23119) (Table 2):
 
 | Extractor | Mode | All | Simple | Mid | Hard |
@@ -140,6 +116,15 @@ The dataset is hosted on Hugging Face: [opendatalab/WebMainBench](https://huggin
 ```python
 from huggingface_hub import hf_hub_download
 
+# Full dataset (7,809 samples) — used for ROUGE-N F1 evaluation
+hf_hub_download(
+    repo_id="opendatalab/WebMainBench",
+    repo_type="dataset",
+    filename="WebMainBench_7809.jsonl",
+    local_dir="data/",
+)
+
+# 545-sample subset — used for Fine-Grained Edit-Distance Metrics evaluation
 hf_hub_download(
     repo_id="opendatalab/WebMainBench",
     repo_type="dataset",
@@ -172,7 +157,35 @@ hf_hub_download(
 )
 ```
 
-### Configure LLM (Optional)
+### ROUGE-N F1 Evaluation (WebMainBench_7809.jsonl)
+
+Use the evaluation scripts in the [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML) repository:
+
+```bash
+# Clone MinerU-HTML and prepare the full dataset (WebMainBench_7809.jsonl)
+git clone https://github.com/opendatalab/MinerU-HTML.git
+cd MinerU-HTML
+
+# Run evaluation (example for MinerU-HTML extractor)
+python eval_baselines.py \
+  --bench benchmark/WebMainBench_7809.jsonl \
+  --task_dir benchmark_results/mineru_html-html-md \
+  --extractor_name mineru_html-html-md \
+  --model_path YOUR_MODEL_PATH \
+  --default_config gpu
+
+# For CPU-based extractors (e.g. trafilatura, resiliparse, magic-html)
+python eval_baselines.py \
+  --bench benchmark/WebMainBench_7809.jsonl \
+  --task_dir benchmark_results/trafilatura-html-md \
+  --extractor_name trafilatura-html-md
+```
+
+Results are written to `benchmark_results//mean_eval_result.json`. See `run_eval.sh` for a complete multi-extractor example.
+
+### Fine-Grained Edit-Distance Metrics Evaluation (WebMainBench_545.jsonl)
+
+#### Configure LLM (Optional)
 
 LLM-enhanced content splitting improves formula/table/code extraction accuracy. To enable it, copy `.env.example` to `.env` and fill in your API credentials:
 
@@ -181,7 +194,7 @@ cp .env.example .env
 # Edit .env and set LLM_BASE_URL, LLM_API_KEY, LLM_MODEL
 ```
 
-### Run an Evaluation
+#### Run an Evaluation
 
 ```python
 from webmainbench import DataLoader, Evaluator, ExtractorFactory
@@ -194,7 +207,7 @@ m = result.overall_metrics
 print(f"Overall Score: {result.overall_metrics['overall']:.4f}")
 ```
 
-### Compare Multiple Extractors
+#### Compare Multiple Extractors
 
 ```python
 extractors = ["trafilatura", "resiliparse", "magic-html"]
diff --git a/README_zh.md b/README_zh.md
index 911c283..8092d18 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -92,30 +92,6 @@ WebMainBench 支持两套互补的评测协议:
 
 ### ROUGE-N F1 — 全量数据集(7,809 条)
 
-**执行方法:** 使用 [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML) 仓库中的评测脚本:
-
-```bash
-# 克隆 MinerU-HTML 并准备全量数据集(WebMainBench_7809.jsonl)
-git clone https://github.com/opendatalab/MinerU-HTML.git
-cd MinerU-HTML
-
-# 运行评测(以 MinerU-HTML 抽取器为例)
-python eval_baselines.py \
-  --bench benchmark/WebMainBench_7809.jsonl \
-  --task_dir benchmark_results/mineru_html-html-md \
-  --extractor_name mineru_html-html-md \
-  --model_path YOUR_MODEL_PATH \
-  --default_config gpu
-
-# 对于基于 CPU 的抽取器(如 trafilatura、resiliparse、magic-html)
-python eval_baselines.py \
-  --bench benchmark/WebMainBench_7809.jsonl \
-  --task_dir benchmark_results/trafilatura-html-md \
-  --extractor_name trafilatura-html-md
-```
-
-结果写入 `benchmark_results//mean_eval_result.json`。完整的多抽取器示例见 `run_eval.sh`。
-
 来自 [Dripper 论文](https://arxiv.org/abs/2511.23119)(表 2):
 
 | 抽取器 | 模式 | All | Simple | Mid | Hard |
@@ -140,6 +116,15 @@ pip install -e .
 ```python
 from huggingface_hub import hf_hub_download
 
+# 全量数据集(7,809 条)— 用于 ROUGE-N F1 评测
+hf_hub_download(
+    repo_id="opendatalab/WebMainBench",
+    repo_type="dataset",
+    filename="WebMainBench_7809.jsonl",
+    local_dir="data/",
+)
+
+# 545 条样本子集 — 用于细粒度编辑距离指标评测
 hf_hub_download(
     repo_id="opendatalab/WebMainBench",
     repo_type="dataset",
@@ -172,7 +157,35 @@ hf_hub_download(
 )
 ```
 
-### 配置 LLM(可选)
+### ROUGE-N F1 评测(WebMainBench_7809.jsonl)
+
+使用 [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML) 仓库中的评测脚本:
+
+```bash
+# 克隆 MinerU-HTML 并准备全量数据集(WebMainBench_7809.jsonl)
+git clone https://github.com/opendatalab/MinerU-HTML.git
+cd MinerU-HTML
+
+# 运行评测(以 MinerU-HTML 抽取器为例)
+python eval_baselines.py \
+  --bench benchmark/WebMainBench_7809.jsonl \
+  --task_dir benchmark_results/mineru_html-html-md \
+  --extractor_name mineru_html-html-md \
+  --model_path YOUR_MODEL_PATH \
+  --default_config gpu
+
+# 对于基于 CPU 的抽取器(如 trafilatura、resiliparse、magic-html)
+python eval_baselines.py \
+  --bench benchmark/WebMainBench_7809.jsonl \
+  --task_dir benchmark_results/trafilatura-html-md \
+  --extractor_name trafilatura-html-md
+```
+
+结果写入 `benchmark_results//mean_eval_result.json`。完整的多抽取器示例见 `run_eval.sh`。
+
+### 细粒度编辑距离指标评测(WebMainBench_545.jsonl)
+
+#### 配置 LLM(可选)
 
 LLM 增强内容拆分可提升公式/表格/代码的抽取精度。如需启用,将 `.env.example` 复制为 `.env` 并填写 API 信息:
 
@@ -181,7 +194,7 @@ cp .env.example .env
 # 编辑 .env,设置 LLM_BASE_URL、LLM_API_KEY、LLM_MODEL
 ```
 
-### 运行评测
+#### 运行评测
 
 ```python
 from webmainbench import DataLoader, Evaluator, ExtractorFactory
@@ -194,7 +207,7 @@ m = result.overall_metrics
 print(f"Overall Score: {result.overall_metrics['overall']:.4f}")
 ```
 
-### 多抽取器对比
+#### 多抽取器对比
 
 ```python
 extractors = ["trafilatura", "resiliparse", "magic-html"]

From 3b617f3fce84fa907c774ddbe9fb64478edce343 Mon Sep 17 00:00:00 2001
From: brown <1041206149@qq.com>
Date: Fri, 3 Apr 2026 19:20:17 +0800
Subject: [PATCH 4/4] docs: Update README

---
 README.md    | 10 +++++-----
 README_zh.md | 10 +++++-----
 2 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/README.md b/README.md
index 31a4540..71d6779 100644
--- a/README.md
+++ b/README.md
@@ -144,7 +144,7 @@ from huggingface_hub import hf_hub_download
 hf_hub_download(
     repo_id="opendatalab/WebMainBench",
     repo_type="dataset",
-    filename="WebMainBench_7809.jsonl",
+    filename="webmainbench.jsonl",
     local_dir="data/",
 )
 
@@ -157,18 +157,18 @@ hf_hub_download(
 )
 ```
 
-### ROUGE-N F1 Evaluation (WebMainBench_7809.jsonl)
+### ROUGE-N F1 Evaluation (webmainbench.jsonl)
 
 Use the evaluation scripts in the [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML) repository:
 
 ```bash
-# Clone MinerU-HTML and prepare the full dataset (WebMainBench_7809.jsonl)
+# Clone MinerU-HTML and prepare the full dataset (webmainbench.jsonl)
 git clone https://github.com/opendatalab/MinerU-HTML.git
 cd MinerU-HTML
 
 # Run evaluation (example for MinerU-HTML extractor)
 python eval_baselines.py \
-  --bench benchmark/WebMainBench_7809.jsonl \
+  --bench benchmark/webmainbench.jsonl \
   --task_dir benchmark_results/mineru_html-html-md \
   --extractor_name mineru_html-html-md \
   --model_path YOUR_MODEL_PATH \
@@ -176,7 +176,7 @@ python eval_baselines.py \
 
 # For CPU-based extractors (e.g. trafilatura, resiliparse, magic-html)
 python eval_baselines.py \
-  --bench benchmark/WebMainBench_7809.jsonl \
+  --bench benchmark/webmainbench.jsonl \
   --task_dir benchmark_results/trafilatura-html-md \
   --extractor_name trafilatura-html-md
 ```
diff --git a/README_zh.md b/README_zh.md
index 8092d18..0601bdc 100644
--- a/README_zh.md
+++ b/README_zh.md
@@ -144,7 +144,7 @@ from huggingface_hub import hf_hub_download
 hf_hub_download(
     repo_id="opendatalab/WebMainBench",
     repo_type="dataset",
-    filename="WebMainBench_7809.jsonl",
+    filename="webmainbench.jsonl",
     local_dir="data/",
 )
 
@@ -157,18 +157,18 @@ hf_hub_download(
 )
 ```
 
-### ROUGE-N F1 评测(WebMainBench_7809.jsonl)
+### ROUGE-N F1 评测(webmainbench.jsonl)
 
 使用 [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML) 仓库中的评测脚本:
 
 ```bash
-# 克隆 MinerU-HTML 并准备全量数据集(WebMainBench_7809.jsonl)
+# 克隆 MinerU-HTML 并准备全量数据集(webmainbench.jsonl)
 git clone https://github.com/opendatalab/MinerU-HTML.git
 cd MinerU-HTML
 
 # 运行评测(以 MinerU-HTML 抽取器为例)
 python eval_baselines.py \
-  --bench benchmark/WebMainBench_7809.jsonl \
+  --bench benchmark/webmainbench.jsonl \
   --task_dir benchmark_results/mineru_html-html-md \
   --extractor_name mineru_html-html-md \
   --model_path YOUR_MODEL_PATH \
@@ -176,7 +176,7 @@ python eval_baselines.py \
 
 # 对于基于 CPU 的抽取器(如 trafilatura、resiliparse、magic-html)
 python eval_baselines.py \
-  --bench benchmark/WebMainBench_7809.jsonl \
+  --bench benchmark/webmainbench.jsonl \
   --task_dir benchmark_results/trafilatura-html-md \
   --extractor_name trafilatura-html-md
 ```
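
Since PATCH 4/4 renames the full-dataset file to `webmainbench.jsonl`, a quick check after the `hf_hub_download` step may help confirm that the downloaded file is the 7,809-sample set the README heading refers to. The following is a minimal sketch, not part of the patched READMEs: the helper name `count_jsonl_records` is hypothetical, the path `data/webmainbench.jsonl` is assumed from `local_dir="data/"` plus the new filename, and the only property checked is that each non-empty line parses as JSON.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Count non-empty lines in a JSONL file, checking that each parses as JSON.

    Hypothetical helper for illustration; it is not part of WebMainBench or MinerU-HTML.
    """
    count = 0
    with Path(path).open(encoding="utf-8") as fh:
        for line_number, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}: line {line_number} is not valid JSON") from exc
            count += 1
    return count


if __name__ == "__main__":
    # Assumed location: local_dir="data/" plus the filename from PATCH 4/4.
    dataset = Path("data/webmainbench.jsonl")
    expected = 7809  # sample count stated in the README section heading
    if dataset.exists():
        found = count_jsonl_records(dataset)
        status = "OK" if found == expected else "MISMATCH"
        print(f"{dataset}: {found} records (expected {expected}) -> {status}")
    else:
        print(f"{dataset} not found; run the hf_hub_download step first")
```

Note that the evaluation commands read `benchmark/webmainbench.jsonl` inside the MinerU-HTML checkout, while the download snippet writes to `data/`, so the file presumably has to be copied or downloaded into `benchmark/` before running `eval_baselines.py`.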