Merged
43 changes: 40 additions & 3 deletions README.md
@@ -140,6 +140,15 @@
The dataset is hosted on Hugging Face: [opendatalab/WebMainBench](https://huggin
```python
from huggingface_hub import hf_hub_download

# Full dataset (7,809 samples) — used for ROUGE-N F1 evaluation
hf_hub_download(
repo_id="opendatalab/WebMainBench",
repo_type="dataset",
filename="webmainbench.jsonl",
local_dir="data/",
)

# 545-sample subset — used for Fine-Grained Edit-Distance Metrics evaluation
hf_hub_download(
repo_id="opendatalab/WebMainBench",
repo_type="dataset",
    filename="WebMainBench_545.jsonl",
    local_dir="data/",
)
```
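Both files are in JSON Lines format, one sample per line, so a quick sanity check after downloading needs only the standard library (a minimal sketch, not part of the benchmark tooling):

```python
import json

def count_samples(jsonl_path: str) -> int:
    """Count the records in a JSON Lines file, ignoring blank lines."""
    with open(jsonl_path, encoding="utf-8") as f:
        # json.loads raises on a corrupt line, so a successful run
        # also validates that every record parses.
        return sum(1 for line in f if line.strip() and json.loads(line) is not None)
```

`webmainbench.jsonl` should yield 7,809 and the 545-sample subset 545; a mismatch suggests a truncated download.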

### ROUGE-N F1 Evaluation (webmainbench.jsonl)

Use the evaluation scripts in the [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML) repository:

```bash
# Clone MinerU-HTML and prepare the full dataset (webmainbench.jsonl)
git clone https://github.com/opendatalab/MinerU-HTML.git
cd MinerU-HTML

# Run evaluation (example for MinerU-HTML extractor)
python eval_baselines.py \
--bench benchmark/webmainbench.jsonl \
--task_dir benchmark_results/mineru_html-html-md \
--extractor_name mineru_html-html-md \
--model_path YOUR_MODEL_PATH \
--default_config gpu

# For CPU-based extractors (e.g. trafilatura, resiliparse, magic-html)
python eval_baselines.py \
--bench benchmark/webmainbench.jsonl \
--task_dir benchmark_results/trafilatura-html-md \
--extractor_name trafilatura-html-md
```

Results are written to `benchmark_results/<extractor>/mean_eval_result.json`. See `run_eval.sh` for a complete multi-extractor example.
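Since each extractor writes its own `mean_eval_result.json`, comparing several runs is just a matter of merging those files. A minimal standard-library sketch (the metric keys inside the JSON are not documented here, so the code passes through whatever each file contains):

```python
import json
from pathlib import Path

def collect_mean_results(results_dir: str, extractors: list[str]) -> dict:
    """Map each extractor name to its parsed mean_eval_result.json, skipping absent runs."""
    results = {}
    for name in extractors:
        path = Path(results_dir) / name / "mean_eval_result.json"
        if path.is_file():
            results[name] = json.loads(path.read_text(encoding="utf-8"))
    return results
```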

### Fine-Grained Edit-Distance Metrics Evaluation (WebMainBench_545.jsonl)

#### Configure LLM (Optional)

LLM-enhanced content splitting improves formula/table/code extraction accuracy. To enable it, copy `.env.example` to `.env` and fill in your API credentials:

@@ -157,7 +194,7 @@
```bash
cp .env.example .env
# Edit .env and set LLM_BASE_URL, LLM_API_KEY, LLM_MODEL
```
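The three variables have to reach the process environment for the evaluator to pick them up, so it can help to fail fast on an incomplete `.env` (a sketch for illustration; how webmainbench itself loads these settings is not shown here, and the helper name is hypothetical):

```python
import os

REQUIRED_LLM_KEYS = ("LLM_BASE_URL", "LLM_API_KEY", "LLM_MODEL")

def check_llm_config() -> dict:
    """Return the LLM settings from the environment, raising if any are unset."""
    config = {key: os.environ.get(key) for key in REQUIRED_LLM_KEYS}
    missing = [key for key, value in config.items() if not value]
    if missing:
        raise RuntimeError(f"Missing LLM settings: {', '.join(missing)}")
    return config
```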

#### Run an Evaluation

```python
from webmainbench import DataLoader, Evaluator, ExtractorFactory
@@ -170,7 +207,7 @@
m = result.overall_metrics
print(f"Overall Score: {result.overall_metrics['overall']:.4f}")
```

#### Compare Multiple Extractors

```python
extractors = ["trafilatura", "resiliparse", "magic-html"]
```
43 changes: 40 additions & 3 deletions README_zh.md
@@ -140,6 +140,15 @@
pip install -e .
```python
from huggingface_hub import hf_hub_download

# Full dataset (7,809 samples) — used for ROUGE-N F1 evaluation
hf_hub_download(
repo_id="opendatalab/WebMainBench",
repo_type="dataset",
filename="webmainbench.jsonl",
local_dir="data/",
)

# 545-sample subset — used for Fine-Grained Edit-Distance Metrics evaluation
hf_hub_download(
repo_id="opendatalab/WebMainBench",
repo_type="dataset",
    filename="WebMainBench_545.jsonl",
    local_dir="data/",
)
```

### ROUGE-N F1 Evaluation (webmainbench.jsonl)

Use the evaluation scripts in the [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML) repository:

```bash
# Clone MinerU-HTML and prepare the full dataset (webmainbench.jsonl)
git clone https://github.com/opendatalab/MinerU-HTML.git
cd MinerU-HTML

# Run evaluation (example for the MinerU-HTML extractor)
python eval_baselines.py \
--bench benchmark/webmainbench.jsonl \
--task_dir benchmark_results/mineru_html-html-md \
--extractor_name mineru_html-html-md \
--model_path YOUR_MODEL_PATH \
--default_config gpu

# For CPU-based extractors (e.g. trafilatura, resiliparse, magic-html)
python eval_baselines.py \
--bench benchmark/webmainbench.jsonl \
--task_dir benchmark_results/trafilatura-html-md \
--extractor_name trafilatura-html-md
```

Results are written to `benchmark_results/<extractor>/mean_eval_result.json`. See `run_eval.sh` for a complete multi-extractor example.

### Fine-Grained Edit-Distance Metrics Evaluation (WebMainBench_545.jsonl)

#### Configure LLM (Optional)

LLM-enhanced content splitting improves formula/table/code extraction accuracy. To enable it, copy `.env.example` to `.env` and fill in your API credentials:

@@ -157,7 +194,7 @@
```bash
cp .env.example .env
# Edit .env and set LLM_BASE_URL, LLM_API_KEY, LLM_MODEL
```

#### Run an Evaluation

```python
from webmainbench import DataLoader, Evaluator, ExtractorFactory
@@ -170,7 +207,7 @@
m = result.overall_metrics
print(f"Overall Score: {result.overall_metrics['overall']:.4f}")
```

#### Compare Multiple Extractors

```python
extractors = ["trafilatura", "resiliparse", "magic-html"]
```