Merged
43 changes: 40 additions & 3 deletions README.md
@@ -140,6 +140,15 @@
The dataset is hosted on Hugging Face: [opendatalab/WebMainBench](https://huggin
```python
from huggingface_hub import hf_hub_download

# Full dataset (7,809 samples) — used for ROUGE-N F1 evaluation
hf_hub_download(
repo_id="opendatalab/WebMainBench",
repo_type="dataset",
filename="webmainbench.jsonl",
local_dir="data/",
)

# 545-sample subset — used for Fine-Grained Edit-Distance Metrics evaluation
hf_hub_download(
repo_id="opendatalab/WebMainBench",
repo_type="dataset",
    filename="WebMainBench_545.jsonl",
    local_dir="data/",
)
```
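Both files are in JSON Lines format, one sample per line, so a quick sanity check after downloading needs only the standard library (a minimal sketch, not part of the benchmark tooling):

```python
import json

def count_samples(jsonl_path: str) -> int:
    """Count the records in a JSON Lines file, ignoring blank lines."""
    with open(jsonl_path, encoding="utf-8") as f:
        # json.loads raises on a corrupt line, so a successful run
        # also validates that every record parses.
        return sum(1 for line in f if line.strip() and json.loads(line) is not None)
```

`webmainbench.jsonl` should yield 7,809 and the 545-sample subset 545; a mismatch suggests a truncated download.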

### ROUGE-N F1 Evaluation (webmainbench.jsonl)

Use the evaluation scripts in the [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML) repository:

```bash
# Clone MinerU-HTML and prepare the full dataset (webmainbench.jsonl)
git clone https://github.com/opendatalab/MinerU-HTML.git
cd MinerU-HTML

# Run evaluation (example for MinerU-HTML extractor)
python eval_baselines.py \
--bench benchmark/webmainbench.jsonl \
--task_dir benchmark_results/mineru_html-html-md \
--extractor_name mineru_html-html-md \
--model_path YOUR_MODEL_PATH \
--default_config gpu

# For CPU-based extractors (e.g. trafilatura, resiliparse, magic-html)
python eval_baselines.py \
--bench benchmark/webmainbench.jsonl \
--task_dir benchmark_results/trafilatura-html-md \
--extractor_name trafilatura-html-md
```

Results are written to `benchmark_results/<extractor>/mean_eval_result.json`. See `run_eval.sh` for a complete multi-extractor example.
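Since each extractor writes its own `mean_eval_result.json`, comparing several runs is just a matter of merging those files. A minimal standard-library sketch (the metric keys inside the JSON are not documented here, so the code passes through whatever each file contains):

```python
import json
from pathlib import Path

def collect_mean_results(results_dir: str, extractors: list[str]) -> dict:
    """Map each extractor name to its parsed mean_eval_result.json, skipping absent runs."""
    results = {}
    for name in extractors:
        path = Path(results_dir) / name / "mean_eval_result.json"
        if path.is_file():
            results[name] = json.loads(path.read_text(encoding="utf-8"))
    return results
```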

### Fine-Grained Edit-Distance Metrics Evaluation (WebMainBench_545.jsonl)

#### Configure LLM (Optional)

LLM-enhanced content splitting improves formula/table/code extraction accuracy. To enable it, copy `.env.example` to `.env` and fill in your API credentials:

@@ -157,7 +194,7 @@
```bash
cp .env.example .env
# Edit .env and set LLM_BASE_URL, LLM_API_KEY, LLM_MODEL
```
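The three variables have to reach the process environment for the evaluator to pick them up, so it can help to fail fast on an incomplete `.env` (a sketch for illustration; how webmainbench itself loads these settings is not shown here, and the helper name is hypothetical):

```python
import os

REQUIRED_LLM_KEYS = ("LLM_BASE_URL", "LLM_API_KEY", "LLM_MODEL")

def check_llm_config() -> dict:
    """Return the LLM settings from the environment, raising if any are unset."""
    config = {key: os.environ.get(key) for key in REQUIRED_LLM_KEYS}
    missing = [key for key, value in config.items() if not value]
    if missing:
        raise RuntimeError(f"Missing LLM settings: {', '.join(missing)}")
    return config
```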

#### Run an Evaluation

```python
from webmainbench import DataLoader, Evaluator, ExtractorFactory
@@ -170,7 +207,7 @@
m = result.overall_metrics
print(f"Overall Score: {result.overall_metrics['overall']:.4f}")
```

#### Compare Multiple Extractors

```python
extractors = ["trafilatura", "resiliparse", "magic-html"]
```
43 changes: 40 additions & 3 deletions README_zh.md
@@ -140,6 +140,15 @@
pip install -e .
```python
from huggingface_hub import hf_hub_download

# Full dataset (7,809 samples) — used for ROUGE-N F1 evaluation
hf_hub_download(
repo_id="opendatalab/WebMainBench",
repo_type="dataset",
filename="webmainbench.jsonl",
local_dir="data/",
)

# 545-sample subset — used for Fine-Grained Edit-Distance Metrics evaluation
hf_hub_download(
repo_id="opendatalab/WebMainBench",
repo_type="dataset",
    filename="WebMainBench_545.jsonl",
    local_dir="data/",
)
```

### ROUGE-N F1 Evaluation (webmainbench.jsonl)

Use the evaluation scripts in the [MinerU-HTML](https://github.com/opendatalab/MinerU-HTML) repository:

```bash
# Clone MinerU-HTML and prepare the full dataset (webmainbench.jsonl)
git clone https://github.com/opendatalab/MinerU-HTML.git
cd MinerU-HTML

# Run evaluation (example for the MinerU-HTML extractor)
python eval_baselines.py \
--bench benchmark/webmainbench.jsonl \
--task_dir benchmark_results/mineru_html-html-md \
--extractor_name mineru_html-html-md \
--model_path YOUR_MODEL_PATH \
--default_config gpu

# For CPU-based extractors (e.g. trafilatura, resiliparse, magic-html)
python eval_baselines.py \
--bench benchmark/webmainbench.jsonl \
--task_dir benchmark_results/trafilatura-html-md \
--extractor_name trafilatura-html-md
```

Results are written to `benchmark_results/<extractor>/mean_eval_result.json`. See `run_eval.sh` for a complete multi-extractor example.

### Fine-Grained Edit-Distance Metrics Evaluation (WebMainBench_545.jsonl)

#### Configure LLM (Optional)

LLM-enhanced content splitting improves formula/table/code extraction accuracy. To enable it, copy `.env.example` to `.env` and fill in your API credentials:

@@ -157,7 +194,7 @@
```bash
cp .env.example .env
# Edit .env and set LLM_BASE_URL, LLM_API_KEY, LLM_MODEL
```

#### Run an Evaluation

```python
from webmainbench import DataLoader, Evaluator, ExtractorFactory
@@ -170,7 +207,7 @@
m = result.overall_metrics
print(f"Overall Score: {result.overall_metrics['overall']:.4f}")
```

#### Compare Multiple Extractors

```python
extractors = ["trafilatura", "resiliparse", "magic-html"]
```