195 changes: 195 additions & 0 deletions README_JUPYTER.md
@@ -0,0 +1,195 @@
# web2json-agent Jupyter Guide

This document is written specifically around [scripts/run_jsonl_web2json_pipeline.py](/Users/luqing/Downloads/multiModal/web2json-agent/scripts/run_jsonl_web2json_pipeline.py); the goal is to run the full `jsonl -> html -> classify -> schema -> code -> data` pipeline directly in Jupyter.

It does not replace the project's original [README.md](/Users/luqing/Downloads/multiModal/web2json-agent/README.md).

## Which execution path this guide covers

This guide uses the project's full script pipeline, not the simpler single-call `extract_data(...)` interface:

- Entry script: [scripts/run_jsonl_web2json_pipeline.py](/Users/luqing/Downloads/multiModal/web2json-agent/scripts/run_jsonl_web2json_pipeline.py)
- Jupyter wrapper: [jupyter_helper.py](/Users/luqing/Downloads/multiModal/web2json-agent/jupyter_helper.py)
- Notebook helper implementation: [notebooks/jupyter_helper.py](/Users/luqing/Downloads/multiModal/web2json-agent/notebooks/jupyter_helper.py)
- Example notebook: [notebooks/web2json_quickstart.ipynb](/Users/luqing/Downloads/multiModal/web2json-agent/notebooks/web2json_quickstart.ipynb)

## What the pipeline does

The script runs the following steps in order:

1. Read the `jsonl`
2. Extract the `html` field from each record
3. Split the records into individual `.html` files and generate `manifest.jsonl`
4. Run `classify_html_dir` on the HTML
5. Run `extract_schema` for each cluster
6. Run `infer_code`
7. Run `extract_data_with_code` with the generated parser
8. Write `pipeline_summary.json`
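Steps 1–3 above can be sketched roughly as follows. This is illustrative only: the real script may choose different file names and manifest fields, and `split_jsonl_to_html` is a hypothetical helper, not the script's actual function.

```python
import json
from pathlib import Path

def split_jsonl_to_html(source_jsonl: str, out_dir: str, html_key: str = "html") -> Path:
    """Sketch of steps 1-3: read the jsonl, write each record's HTML to its
    own file, and record the mapping in manifest.jsonl."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    manifest_path = out / "manifest.jsonl"
    with open(source_jsonl, encoding="utf-8") as src, \
         open(manifest_path, "w", encoding="utf-8") as manifest:
        for index, line in enumerate(src):
            record = json.loads(line)
            html_file = out / f"{index:05d}.html"
            html_file.write_text(record[html_key], encoding="utf-8")
            entry = {"index": index, "file": html_file.name, "url": record.get("url")}
            manifest.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return manifest_path
```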

This fits input data of the following shape:

- The raw data is `jsonl`
- Each line is one web-page record
- Each record contains an `html` field
- Records may also carry `url`, `track_id`, and `status`
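For reference, a single input line might look like the record below. This is a hypothetical example; only the `html` field is required, and the other field values are made up.

```python
import json

# Hypothetical record: only the "html" field is required by the pipeline;
# "url", "track_id" and "status" are optional metadata that ride along.
record = {
    "url": "https://example.com/item/1",
    "track_id": "t-0001",
    "status": 200,
    "html": "<html><body><h1>Item 1</h1></body></html>",
}

# jsonl means exactly one JSON object per line.
jsonl_line = json.dumps(record, ensure_ascii=False)
```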

## Shortest path in Jupyter

### 1. Enter the project directory

```bash
cd /Users/luqing/Downloads/multiModal/web2json-agent
```

### 2. Install the project

Use `python3.11` explicitly; do not rely on the system's older default `python3`.

```bash
python3.11 -m pip install .
```

### 3. Start Jupyter

```bash
python3.11 -m notebook
```

Or:

```bash
python3.11 -m jupyter lab
```

### 4. Open the example notebook

Open:

`notebooks/web2json_quickstart.ipynb`

## Minimal notebook example

### Cell 1: Initialize the environment

```python
from jupyter_helper import prepare_notebook

prepare_notebook(
    api_key="YOUR_API_KEY",
    api_base="https://api.openai.com/v1",
)
```

### Cell 2: Run the full JSONL pipeline

```python
from jupyter_helper import run_jsonl_pipeline, summarize_pipeline_result

result = run_jsonl_pipeline(
    source_jsonl="ToClassify/sample.json",
    work_id="sample_run",
    input_root="input_html",
    output_root="output",
    html_key="html",
    iteration_rounds=3,
    cluster_limit=1,
)

summarize_pipeline_result(result)
```
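`summarize_pipeline_result(result)` returns a plain dict. The field names below come from `notebooks/jupyter_helper.py`; the values shown are made up for illustration, not from a real run.

```python
# Field names mirror summarize_pipeline_result in notebooks/jupyter_helper.py;
# the values are illustrative only.
summary = {
    "source_jsonl": "ToClassify/sample.json",
    "pipeline_root": "output/sample_run_pipeline",
    "cluster_count": 1,
    "clusters": [
        {
            "cluster_name": "cluster_0",
            "cluster_size": 20,
            "parse_success_count": 19,
            "parse_failed_count": 1,
        }
    ],
    "total_token_usage": {},
    "summary_path": "output/sample_run_pipeline/pipeline_summary.json",
}
```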

### Cell 3: Inspect the full result

```python
result.to_dict()
```

## Calling the original script directly

If you prefer not to go through the helper, you can import the function from the original script inside a notebook:

```python
from scripts.run_jsonl_web2json_pipeline import run_jsonl_pipeline

result = run_jsonl_pipeline(
    source_jsonl="ToClassify/sample.json",
    work_id="sample_run",
)
```

This is the notebook-friendly entry point added to [scripts/run_jsonl_web2json_pipeline.py](/Users/luqing/Downloads/multiModal/web2json-agent/scripts/run_jsonl_web2json_pipeline.py).

## Parameter reference

`run_jsonl_pipeline(...)` takes these main parameters:

- `source_jsonl`: path to the source `jsonl`
- `work_id`: identifier for this run; auto-generated from the file name when empty
- `input_root`: root directory for the split HTML files, default `input_html`
- `output_root`: root directory for pipeline output, default `output`
- `html_key`: name of the HTML field in the `jsonl`, default `html`
- `iteration_rounds`: upper bound on schema-learning rounds, default `3`
- `cluster_limit`: maximum number of clusters to process; the default `0` means all
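The `work_id` fallback could plausibly work like the sketch below. This is an assumption about the rule, not the script's actual code; check `scripts/run_jsonl_web2json_pipeline.py` for the exact behavior.

```python
from pathlib import Path

def default_work_id(source_jsonl: str) -> str:
    # Assumption: an empty work_id falls back to the source file's stem;
    # the real script may differ (e.g. it could add a timestamp).
    return Path(source_jsonl).stem
```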

## Where the results land

If you run:

```python
result = run_jsonl_pipeline(
    source_jsonl="ToClassify/sample.json",
    work_id="sample_run",
)
```

this will typically produce:

- `input_html/sample_run/`
- `output/sample_run_pipeline/`
- `output/sample_run_pipeline/pipeline_summary.json`

Each cluster directory also contains:

- the schema output directory
- the code output directory
- the data output directory
- the final parser file
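After a run you can inspect `pipeline_summary.json` directly. The helper below is hypothetical and assumes the summary file uses the same cluster keys as `summarize_pipeline_result` in `notebooks/jupyter_helper.py`; verify the actual key names against a real summary file.

```python
import json
from pathlib import Path

def load_cluster_stats(summary_path: str) -> list[tuple[str, int, int]]:
    """Return (cluster_name, parse_success_count, parse_failed_count) per cluster.
    Assumes the summary's cluster keys match summarize_pipeline_result."""
    summary = json.loads(Path(summary_path).read_text(encoding="utf-8"))
    return [
        (c["cluster_name"], c["parse_success_count"], c["parse_failed_count"])
        for c in summary.get("clusters", [])
    ]
```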

## API key configuration

Choose one of the following:

### Option A: set it in the notebook

```python
from jupyter_helper import prepare_notebook

prepare_notebook(
    api_key="YOUR_API_KEY",
    api_base="https://api.openai.com/v1",
)
```

### Option B: put a `.env` in the project root

```env
OPENAI_API_KEY=YOUR_API_KEY
OPENAI_API_BASE=https://api.openai.com/v1
DEFAULT_MODEL=gpt-4.1
```
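If you are curious what `.env` loading amounts to, here is a minimal stdlib-only sketch. It is not the project's actual loader (real projects usually rely on `python-dotenv`); it only illustrates the `KEY=VALUE` format above.

```python
import os
from pathlib import Path

def load_env_file(path: str = ".env") -> dict:
    """Minimal .env parser (sketch): one KEY=VALUE per line, '#' comments
    and blank lines ignored. Parsed values are pushed into os.environ."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    os.environ.update(values)
    return values
```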

## Known prerequisites

- Python `>= 3.10` is required
- On this machine, the default `python3` is an outdated `3.7.3`
- Always invoke `python3.11` explicitly
- The pipeline depends on a model API; configure the key/base before running

## Related files

- [README.md](/Users/luqing/Downloads/multiModal/web2json-agent/README.md)
- [README_JUPYTER.md](/Users/luqing/Downloads/multiModal/web2json-agent/README_JUPYTER.md)
- [scripts/run_jsonl_web2json_pipeline.py](/Users/luqing/Downloads/multiModal/web2json-agent/scripts/run_jsonl_web2json_pipeline.py)
- [jupyter_helper.py](/Users/luqing/Downloads/multiModal/web2json-agent/jupyter_helper.py)
- [notebooks/jupyter_helper.py](/Users/luqing/Downloads/multiModal/web2json-agent/notebooks/jupyter_helper.py)
- [notebooks/web2json_quickstart.ipynb](/Users/luqing/Downloads/multiModal/web2json-agent/notebooks/web2json_quickstart.ipynb)
3 changes: 3 additions & 0 deletions jupyter_helper.py
@@ -0,0 +1,3 @@
"""Compatibility wrapper so notebooks can import jupyter_helper from multiple locations."""

from notebooks.jupyter_helper import * # noqa: F401,F403
136 changes: 136 additions & 0 deletions notebooks/jupyter_helper.py
**Collaborator:** Are the two files under `notebooks/` scripts intended to run on Spark?

**Author:** No, these two files are not for Spark. They are helper scripts added for locally debugging and demoing the web2json flow in Jupyter/Notebook: they initialize the notebook environment, assemble the configuration, and let you invoke the whole pipeline directly from a notebook.
@@ -0,0 +1,136 @@
"""Utilities for running web2json-agent inside Jupyter notebooks."""

from __future__ import annotations

import json
import os
import sys
from pathlib import Path
from typing import Any, Optional, Sequence

PROJECT_ROOT = Path(__file__).resolve().parents[1]


def prepare_notebook(
    api_key: Optional[str] = None,
    api_base: Optional[str] = None,
    project_root: Optional[str] = None,
) -> Path:
    """Prepare the notebook process for local package imports and env loading."""
    root = Path(project_root).expanduser().resolve() if project_root else PROJECT_ROOT

    if str(root) not in sys.path:
        sys.path.insert(0, str(root))

    os.chdir(root)

    if api_key:
        os.environ["OPENAI_API_KEY"] = api_key

    if api_base:
        os.environ["OPENAI_API_BASE"] = api_base

    return root


def make_extract_config(
    name: str,
    html_path: str,
    output_path: str = "output",
    save: Optional[Sequence[str]] = ("schema", "code", "data"),
    schema: Optional[dict[str, Any]] = None,
    iteration_rounds: int = 3,
    enable_schema_edit: bool = False,
    remove_null_fields: bool = True,
    parser_code: Optional[str] = None,
):
    """Build a Web2JsonConfig with notebook-friendly path resolution."""
    prepare_notebook()

    from web2json import Web2JsonConfig

    html_target = _resolve_project_path(html_path)
    output_target = _resolve_project_path(output_path)

    return Web2JsonConfig(
        name=name,
        html_path=str(html_target),
        output_path=str(output_target),
        iteration_rounds=iteration_rounds,
        schema=schema,
        enable_schema_edit=enable_schema_edit,
        parser_code=parser_code,
        save=list(save) if save is not None else None,
        remove_null_fields=remove_null_fields,
    )


def preview_records(records: Sequence[dict[str, Any]], limit: int = 3) -> list[dict[str, Any]]:
    """Return the first few parsed records so a notebook cell renders them directly."""
    return list(records[:limit])


def print_schema(schema: dict[str, Any]) -> None:
    """Pretty print schema content inside notebooks."""
    print(json.dumps(schema, ensure_ascii=False, indent=2))


def summarize_cluster_result(cluster_result: Any) -> dict[str, Any]:
    """Convert a cluster result into a compact notebook-friendly summary."""
    return {
        "cluster_count": cluster_result.cluster_count,
        "clusters": {name: len(files) for name, files in cluster_result.clusters.items()},
        "noise_files": len(cluster_result.noise_files),
    }


def run_jsonl_pipeline(
    source_jsonl: str,
    work_id: str = "",
    input_root: str = "input_html",
    output_root: str = "output",
    html_key: str = "html",
    iteration_rounds: int = 3,
    cluster_limit: int = 0,
):
    """Run the full JSONL pipeline from a notebook and return the structured summary."""
    prepare_notebook()

    from scripts.run_jsonl_web2json_pipeline import run_jsonl_pipeline as _run_jsonl_pipeline

    return _run_jsonl_pipeline(
        source_jsonl=str(_resolve_project_path(source_jsonl)),
        work_id=work_id,
        input_root=str(_resolve_project_path(input_root)),
        output_root=str(_resolve_project_path(output_root)),
        html_key=html_key,
        iteration_rounds=iteration_rounds,
        cluster_limit=cluster_limit,
    )


def summarize_pipeline_result(result: Any) -> dict[str, Any]:
    """Build a compact summary view for notebook display."""
    return {
        "source_jsonl": result.source_jsonl,
        "pipeline_root": result.pipeline_root,
        "cluster_count": result.cluster_count,
        "clusters": [
            {
                "cluster_name": cluster["cluster_name"],
                "cluster_size": cluster["cluster_size"],
                "parse_success_count": cluster["parse_success_count"],
                "parse_failed_count": cluster["parse_failed_count"],
            }
            for cluster in result.clusters
        ],
        "total_token_usage": result.total_token_usage,
        "summary_path": result.summary_path,
    }


def _resolve_project_path(path_str: str) -> Path:
    path = Path(path_str).expanduser()
    if path.is_absolute():
        return path
    return (PROJECT_ROOT / path).resolve()