Use web2json agent to clean html for v2 project #88
Open
dreamGirl1996 wants to merge 2 commits into ccprocessor:main from dreamGirl1996:user/luqing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,195 @@ | ||
| # web2json-agent Jupyter Guide | ||
|
|
||
| This document is written specifically around [scripts/run_jsonl_web2json_pipeline.py](/Users/luqing/Downloads/multiModal/web2json-agent/scripts/run_jsonl_web2json_pipeline.py). Its goal is to run the full `jsonl -> html -> classify -> schema -> code -> data` pipeline directly in Jupyter. | ||
|
|
||
| It does not replace the project's original [README.md](/Users/luqing/Downloads/multiModal/web2json-agent/README.md). | ||
|
|
||
| ## Which execution path this document covers | ||
|
|
||
| This guide does not use the simplest single-call `extract_data(...)` approach. It uses the project's full script pipeline: | ||
|
|
||
| - Entry script: [scripts/run_jsonl_web2json_pipeline.py](/Users/luqing/Downloads/multiModal/web2json-agent/scripts/run_jsonl_web2json_pipeline.py) | ||
| - Jupyter wrapper: [jupyter_helper.py](/Users/luqing/Downloads/multiModal/web2json-agent/jupyter_helper.py) | ||
| - Notebook helper implementation: [notebooks/jupyter_helper.py](/Users/luqing/Downloads/multiModal/web2json-agent/notebooks/jupyter_helper.py) | ||
| - Example notebook: [notebooks/web2json_quickstart.ipynb](/Users/luqing/Downloads/multiModal/web2json-agent/notebooks/web2json_quickstart.ipynb) | ||
|
|
||
| ## What the pipeline does | ||
|
|
||
| The script runs the following steps in order: | ||
|
|
||
| 1. Read the `jsonl` file | ||
| 2. Take the `html` field from each record | ||
| 3. Split the records into a batch of `.html` files and generate `manifest.jsonl` | ||
| 4. Run `classify_html_dir` on the HTML files | ||
| 5. Run `extract_schema` for each cluster | ||
| 6. Run `infer_code` | ||
| 7. Run `extract_data_with_code` with the generated parser | ||
| 8. Write `pipeline_summary.json` | ||
|
|
||
| It suits input data of this shape: | ||
|
|
||
| - The raw data is `jsonl` | ||
| - Each line is one web page record | ||
| - Each record has an `html` field | ||
| - Records may also carry `url`, `track_id`, and `status` | ||
|
|
||
| ## Shortest path in Jupyter | ||
|
|
||
| ### 1. Enter the project directory | ||
|
|
||
| ```bash | ||
| cd /Users/luqing/Downloads/multiModal/web2json-agent | ||
| ``` | ||
|
|
||
| ### 2. Install the project | ||
|
|
||
| Use `python3.11` explicitly; do not use the older system default `python3`. | ||
|
|
||
| ```bash | ||
| python3.11 -m pip install . | ||
| ``` | ||
|
|
||
| ### 3. Start Jupyter | ||
|
|
||
| ```bash | ||
| python3.11 -m notebook | ||
| ``` | ||
|
|
||
| Or: | ||
|
|
||
| ```bash | ||
| python3.11 -m jupyter lab | ||
| ``` | ||
|
|
||
| ### 4. Open the example notebook | ||
|
|
||
| Open: | ||
|
|
||
| `notebooks/web2json_quickstart.ipynb` | ||
|
|
||
| ## Minimal notebook example | ||
|
|
||
| ### Cell 1: Initialize the environment | ||
|
|
||
| ```python | ||
| from jupyter_helper import prepare_notebook | ||
|
|
||
| prepare_notebook( | ||
| api_key="YOUR_API_KEY", | ||
| api_base="https://api.openai.com/v1", | ||
| ) | ||
| ``` | ||
|
|
||
| ### Cell 2: Run the full JSONL pipeline | ||
|
|
||
| ```python | ||
| from jupyter_helper import run_jsonl_pipeline, summarize_pipeline_result | ||
|
|
||
| result = run_jsonl_pipeline( | ||
| source_jsonl="ToClassify/sample.json", | ||
| work_id="sample_run", | ||
| input_root="input_html", | ||
| output_root="output", | ||
| html_key="html", | ||
| iteration_rounds=3, | ||
| cluster_limit=1, | ||
| ) | ||
|
|
||
| summarize_pipeline_result(result) | ||
| ``` | ||
|
|
||
| ### Cell 3: View the full result | ||
|
|
||
| ```python | ||
| result.to_dict() | ||
| ``` | ||
|
|
||
| ## Calling the original script directly | ||
|
|
||
| If you prefer not to go through the helper, you can also import the function from the original script directly in a notebook: | ||
|
|
||
| ```python | ||
| from scripts.run_jsonl_web2json_pipeline import run_jsonl_pipeline | ||
|
|
||
| result = run_jsonl_pipeline( | ||
| source_jsonl="ToClassify/sample.json", | ||
| work_id="sample_run", | ||
| ) | ||
| ``` | ||
|
|
||
| This is the notebook-friendly entry point newly added to [scripts/run_jsonl_web2json_pipeline.py](/Users/luqing/Downloads/multiModal/web2json-agent/scripts/run_jsonl_web2json_pipeline.py). | ||
|
|
||
| ## Parameters | ||
|
|
||
| Main parameters of `run_jsonl_pipeline(...)`: | ||
|
|
||
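| - `source_jsonl`: path to the source `jsonl` file | ||
| - `work_id`: identifier for this run; generated automatically from the file name when empty | ||
| - `input_root`: root directory for the split HTML files, default `input_html` | ||
| - `output_root`: root directory for pipeline output, default `output` | ||
| - `html_key`: name of the HTML field in the `jsonl` records, default `html` | ||
| - `iteration_rounds`: upper bound on schema-learning rounds, default `3` | ||
| - `cluster_limit`: maximum number of clusters to process; default `0`, meaning all | ||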
|
|
||
| ## Where results are written | ||
|
|
||
| If you set: | ||
|
|
||
| ```python | ||
| result = run_jsonl_pipeline( | ||
| source_jsonl="ToClassify/sample.json", | ||
| work_id="sample_run", | ||
| ) | ||
| ``` | ||
|
|
||
| the run typically generates: | ||
|
|
||
| - `input_html/sample_run/` | ||
| - `output/sample_run_pipeline/` | ||
| - `output/sample_run_pipeline/pipeline_summary.json` | ||
|
|
||
| Each cluster additionally contains: | ||
|
|
||
| - a schema output directory | ||
| - a code output directory | ||
| - a data output directory | ||
| - the final parser file | ||
|
|
||
| ## API key configuration | ||
|
|
||
| Choose one of the two options: | ||
|
|
||
| ### Option A: Set it in the notebook | ||
|
|
||
| ```python | ||
| from jupyter_helper import prepare_notebook | ||
|
|
||
| prepare_notebook( | ||
| api_key="YOUR_API_KEY", | ||
| api_base="https://api.openai.com/v1", | ||
| ) | ||
| ``` | ||
|
|
||
| ### Option B: Put a `.env` file in the project root | ||
|
|
||
| ```env | ||
| OPENAI_API_KEY=YOUR_API_KEY | ||
| OPENAI_API_BASE=https://api.openai.com/v1 | ||
| DEFAULT_MODEL=gpt-4.1 | ||
| ``` | ||
|
|
||
| ## Known prerequisites | ||
|
|
||
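| - Python `>= 3.10` is required | ||
| - On the author's current machine, the default `python3` is the older `3.7.3` | ||
| - Always invoking `python3.11` explicitly is recommended | ||
| - The pipeline depends on a model API, so the key and base URL must be configured before use | ||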
|
|
||
| ## Related files | ||
|
|
||
| - [README.md](/Users/luqing/Downloads/multiModal/web2json-agent/README.md) | ||
| - [README_JUPYTER.md](/Users/luqing/Downloads/multiModal/web2json-agent/README_JUPYTER.md) | ||
| - [scripts/run_jsonl_web2json_pipeline.py](/Users/luqing/Downloads/multiModal/web2json-agent/scripts/run_jsonl_web2json_pipeline.py) | ||
| - [jupyter_helper.py](/Users/luqing/Downloads/multiModal/web2json-agent/jupyter_helper.py) | ||
| - [notebooks/jupyter_helper.py](/Users/luqing/Downloads/multiModal/web2json-agent/notebooks/jupyter_helper.py) | ||
| - [notebooks/web2json_quickstart.ipynb](/Users/luqing/Downloads/multiModal/web2json-agent/notebooks/web2json_quickstart.ipynb) |
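
For readers who want a feel for steps 1 to 3 of the pipeline described in the guide, here is a minimal sketch of splitting a `jsonl` file into `.html` files plus a `manifest.jsonl`. It is not taken from `scripts/run_jsonl_web2json_pipeline.py`: the function name `split_jsonl_to_html`, the zero-padded file naming, and the manifest fields are all assumptions made for illustration.

```python
import json
from pathlib import Path


def split_jsonl_to_html(source_jsonl, out_dir, html_key="html"):
    """Hypothetical sketch: write one .html file per record and a manifest.jsonl."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    with open(source_jsonl, encoding="utf-8") as src, \
            open(out / "manifest.jsonl", "w", encoding="utf-8") as manifest:
        for line_no, line in enumerate(src, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines in the input
            record = json.loads(line)
            html = record.get(html_key)
            if not isinstance(html, str) or not html:
                continue  # skip records without usable HTML
            written += 1
            name = f"{written:06d}.html"  # assumed naming scheme
            (out / name).write_text(html, encoding="utf-8")
            # Carry over the optional metadata fields the guide mentions.
            entry = {"file": name, "line": line_no}
            for key in ("url", "track_id", "status"):
                if key in record:
                    entry[key] = record[key]
            manifest.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return written
```

Recording the source line number in the manifest is one way to trace each extracted result back to its original record; the real script may track this differently.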
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| """Compatibility wrapper so notebooks can import jupyter_helper from multiple locations.""" | ||
|
|
||
| from notebooks.jupyter_helper import * # noqa: F401,F403 |
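
One detail of this wrapper may be worth checking during review: without an `__all__` list, `from module import *` re-exports every public name of the target module (including modules it imported itself) and skips names that start with an underscore. The sketch below demonstrates that rule with a throwaway in-memory module; `demo_helper`, `prepare`, and `_resolve` are hypothetical stand-ins, not the PR's actual names.

```python
import sys
import types

# Hypothetical stand-in for a helper module (names are illustrative only).
HELPER_SOURCE = '''
import json

def prepare():
    return "ready"

def _resolve(path):
    return path
'''

helper = types.ModuleType("demo_helper")
exec(HELPER_SOURCE, helper.__dict__)
sys.modules["demo_helper"] = helper

# Simulate the body of a wrapper module that only does a star import.
wrapper_namespace = {}
exec("from demo_helper import *", wrapper_namespace)

exported = sorted(name for name in wrapper_namespace if not name.startswith("__"))
print(exported)  # ['json', 'prepare']
```

If the same rule applies here, helpers such as `_resolve_project_path` would not be importable through the compatibility wrapper, while incidental imports like `json` would be. Declaring `__all__` in the helper module is one way to make the exported surface explicit.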
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,136 @@ | ||
| """Utilities for running web2json-agent inside Jupyter notebooks.""" | ||
|
|
||
| from __future__ import annotations | ||
|
|
||
| import json | ||
| import os | ||
| import sys | ||
| from pathlib import Path | ||
| from typing import Any, Optional, Sequence | ||
|
|
||
| PROJECT_ROOT = Path(__file__).resolve().parents[1] | ||
|
|
||
|
|
||
| def prepare_notebook( | ||
| api_key: Optional[str] = None, | ||
| api_base: Optional[str] = None, | ||
| project_root: Optional[str] = None, | ||
| ) -> Path: | ||
| """Prepare the notebook process for local package imports and env loading.""" | ||
| root = Path(project_root).expanduser().resolve() if project_root else PROJECT_ROOT | ||
|
|
||
| if str(root) not in sys.path: | ||
| sys.path.insert(0, str(root)) | ||
|
|
||
| os.chdir(root) | ||
|
|
||
| if api_key: | ||
| os.environ["OPENAI_API_KEY"] = api_key | ||
|
|
||
| if api_base: | ||
| os.environ["OPENAI_API_BASE"] = api_base | ||
|
|
||
| return root | ||
|
|
||
|
|
||
| def make_extract_config( | ||
| name: str, | ||
| html_path: str, | ||
| output_path: str = "output", | ||
| save: Optional[Sequence[str]] = ("schema", "code", "data"), | ||
| schema: Optional[dict[str, Any]] = None, | ||
| iteration_rounds: int = 3, | ||
| enable_schema_edit: bool = False, | ||
| remove_null_fields: bool = True, | ||
| parser_code: Optional[str] = None, | ||
| ): | ||
| """Build a Web2JsonConfig with notebook-friendly path resolution.""" | ||
| prepare_notebook() | ||
|
|
||
| from web2json import Web2JsonConfig | ||
|
|
||
| html_target = _resolve_project_path(html_path) | ||
| output_target = _resolve_project_path(output_path) | ||
|
|
||
| return Web2JsonConfig( | ||
| name=name, | ||
| html_path=str(html_target), | ||
| output_path=str(output_target), | ||
| iteration_rounds=iteration_rounds, | ||
| schema=schema, | ||
| enable_schema_edit=enable_schema_edit, | ||
| parser_code=parser_code, | ||
| save=list(save) if save is not None else None, | ||
| remove_null_fields=remove_null_fields, | ||
| ) | ||
|
|
||
|
|
||
| def preview_records(records: Sequence[dict[str, Any]], limit: int = 3) -> list[dict[str, Any]]: | ||
| """Return the first few parsed records so a notebook cell renders them directly.""" | ||
| return list(records[:limit]) | ||
|
|
||
|
|
||
| def print_schema(schema: dict[str, Any]) -> None: | ||
| """Pretty print schema content inside notebooks.""" | ||
| print(json.dumps(schema, ensure_ascii=False, indent=2)) | ||
|
|
||
|
|
||
| def summarize_cluster_result(cluster_result: Any) -> dict[str, Any]: | ||
| """Convert a cluster result into a compact notebook-friendly summary.""" | ||
| return { | ||
| "cluster_count": cluster_result.cluster_count, | ||
| "clusters": {name: len(files) for name, files in cluster_result.clusters.items()}, | ||
| "noise_files": len(cluster_result.noise_files), | ||
| } | ||
|
|
||
|
|
||
| def run_jsonl_pipeline( | ||
| source_jsonl: str, | ||
| work_id: str = "", | ||
| input_root: str = "input_html", | ||
| output_root: str = "output", | ||
| html_key: str = "html", | ||
| iteration_rounds: int = 3, | ||
| cluster_limit: int = 0, | ||
| ): | ||
| """Run the full JSONL pipeline from a notebook and return the structured summary.""" | ||
| prepare_notebook() | ||
|
|
||
| from scripts.run_jsonl_web2json_pipeline import run_jsonl_pipeline as _run_jsonl_pipeline | ||
|
|
||
| return _run_jsonl_pipeline( | ||
| source_jsonl=str(_resolve_project_path(source_jsonl)), | ||
| work_id=work_id, | ||
| input_root=str(_resolve_project_path(input_root)), | ||
| output_root=str(_resolve_project_path(output_root)), | ||
| html_key=html_key, | ||
| iteration_rounds=iteration_rounds, | ||
| cluster_limit=cluster_limit, | ||
| ) | ||
|
|
||
|
|
||
| def summarize_pipeline_result(result: Any) -> dict[str, Any]: | ||
| """Build a compact summary view for notebook display.""" | ||
| return { | ||
| "source_jsonl": result.source_jsonl, | ||
| "pipeline_root": result.pipeline_root, | ||
| "cluster_count": result.cluster_count, | ||
| "clusters": [ | ||
| { | ||
| "cluster_name": cluster["cluster_name"], | ||
| "cluster_size": cluster["cluster_size"], | ||
| "parse_success_count": cluster["parse_success_count"], | ||
| "parse_failed_count": cluster["parse_failed_count"], | ||
| } | ||
| for cluster in result.clusters | ||
| ], | ||
| "total_token_usage": result.total_token_usage, | ||
| "summary_path": result.summary_path, | ||
| } | ||
|
|
||
|
|
||
| def _resolve_project_path(path_str: str) -> Path: | ||
| path = Path(path_str).expanduser() | ||
| if path.is_absolute(): | ||
| return path | ||
| return (PROJECT_ROOT / path).resolve() |
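
The path rule in `_resolve_project_path` can be tried in isolation. The sketch below mirrors that logic but takes the root as a parameter, so it runs outside the project; `resolve_against` and the temporary directory standing in for the project root are assumptions for the demo.

```python
import tempfile
from pathlib import Path


def resolve_against(root, path_str):
    """Absolute paths pass through unchanged; relative ones are anchored at root."""
    path = Path(path_str).expanduser()
    if path.is_absolute():
        return path
    return (Path(root) / path).resolve()


# A throwaway directory stands in for the project root (assumption for the demo).
demo_root = Path(tempfile.mkdtemp()).resolve()

relative = resolve_against(demo_root, "output/sample_run_pipeline")
absolute = resolve_against(demo_root, str(demo_root / "elsewhere"))

print(relative == demo_root / "output" / "sample_run_pipeline")
print(absolute == demo_root / "elsewhere")
```

Anchoring relative paths at a fixed root rather than at the current working directory means a notebook gets the same output location regardless of where the kernel was started. Note that the helper in this diff always anchors at the module-level `PROJECT_ROOT`, even when `prepare_notebook` is called with a different `project_root`.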
Are the two files under notebooks/ scripts meant to run on Spark?
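These two files are not for Spark. They are helper scripts added for debugging and demonstrating the web2json flow locally in Jupyter/Notebook. They mainly handle notebook environment initialization, config assembly, and calling the whole pipeline directly from a notebook.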