Closed
88 commits
61fbdce
Framework simplified v1.0 (#211)
yogacc33 Feb 21, 2025
29445fa
add zh-en-article quality model (#208)
ideaflow Feb 24, 2025
fd04da5
bugfix: Make CleanExp inherit from WebKitBaseException (#223)
darkrush Feb 24, 2025
83fa499
fix html simplify noscript tag (#222)
feifei2023 Feb 24, 2025
917c466
[CI]: add python3.12&3.11 env (#220)
e06084 Feb 25, 2025
cb8b5fb
resolve nest table (#225)
dt-yy Feb 25, 2025
4765cfb
fix: mml to latex and math extract no full (#219)
lsp213 Feb 25, 2025
9892563
fix: DataJson construct method do not change outer variable (#226)
drunkpig Feb 26, 2025
74f4383
fix: code test case (#221)
NgZiming Feb 26, 2025
c499c58
[fix]: fix math nonetype (#229)
e06084 Feb 26, 2025
0ae2776
add document of quality model (#230)
ideaflow Feb 26, 2025
002f9de
fix:math no text (#233)
lsp213 Feb 27, 2025
4e63cc5
feat: add CleanModule to provide general clean function (#234)
darkrush Feb 27, 2025
6d48387
Add a new interface in political model to accommodate the new require…
ideaflow Feb 27, 2025
211759e
feat: add code and math test html in st (#235)
e06084 Feb 27, 2025
47bad53
[CI]: python3.11 and 3.12 run when requirements modify (#236)
e06084 Feb 27, 2025
840c7bd
Revise quality document,and change english stop words reading method …
ideaflow Feb 27, 2025
6002dfc
feat: add CleanModule to provide general clean function (#246)
darkrush Feb 28, 2025
3f4b8e8
lang_id doc revise (#245)
2471023025 Feb 28, 2025
b095987
feat: add CleanModule to provide general clean function (#232)
darkrush Feb 28, 2025
90cc8a2
fix: move html_simplify_classify.md to docs/llm_web_kit/model (#247)
darkrush Feb 28, 2025
44a105e
Exception refactoring (#249)
yogacc33 Feb 28, 2025
1e97c9e
refactor: remove old clean exception add model-related exceptions (#252)
darkrush Feb 28, 2025
be90fe7
feat: add CleanModule documentation and parameter descriptions, support content quality cleaning (#250)
darkrush Feb 28, 2025
da6a8a0
Revise the language classification model documentation (#251)
2471023025 Feb 28, 2025
68537e3
init unsafe_words_detector.py (#194)
Adela-Yu-Coder Feb 28, 2025
68a18be
feat: add thread-safe file download with file hash verification and a locking mechanism (#254)
darkrush Feb 28, 2025
fd8e2a0
fix: repair the file locking mechanism so the lock file is correctly deleted when an exception occurs (#256)
darkrush Feb 28, 2025
97c15b2
Dev lid218 (#255)
2471023025 Mar 2, 2025
340f11c
Adjust parsing order & update standard (#248)
dt-yy Mar 3, 2025
e6806ad
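docs: update the HTML simplify/classify documentation, add a model download config example; fix: change the quality model predict method to limit the thread count to 1 (#260)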
darkrush Mar 3, 2025
9021522
add test case (#258)
dt-yy Mar 4, 2025
d04b568
Dev lid218 (#261)
2471023025 Mar 4, 2025
87208dc
[fix]: remove class=d-none tag (#268)
e06084 Mar 5, 2025
9841263
update document of quality and political model (#270)
ideaflow Mar 5, 2025
fc2eb62
x (#269)
renpengli01 Mar 5, 2025
0106121
Include the tail text of table cells (#267)
dt-yy Mar 5, 2025
1c498d5
refact: make resource_utils to use project defined exception and ref…
darkrush Mar 6, 2025
620f5e2
[feat]: add CleanTagsPreExtractor (#278)
e06084 Mar 7, 2025
49390e6
Fix list, table, and related issues (#279)
dt-yy Mar 10, 2025
da8bb2b
add list test case & fix list nest level (#282)
dt-yy Mar 10, 2025
75ad259
Fix the entity markup issue in tables (#285)
dt-yy Mar 10, 2025
8653649
use SoftFilelock to ensure model resource processed correctly (#280)
darkrush Mar 10, 2025
f19583e
feat: html_layout_classify/* model classification processing & html_layout_classify.md
renpengli01 Mar 11, 2025
4a7c92e
fix: html_layout_classify commit
renpengli01 Mar 11, 2025
2671023
Merge pull request #288 from renpengli01/rpl03
drunkpig Mar 11, 2025
d70af66
Refactor model import: use the import_transformer function instead of importing transformers directly
darkrush Mar 11, 2025
1772634
Refactor model loading: use the import_transformer function instead of importing transformers directly
darkrush Mar 11, 2025
8315533
Merge branch 'dev' of github.com:ccprocessor/llm-webkit-mirror into m…
darkrush Mar 11, 2025
522a324
feat: content_list to_dict()
Feb 28, 2025
50fafba
fix: test error
Mar 11, 2025
3d27885
fix: ensure the cache directory and temp directory exist
darkrush Mar 11, 2025
e9b0912
fix: json utils error with different python version
Mar 11, 2025
d73fab9
fix: json utils error with different python version
Mar 11, 2025
83a75c8
fix: datajson.to_dict()
Mar 11, 2025
cb1d052
Merge pull request #290 from darkrush/model_interface
darkrush Mar 11, 2025
203110b
Merge pull request #292 from drunkpig/dev7
drunkpig Mar 11, 2025
9e40494
Exception dynamically set dataset_name (#293)
yogacc33 Mar 11, 2025
a3c36bf
fix: update tests to mock CACHE_TMP_DIR for download and unzip functi…
darkrush Mar 12, 2025
cce1eee
Merge pull request #296 from darkrush/basemodel
darkrush Mar 12, 2025
f758d35
feat: fix line breaks not being preserved when getting text; add test cases for paragraph text (#297)
LollipopsAndWine Mar 12, 2025
77b5e5e
feat: use the first item in predict result as language_details (#298)
2471023025 Mar 13, 2025
72f5037
fix: correct the language details returned for empty content; under version 176, language_details returns "not_defined" (#300)
darkrush Mar 13, 2025
18286da
fix: set mock CACHE_TMP_DIR to /tmp correctly
darkrush Mar 13, 2025
c1b2fda
fix: empty extract formula (#304)
e06084 Mar 14, 2025
32ed991
[fix]: fix some formulas with unusual errors (#299)
e06084 Mar 14, 2025
dcdb988
feat: page classify
Mar 13, 2025
3f210f6
feat: classify page by html layout use GPU
Mar 18, 2025
efda13e
feat: raise CleanModelUnsupportedLanguageException in clean module
Mar 18, 2025
4fdad12
fix: clean model exception
Mar 18, 2025
4702438
feat: page classify
drunkpig Mar 18, 2025
4dbcecf
fix: code and text unit test
Mar 18, 2025
3ee2fba
fix: pipeline config move from source code to file
Mar 18, 2025
7f4fc08
Merge pull request #310 from drunkpig/dev9
drunkpig Mar 18, 2025
5039964
feat: math extract support mjx-container tag (#311)
e06084 Mar 19, 2025
cbcb1fc
feat: add MM_NODE_LIST in to_nlp_md (#305)
shijinpjlab Mar 19, 2025
8d0f7f3
feat: simple user api to extract html to markdown (#313)
drunkpig Mar 19, 2025
caf7159
feat: add title to DataJson (#314)
drunkpig Mar 19, 2025
3349d50
fix: html parser support xml_declaration (#315)
e06084 Mar 19, 2025
9200551
[feat] support math extract from mathjax config (#303)
e06084 Mar 19, 2025
53d4013
Update rule-based safety and model-based safety
Adela-Yu-Coder Mar 20, 2025
c46934c
feat: replace line breaks with double line breaks; configure global constants (#320)
LollipopsAndWine Mar 20, 2025
1f9c2ca
Revert "feat: replace line breaks with double line breaks; configure global constants (#320)" (#321)
yogacc33 Mar 20, 2025
584a355
build: setup add config files (#319)
yogacc33 Mar 20, 2025
1b18519
Performance improvements & bug fixes (#317)
dt-yy Mar 20, 2025
d360ff3
bench: update main_html extractor (#318)
e06084 Mar 20, 2025
4a3928b
update version (#323)
dt-yy Mar 20, 2025
ce41728
feat: replace br with a double line break when getting paragraph text (#325)
LollipopsAndWine Mar 20, 2025
19 changes: 19 additions & 0 deletions .codecov.yml
@@ -0,0 +1,19 @@
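coverage:
  status:
    patch:  # only check the coverage of changed code
      default:
        target: 85%  # coverage target for changed code
        threshold: 2%  # allowed fluctuation
        base: auto  # based on the current branch's coverage

# ignore specific paths
ignore:
  - "tests/**/*"  # ignore all test directories
  - "**/__pycache__/**/*"  # ignore cache files
  - "**/__init__.py"  # ignore package init files

# optional: adjust how the report is displayed
comment:
  layout: "reach, diff, flags, files"
  behavior: default
  require_changes: false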
4 changes: 2 additions & 2 deletions .github/workflows/pr_stage_test.yml
@@ -24,11 +24,11 @@ jobs:
   pr_ut_test:
     runs-on: ubuntu-latest
     env:
-      LLM_WEB_KIT_CFG_PATH: ${{ github.workspace }}/llm_web_kit/pipeline/pipe_tpl/pipeline_html_tpl.jsonc
+      LLM_WEB_KIT_CFG_PATH: ${{ github.workspace }}/bench/config/ours_config.jsonc
       PYTHONPATH: $PYTHONPATH:${{ github.workspace }}
     strategy:
       matrix:
-        python-version: [3.10.15]
+        python-version: [3.10.16]
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
47 changes: 47 additions & 0 deletions .github/workflows/pr_ut_test_extra.yml
@@ -0,0 +1,47 @@
name: pr_stage_ut_extra

on:
  pull_request:
    paths:
      - 'requirements/**'
      - 'setup.py'
  push:
    branches:
      - main
      - dev

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  # Extra Python version tests; run only when the requirements directory is modified
  pr_ut_test_extra:
    runs-on: ubuntu-latest
    env:
      LLM_WEB_KIT_CFG_PATH: ${{ github.workspace }}/bench/config/ours_config.jsonc
      PYTHONPATH: $PYTHONPATH:${{ github.workspace }}
    strategy:
      matrix:
        python-version: [3.11.11, 3.12.8]
    steps:
      - name: Checkout code
        uses: actions/checkout@v4
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v4
        with:
          python-version: ${{ matrix.python-version }}
      - name: Build llm_web_kit from source
        run: |
          pip install -e .
          pip list | grep llm_web_kit
      - name: Install unit tests dependencies
        run: |
          pip install -r requirements/runtime.txt
          pip install -r requirements/dev.txt
      - name: Run tests and collect coverage
        run: pytest --cov --cov-report=xml -n auto ./tests/llm_web_kit
      - name: Upload coverage reports to Codecov
        uses: codecov/codecov-action@v5
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
3 changes: 3 additions & 0 deletions .gitignore
@@ -4,6 +4,7 @@
 venv*/
 envs/
 slurm_logs/
+local_tests/
 
 __pycache__
 *.log
@@ -45,3 +46,5 @@ output/
 coverage.xml
 
 llm_web_kit.egg-info/*
+.llm-web-kit.jsonc
+.llm-web-kit-pageclassify.jsonc
14 changes: 7 additions & 7 deletions .pre-commit-config.yaml
@@ -1,11 +1,11 @@
-exclude: ^tests/llm_web_kit/pipeline/extractor/html/magic_html/assets/
+exclude: ^tests/llm_web_kit/extractor/html/magic_html/assets/
 
 repos:
   - repo: https://github.com/PyCQA/flake8
     rev: 5.0.4
     hooks:
       - id: flake8
-        args: [ "--max-line-length=2200", "--ignore=E131,E125,W503,W504,E203,E231,E702" ]
+        args: [ "--max-line-length=2200", "--ignore=E131,E125,W503,W504,E203,E231,E702,E128" ]
   - repo: https://github.com/PyCQA/isort
     rev: 5.11.5
     hooks:
@@ -19,19 +19,19 @@ repos:
   #   rev: v2.2.1
   #   hooks:
   #     - id: codespell
-  #       exclude: '^tests/.*/assets/'
+  #       exclude: '^tests/.*/assets/|llm_web_kit/model/assets/.*'
   #       args: ['--skip', '*.json']
   - repo: https://github.com/pre-commit/pre-commit-hooks
     rev: v4.3.0
     hooks:
       - id: trailing-whitespace
-        exclude: '^tests/.*/assets/'
+        exclude: '^tests/.*/assets/|llm_web_kit/model/assets/.*'
       - id: check-yaml
       - id: end-of-file-fixer
-        exclude: '^tests/.*/assets/'
+        exclude: '^tests/.*/assets/|llm_web_kit/model/assets/.*'
       - id: requirements-txt-fixer
       - id: double-quote-string-fixer
-        exclude: '^tests/.*/assets/'
+        exclude: '^tests/.*/assets/|llm_web_kit/model/assets/.*'
       - id: check-merge-conflict
       - id: fix-encoding-pragma
         args: [ "--remove" ]
@@ -46,7 +46,7 @@ repos:
           - mdformat-openmmlab
           - mdformat_frontmatter
           - linkify-it-py
-        exclude: '^tests/.*/assets/'
+        exclude: '^tests/.*/assets/|llm_web_kit/model/assets/.*'
   - repo: https://github.com/myint/docformatter
     rev: v1.3.1
     hooks:
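A note on the new `exclude` pattern in this file: pre-commit treats `exclude` as a Python regular expression and applies it with `re.search`, so in `'^tests/.*/assets/|llm_web_kit/model/assets/.*'` the `^` anchors only the first alternative. The sketch below checks a few paths against the pattern; the helper name `is_excluded` and the sample paths are made up for illustration.

```python
import re

# Pattern copied from the updated hooks in .pre-commit-config.yaml.
EXCLUDE = re.compile(r"^tests/.*/assets/|llm_web_kit/model/assets/.*")


def is_excluded(path: str) -> bool:
    """Return True if a hook using this exclude pattern would skip the path."""
    # re.search finds a match anywhere in the string, which is why the
    # second, unanchored alternative also matches nested locations.
    return EXCLUDE.search(path) is not None


# Hypothetical paths, for illustration only.
for sample in (
    "tests/llm_web_kit/extractor/html/magic_html/assets/page.html",
    "llm_web_kit/model/assets/stop_words.txt",
    "vendor/llm_web_kit/model/assets/stop_words.txt",
    "docs/tests/demo/assets/page.html",
    "llm_web_kit/model/quality_model.py",
):
    print(sample, is_excluded(sample))
```

If the second alternative is meant to match only from the repository root, it would need its own `^`.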
20 changes: 20 additions & 0 deletions README.md
@@ -75,6 +75,26 @@ llm-web-kit is a python library that ..

## Quick Start

```python
from llm_web_kit.simple import extract_html_to_md
import traceback
from loguru import logger

def extract(url:str, html:str) -> str:
    try:
        nlp_md = extract_html_to_md(url, html)
        # or mm_nlp_md = extract_html_to_mm_md(url, html)
        return nlp_md
    except Exception as e:
        logger.exception(e)
        return None

if __name__=="__main__":
    url = ""
    html = ""
    markdown = extract(url, html)
```

## Usage

# TODO
105 changes: 101 additions & 4 deletions bench/Bench.md
@@ -4,22 +4,65 @@

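# Directory structure

- `bench/config/data_config.jsonl`: data processing config file
- `bench/config/ours_config.jsonc`: extractor config file
- `bench/eval/`: implementations of the different evaluation tools
- `bench/common/`: common evaluation utilities and metric computation
- `bench/output/`: output directory for evaluation results

Dataset: the raw web page data is under `bench/data/origin`; the ground truth (GT) is saved under `bench/data/groundtruth` by default.
-Evaluation results: by default, results are saved under `bench/output` in a folder named with a date plus a random number, e.g. `20250212_113509_5bbf75c0`.
+Evaluation results: by default, results are saved under `bench/output/{task_id}`, where `task_id` is a UUID, e.g. `5bbf7c8c-e8f2-11ef-a5a8-acde48001122`.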
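
The sample `task_id` has the shape of a time-based (version 1) UUID. A minimal sketch of creating such an output directory follows; the helper name `make_task_output_dir` is hypothetical, and whether `bench/run.py` actually calls `uuid.uuid1()` is an assumption.

```python
import uuid
from pathlib import Path


def make_task_output_dir(output_root: str) -> Path:
    """Create `<output_root>/<task_id>` and return its path.

    Assumption: task_id is a version 1 UUID, as the sample ID suggests.
    """
    task_id = str(uuid.uuid1())
    task_dir = Path(output_root) / task_id
    # exist_ok=False makes an unexpected ID collision fail loudly.
    task_dir.mkdir(parents=True, exist_ok=False)
    return task_dir
```

Calling `make_task_output_dir("bench/output")` would create a fresh directory for each evaluation run.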

# Usage

## Command-line arguments

```bash
python bench/run.py [--input INPUT_PATH] [--output OUTPUT_PATH] [--tool {ours,magic_html,unstructured}]
```

Arguments:

- `--input`: path to the HTML file
- `--output`: path where the output results are saved
- `--tool`: extraction tool to use, one of:
  - `ours`: use the extraction tool provided by this project (default)
  - `magic_html`: evaluate with the magic_html tool
  - `unstructured`: evaluate with the unstructured tool
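
The documented flags map naturally onto `argparse`. The following is only a sketch of how such a command line could be declared; it is not the actual contents of `bench/run.py`, and the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the three documented options of the benchmark runner."""
    parser = argparse.ArgumentParser(prog="bench/run.py")
    parser.add_argument("--input", help="path to the HTML file")
    parser.add_argument("--output", help="path where the output results are saved")
    parser.add_argument(
        "--tool",
        choices=["ours", "magic_html", "unstructured"],
        default="ours",  # the documentation names `ours` as the default
        help="extraction tool to evaluate",
    )
    return parser


# Parse an explicit argument list so the sketch runs without a real CLI call.
args = build_parser().parse_args(["--tool", "magic_html"])
print(args.tool, args.input, args.output)
```

Using `choices` lets `argparse` reject an unknown tool name before any evaluation starts.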

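## Examples

1. Run the evaluation with the default extractor:

```bash
python bench/run.py
```

2. Compare against another extractor:

```bash
python bench/run.py --tool magic_html
```
-python run.py

3. Specify the input and output paths:

```bash
python bench/run.py --input path/to/input.html --output path/to/output
```

# Evaluation report and metrics

Each evaluation result contains two files, `summary.json` and `detail.json`: `summary.json` holds the aggregate results for the whole evaluation set, and `detail.json` holds the detailed results for each individual web page.

The main evaluation metrics are:

- `type_acc`: element type recognition accuracy
- `content_acc`: element content recognition accuracy
- recognition statistics for each element type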

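# Example evaluation report

-`summary.json` mainly shows the overall and element-level results for evaluation metrics”, “evaluation time across all evaluation data:
+`summary.json` mainly shows the overall and element-level results for "evaluation metrics" and "evaluation time" across all evaluation data: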

```json
{
@@ -52,7 +95,7 @@ python run.py
}
```

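-`detail.json` mainly shows the element-level result details, such as evaluation metrics”, “evaluation time, for each evaluation item, which makes it easy to analyze which element of which web page was extracted poorly:
+`detail.json` mainly shows the element-level result details, such as "evaluation metrics" and "evaluation time", for each evaluation item, which makes it easy to analyze which element of which web page was extracted poorly: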

```json
{
@@ -96,6 +139,60 @@ python run.py
}
```

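# Output file format

Evaluation results are saved in the specified output directory. The output format for each tool is as follows:

## Output of the `ours` tool

The output is in JSONL format; each line is a JSON object with the following fields:

- `url`: URL of the original web page
- `content`: extracted content
- `main_html`: extracted main HTML content
- `content_list`: list of extracted content
- `html`: original HTML
- `statics`: statistics

## Output of the `magic_html` and `unstructured` tools

The output is in JSONL format; each line is a JSON object with the following fields:

- `url`: URL of the original web page
- `content`: extracted content
- `html`: original HTML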
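
A minimal sketch of reading such JSONL output and checking for the documented fields is shown below. The helper `iter_records` and the constant names are hypothetical, not part of the benchmark code; only the field names come from the lists above.

```python
import json
from typing import Iterable, Iterator

# Field names as documented for each tool's JSONL output.
OURS_FIELDS = {"url", "content", "main_html", "content_list", "html", "statics"}
BASELINE_FIELDS = {"url", "content", "html"}  # magic_html and unstructured


def iter_records(lines: Iterable[str], required: set) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line, checking the required fields."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = required - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing fields {sorted(missing)}")
        yield record
```

Because an open file is an iterable of lines, `iter_records(open(path, encoding="utf-8"), OURS_FIELDS)` would stream a result file without loading it all into memory.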

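# Troubleshooting

## Common issues

1. File path problems

If you hit file path errors, check that:

- the paths in the config file are correct
- the file paths exist
- you have sufficient permissions to read and write the relevant directories

2. Encoding problems

HTML files that contain an XML declaration are converted automatically. If you hit encoding errors, you can:

- make sure the HTML file is UTF-8 encoded
- check that the XML declaration is well formed

3. Result output problems

If the output directory cannot be created or the results cannot be written:

- check the write permissions of the target directory
- make sure there is enough disk space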
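
On the XML declaration point under encoding problems: some parsers, lxml among them, reject a `str` input that carries an XML encoding declaration. One common workaround is to drop the declaration before parsing. The sketch below shows that idea only; it is an assumption about what the automatic conversion might look like, not the project's actual implementation, and `strip_xml_declaration` is a hypothetical name.

```python
import re

# Matches a leading `<?xml ... ?>` declaration, allowing whitespace before it.
_XML_DECLARATION = re.compile(r"^\s*<\?xml[^>]*\?>", re.IGNORECASE)


def strip_xml_declaration(html: str) -> str:
    """Return the document without its leading XML declaration, if present."""
    return _XML_DECLARATION.sub("", html, count=1)


doc = '<?xml version="1.0" encoding="UTF-8"?>\n<html><body>hi</body></html>'
print(strip_xml_declaration(doc))
```

An alternative with the same effect is to pass the document to the parser as bytes, which lets the parser honor the declared encoding itself.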

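# How to add evaluation data

The evaluation dataset grows as the `pipeline` features are iterated on. To quickly build the `groundtruth` for new data, follow these steps:

1. Prepare the original HTML file and put it in the `bench/data/origin` directory
2. Add a new test data entry to `bench/config/data_config.jsonl`
3. Run the evaluation tool to generate preliminary results
4. Manually review and correct the results to serve as the groundtruth, then put them in the `bench/data/groundtruth` directory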
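
Steps 1 and 2 can be scripted. The sketch below copies an HTML file into the origin directory and appends one line to the JSONL config. The entry keys `url` and `origin_filepath` are hypothetical placeholders, since the real schema of `data_config.jsonl` is not shown here, and `register_bench_sample` is not an existing helper.

```python
import json
import shutil
from pathlib import Path


def register_bench_sample(html_path: str, url: str, bench_root: str = "bench") -> dict:
    """Copy an HTML file into data/origin and append an entry to data_config.jsonl."""
    bench = Path(bench_root)
    origin_dir = bench / "data" / "origin"
    origin_dir.mkdir(parents=True, exist_ok=True)
    target = origin_dir / Path(html_path).name
    shutil.copyfile(html_path, target)

    # Hypothetical entry layout; adapt the keys to the real config schema.
    entry = {"url": url, "origin_filepath": target.relative_to(bench).as_posix()}

    config_path = bench / "config" / "data_config.jsonl"
    config_path.parent.mkdir(parents=True, exist_ok=True)
    with config_path.open("a", encoding="utf-8") as config_file:
        config_file.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry
```

Steps 3 and 4 stay manual: run `python bench/run.py`, then review the output before saving it under `bench/data/groundtruth`.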