This repository is a sub-project intended to be integrated into the main GLM-OCR project at `tools/example_eval/`. It scores `examples/result` against:
- `examples/reference_result` for upstream parity
- `examples/golden_result` for golden-adjudicated quality
- `tools/example_eval/config/rules/*.yaml` for deterministic example-specific checks
The evaluator is intentionally stand-alone from the main project. It lives under `tools/example_eval/`, reads the existing example corpus, and writes reports under `.build/example_eval/` in the project root.
The scorer produces three primary dimensions per example:
- `text_fidelity`: first-class text/content quality, including stricter handling for fenced code
- `critical_structure`: first-class structure quality, especially tables and OCR JSON block structure
- `decorative_style`: second-class markdown/style fidelity (bold, centering wrappers, fence labels, etc.)
It then computes:
- `parity_overall`: `result` vs `reference_result`
- `result_to_golden_overall`: `result` vs `golden_result`
- `reference_to_golden_overall`: `reference_result` vs `golden_result`
- `quality_overall` (derived): `result_to_golden_overall` when available; otherwise `parity_overall`
- `final_overall`: parity-first score with a small golden correction
- `final_minus_quality` (diagnostic): `final_overall - quality_overall` (high values usually mean upstream is also far from golden)
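As a rough illustration of how the derived scores above fit together, here is a minimal Python sketch. The linear `GOLDEN_CORRECTION` blend and its value are assumptions standing in for the "small golden correction"; the actual combination and weights live in the evaluator's policy, not here.

```python
# Illustrative sketch (not the evaluator's actual code): how the derived
# scores relate to the pairwise comparisons described above.
from typing import Optional

GOLDEN_CORRECTION = 0.1  # hypothetical weight; the real knob lives in the policy


def derive_scores(parity_overall: float,
                  result_to_golden: Optional[float]) -> dict:
    # quality_overall: the golden comparison when available, else parity
    quality_overall = (result_to_golden
                       if result_to_golden is not None
                       else parity_overall)
    # final_overall: parity-first, nudged toward the golden comparison
    final_overall = ((1 - GOLDEN_CORRECTION) * parity_overall
                     + GOLDEN_CORRECTION * quality_overall)
    return {
        "quality_overall": quality_overall,
        "final_overall": final_overall,
        # diagnostic: large values mean upstream is also far from golden
        "final_minus_quality": final_overall - quality_overall,
    }
```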
Recommended usage:
- use `parity_overall` / `final_overall` for regression detection against the upstream reference
- use `quality_overall` for absolute OCR usefulness when a golden baseline exists
From the project root:
```bash
uv run --project tools/example_eval example-eval evaluate --repo-root .
```

Or without installing the package:

```bash
PYTHONPATH=tools/example_eval/src python -m example_eval evaluate --repo-root .
```

The default report directory is `.build/example_eval/`.
Evaluate all examples:
```bash
uv run --project tools/example_eval example-eval evaluate --repo-root .
```

Evaluate one example:

```bash
uv run --project tools/example_eval example-eval evaluate --repo-root . --example handwritten
```

Note: `--example` must match an example discovered from `examples/source/*` stems; unknown names now fail fast.
Fail the command if any example falls below a threshold:
```bash
uv run --project tools/example_eval example-eval evaluate --repo-root . --fail-under 0.90
```

`--fail-under` must be within [0, 1].
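The threshold gate can be pictured with a small sketch. This is illustrative only; `check_fail_under` is a hypothetical helper, not the repository's actual CLI code.

```python
# Hypothetical sketch of the --fail-under gate: validate the threshold,
# then return a non-zero exit status if any example scores below it.
import sys


def check_fail_under(scores: dict, fail_under: float) -> int:
    if not 0.0 <= fail_under <= 1.0:
        raise ValueError("--fail-under must be within [0, 1]")
    failing = {name: s for name, s in scores.items() if s < fail_under}
    for name, s in sorted(failing.items()):
        print(f"FAIL {name}: {s:.2f} < {fail_under:.2f}", file=sys.stderr)
    return 1 if failing else 0
```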
Run tests:
```bash
uv run --project tools/example_eval pytest tools/example_eval/tests
```

```
tools/example_eval/
  README.md
  AGENTS.md
  pyproject.toml
  config/
    policy.yaml
    rules/
      GLM-4.5V_Pages_1_2_3.yaml
      code.yaml
      handwritten.yaml
      page.yaml
      paper.yaml
  src/example_eval/
    cli.py
    evaluator.py
    json_metrics.py
    markdown_ir.py
    policy.py
    report.py
    rules.py
    text_metrics.py
    types.py
  tests/
```
`config/policy.yaml` controls:
- scoring weights
- table vs JSON structure weighting
- golden adjudication strength
- rule adjudication strength/weights
- report thresholds (e.g., inflation warnings)
- optional CI failure threshold
`config/rules/` stores deterministic, example-specific checks (with severities like minor/major/critical). The repo currently includes:
- `handwritten.yaml` to reward the corrected 人间 reading
- `GLM-4.5V_Pages_1_2_3.yaml` to encode the verified page-boundary/continuation notes
- `page.yaml` to anchor the `0.2\\mathrm{N} / \\mathrm{mm}^{2}` glue-strength constant/unit
- `paper.yaml` to anchor fragile math/notation phrases (e.g., `not divisible by Q`, `\\nabla^2`)
- `code.yaml` to anchor critical XML tags/identifiers (e.g., `local-jndi-name`, `weblogic-rdbms-bean`)
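A minimal sketch of how such a rule file might be applied, assuming a simple `must_contain`-style schema and made-up severity penalties. Both the field names and the `SEVERITY_PENALTY` values are illustrative; the real schema and weights are defined by the rules files and the policy.

```python
# Hypothetical rule application: each rule anchors a literal phrase in the
# result text and carries a severity; missing phrases accumulate a penalty.
SEVERITY_PENALTY = {"minor": 0.02, "major": 0.05, "critical": 0.15}  # assumed


def apply_rules(text: str, rules: list) -> float:
    """Return the total penalty for anchored phrases missing from `text`."""
    penalty = 0.0
    for rule in rules:
        if rule["must_contain"] not in text:
            penalty += SEVERITY_PENALTY[rule["severity"]]
    return penalty
```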
This scaffold is meant to be immediately useful, not just aspirational. It already:
- canonicalizes HTML tables and Markdown pipe tables to the same internal representation
- scores prose/code/table content separately
- reads OCR JSON block lists for structural signals
- applies golden adjudication per dimension
- emits Markdown, JSON, and JUnit reports
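As an illustration of the first point, a stripped-down canonicalizer might reduce both table flavors to the same list-of-rows form so they can be compared directly. This is a sketch under simplifying assumptions (no cell attributes, no spans), not the repository's actual `markdown_ir` implementation.

```python
# Toy canonicalization: Markdown pipe tables and plain HTML tables both
# become a list of rows of stripped cell strings.
import re


def pipe_table_rows(md: str) -> list:
    rows = []
    for line in md.strip().splitlines():
        line = line.strip().strip("|")
        if set(line.replace("|", "").strip()) <= set("-: "):
            continue  # skip the header separator row
        rows.append([cell.strip() for cell in line.split("|")])
    return rows


def html_table_rows(html: str) -> list:
    rows = []
    for tr in re.findall(r"<tr>(.*?)</tr>", html, re.S):
        rows.append([c.strip() for c in re.findall(r"<t[dh]>(.*?)</t[dh]>", tr, re.S)])
    return rows
```

With both sides normalized to the same representation, cell-level scoring reduces to comparing two nested lists.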
It is still a starter repo. Obvious next extensions are:
- stronger Markdown/HTML AST canonicalization
- richer table tree-edit metrics
- more rule types
- CI wiring in the parent repository