feat: expose word/line segmentation thresholds to Python by tboser · Pull Request #249 · yfedoseev/pdf_oxide

tboser · 2026-03-13T06:16:22Z

Description

Expose word/line segmentation thresholds and extraction configuration to Python.

Three additions:

Threshold overrides on extract_words() / extract_text_lines() — optional word_gap_threshold and line_gap_threshold kwargs (in PDF points). None preserves current adaptive behavior.
page_layout_params(page) — returns the computed adaptive thresholds and page statistics (LayoutParams object) so callers can inspect values before deciding on overrides.
ExtractionProfile — exposes the 9 pre-tuned extraction profiles (conservative, aggressive, balanced, academic, policy, form, government, scanned_ocr, adaptive) to Python with all their fields.
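To make the effect of a word gap override concrete, here is a minimal, self-contained sketch of gap-based word segmentation. It is not pdf_oxide's implementation: the `Glyph` tuple, `adaptive_word_gap()`, the 30% rule, and `split_words()` are all hypothetical names and assumptions chosen for illustration. The only behavior taken from the PR is that `None` falls back to an adaptive value while a number (in PDF points) overrides it.

```python
from typing import List, Optional, Tuple

# (text, x_start, x_end) for each glyph on one line, in PDF points.
Glyph = Tuple[str, float, float]


def adaptive_word_gap(glyphs: List[Glyph]) -> float:
    """Toy adaptive rule (an assumption): 30% of the mean glyph width."""
    widths = [x1 - x0 for _, x0, x1 in glyphs]
    return 0.3 * (sum(widths) / len(widths))


def split_words(glyphs: List[Glyph],
                word_gap_threshold: Optional[float] = None) -> List[str]:
    """Group glyphs into words; None falls back to the adaptive value."""
    if not glyphs:
        return []
    threshold = (adaptive_word_gap(glyphs)
                 if word_gap_threshold is None else word_gap_threshold)
    words, current = [], glyphs[0][0]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur[1] - prev[2]  # horizontal space between neighbours
        if gap > threshold:
            words.append(current)  # gap is wide enough: start a new word
            current = cur[0]
        else:
            current += cur[0]
    words.append(current)
    return words


# Glyphs are 5 pt wide; the gaps between them are 1 pt, 2 pt and 6 pt.
line = [("A", 0.0, 5.0), ("B", 6.0, 11.0), ("C", 13.0, 18.0), ("D", 24.0, 29.0)]
default_words = split_words(line)                         # adaptive threshold
loose_words = split_words(line, word_gap_threshold=4.0)   # explicit override
print(default_words, loose_words)
```

A larger threshold merges glyphs that the adaptive value would have split, which is the kind of adjustment a wide-spaced invoice might need; a smaller one does the opposite for dense forms.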

Motivation: Different document types benefit from different segmentation thresholds. The current adaptive computation works well for typical documents but can't be tuned for edge cases (dense forms, wide-spaced invoices, multi-column layouts). The internal Rust config system already supports this richness — this PR just threads it through to Python. Relates to #211.

Type of Change

New feature (non-breaking change which adds functionality)

Related Issues

Relates to #211

Changes Made

src/document.rs: Added extract_words_with_thresholds() and extract_text_lines_with_thresholds() — original methods delegate with None, zero breakage.
src/python.rs + src/python_main.rs: Added optional kwargs to extract_words() and extract_text_lines(). Added page_layout_params() method. Added PyLayoutParams and PyExtractionProfile types.
python/pdf_oxide/__init__.py: Exports LayoutParams and ExtractionProfile.
README.md, docs/getting-started-python.md, llms.txt: Updated with examples.

Python API

from pdf_oxide import PdfDocument, ExtractionProfile

doc = PdfDocument("form.pdf")

# === Threshold overrides ===
words = doc.extract_words(0, word_gap_threshold=2.5)
lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)

# Region + thresholds compose:
words = doc.extract_words(0, region=(0, 0, 300, 400), word_gap_threshold=1.5)

# === Inspect adaptive params ===
params = doc.page_layout_params(0)
print(params)
# LayoutParams(word_gap=2.10, line_gap=3.80, char_width=7.00, font_size=12.00, ...)
words = doc.extract_words(0, word_gap_threshold=params.word_gap_threshold * 0.5)

# === Extraction profiles ===
profile = ExtractionProfile.form()
print(profile.name)                # "Form"
print(profile.word_margin_ratio)   # 0.08
ExtractionProfile.available()      # list of all profile names
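The "inspect, then override" pattern above could be wrapped in a small helper. The following is a sketch under stated assumptions: `LayoutParamsStub` is a hypothetical stand-in so the snippet runs without a PDF, `scaled_overrides()` is not part of pdf_oxide, and the `line_gap_threshold` attribute name is assumed by analogy with the `word_gap_threshold` attribute used in the example.

```python
from dataclasses import dataclass


@dataclass
class LayoutParamsStub:
    """Hypothetical stand-in for the object returned by page_layout_params()."""
    word_gap_threshold: float
    line_gap_threshold: float  # attribute name assumed, not confirmed by the PR


def scaled_overrides(params, word_scale=1.0, line_scale=1.0, min_gap=0.1):
    """Build threshold kwargs by scaling the adaptive values.

    min_gap is an arbitrary floor (an assumption) that keeps a very small
    scale factor from producing a near-zero threshold.
    """
    return {
        "word_gap_threshold": max(min_gap, params.word_gap_threshold * word_scale),
        "line_gap_threshold": max(min_gap, params.line_gap_threshold * line_scale),
    }


params = LayoutParamsStub(word_gap_threshold=2.1, line_gap_threshold=3.8)
overrides = scaled_overrides(params, word_scale=0.5)
print(overrides)
# With a real document this would be passed on as:
#   lines = doc.extract_text_lines(0, **overrides)
```

Scaling relative to the adaptive value, rather than hard-coding points, keeps an override meaningful across pages with different font sizes.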

Testing

All new and existing tests pass locally
cargo test --lib — 4279 passed, 0 failed
cargo test --tests — all integration tests pass
cargo clippy -- -D warnings — clean
cargo fmt — clean

Python Bindings (if applicable)

Python bindings updated (both python.rs and python_main.rs)

Documentation

README, docs/getting-started-python.md, llms.txt updated
Examples added for all new APIs
python/pdf_oxide/__init__.py exports new types

Checklist

Code follows project's coding guidelines (CONTRIBUTING.md)
Self-reviewed
Doc comments on all new public items
No new warnings
PR title follows conventional commits format

Add optional `word_gap_threshold` and `line_gap_threshold` kwargs to the `extract_words()` and `extract_text_lines()` Python methods. When None (the default), adaptive thresholds are computed from page statistics as before. When provided (in PDF points), they override the adaptive values for fine-grained control over segmentation.

The Rust API adds `extract_words_with_thresholds()` and `extract_text_lines_with_thresholds()`. The original methods delegate to these with None, so all existing callers are unaffected.

Python usage:

    words = doc.extract_words(0, word_gap_threshold=2.5)
    lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)

Add two new Python types:

- LayoutParams: returned by doc.page_layout_params(page). Exposes the computed adaptive thresholds (word_gap, line_gap) and page statistics (median_char_width, median_font_size, etc.) so callers can make informed decisions about threshold overrides.
- ExtractionProfile: pre-tuned profiles (form, academic, policy, etc.) accessible via static methods. Exposes TJ-pipeline thresholds for different document types.

Both types are registered in the Python module and exported from python/pdf_oxide/__init__.py.

yfedoseev · 2026-04-02T15:48:33Z

Thank you @tboser for this well-designed feature! We'd love to include it but have a few requests before merging:

Missing Python tests — Please add tests for the new word_gap_threshold/line_gap_threshold kwargs on extract_words() and extract_text_lines(), plus page_layout_params().
Docstring bug — In python_main.rs around line 582, the docstrings for has_structure_tree and page_layout_params appear to bleed together. Please add a separator.
ExtractionProfile — Currently exposed as inspect-only with no way to pass a profile to extraction methods. Consider either adding a profile= kwarg, or deferring ExtractionProfile exposure to a follow-up PR.
Note: python_main.rs has been removed in main (replaced by Rylai stub generation in #250, "ci: replace mypy stubgen with Rylai for better .pyi generation (#229)"), so those changes can be dropped.

We're targeting v0.3.19 — happy to include this once the above are addressed!
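The requested Python tests might check properties along the following lines. This is only a sketch: `FakeDoc` is a hypothetical stand-in for `PdfDocument` so the checks run without a PDF fixture, the `check_*` helpers are illustrative names, and the monotonicity property is an assumption about how a gap threshold should behave, not a documented guarantee of pdf_oxide.

```python
class FakeDoc:
    """Hypothetical stand-in for PdfDocument with the same kwarg shape."""

    ADAPTIVE_WORD_GAP = 2.0
    GAPS = [0.5, 3.0, 1.0, 5.0]  # gaps between five consecutive glyphs, in points

    def extract_words(self, page, word_gap_threshold=None):
        threshold = (self.ADAPTIVE_WORD_GAP
                     if word_gap_threshold is None else word_gap_threshold)
        # One word, plus one more for every gap wider than the threshold.
        return ["w"] * (1 + sum(gap > threshold for gap in self.GAPS))


def check_none_matches_default(doc, page=0):
    """Passing None explicitly should equal omitting the kwarg."""
    assert doc.extract_words(page, word_gap_threshold=None) == doc.extract_words(page)


def check_monotonic(doc, page=0, small=0.1, large=100.0):
    """A smaller threshold should never yield fewer words than a larger one."""
    few = doc.extract_words(page, word_gap_threshold=large)
    many = doc.extract_words(page, word_gap_threshold=small)
    assert len(many) >= len(few)


doc = FakeDoc()
check_none_matches_default(doc)
check_monotonic(doc)
default_count = len(doc.extract_words(0))
print(default_count)
```

In a real test suite the same checks would take a `PdfDocument` opened on a fixture file, with analogous checks for `extract_text_lines()` and `page_layout_params()`.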

tboser added 6 commits March 12, 2026 23:00

docs: add threshold parameter examples to Python API docs

6b024df

Update README.md, docs/getting-started-python.md, and llms.txt to document the new word_gap_threshold and line_gap_threshold optional parameters on extract_words() and extract_text_lines().

ci: add workflow to build and release Python wheels from fork

c57c77d

Manual-dispatch workflow that builds manylinux x86_64 and macOS arm64 wheels for Python 3.11 + 3.12, then creates a GitHub release with all wheels attached.

revert: remove build-wheels workflow from PR

d85c38d

fix: pass None thresholds from PyPdfPageRegion delegate calls

8e09f5c

tboser wants to merge 6 commits into yfedoseev:main from tboser:feature/expose-word-line-thresholds
