feat: expose word/line segmentation thresholds to Python #249

Open
tboser wants to merge 6 commits into yfedoseev:main from tboser:feature/expose-word-line-thresholds

@tboser commented Mar 13, 2026

Description

Expose word/line segmentation thresholds and extraction configuration to Python.

Three additions:

  1. Threshold overrides on extract_words() / extract_text_lines() — optional word_gap_threshold and line_gap_threshold kwargs (in PDF points). None preserves current adaptive behavior.

  2. page_layout_params(page) — returns the computed adaptive thresholds and page statistics (LayoutParams object) so callers can inspect values before deciding on overrides.

  3. ExtractionProfile — exposes the 9 pre-tuned extraction profiles (conservative, aggressive, balanced, academic, policy, form, government, scanned_ocr, adaptive) to Python with all their fields.

Motivation: Different document types benefit from different segmentation thresholds. The current adaptive computation works well for typical documents but can't be tuned for edge cases (dense forms, wide-spaced invoices, multi-column layouts). The internal Rust config system already supports this richness — this PR just threads it through to Python. Relates to #211.

Type of Change

  • New feature (non-breaking change which adds functionality)

Related Issues

Relates to #211

Changes Made

  • src/document.rs: Added extract_words_with_thresholds() and extract_text_lines_with_thresholds() — original methods delegate with None, zero breakage.
  • src/python.rs + src/python_main.rs: Added optional kwargs to extract_words() and extract_text_lines(). Added page_layout_params() method. Added PyLayoutParams and PyExtractionProfile types.
  • python/pdf_oxide/__init__.py: Exports LayoutParams and ExtractionProfile.
  • README.md, docs/getting-started-python.md, llms.txt: Updated with examples.

Python API

from pdf_oxide import PdfDocument, ExtractionProfile

doc = PdfDocument("form.pdf")

# === Threshold overrides ===
words = doc.extract_words(0, word_gap_threshold=2.5)
lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)

# Region + thresholds compose:
words = doc.extract_words(0, region=(0, 0, 300, 400), word_gap_threshold=1.5)

# === Inspect adaptive params ===
params = doc.page_layout_params(0)
print(params)
# LayoutParams(word_gap=2.10, line_gap=3.80, char_width=7.00, font_size=12.00, ...)
words = doc.extract_words(0, word_gap_threshold=params.word_gap_threshold * 0.5)

# === Extraction profiles ===
profile = ExtractionProfile.form()
print(profile.name)                # "Form"
print(profile.word_margin_ratio)   # 0.08
ExtractionProfile.available()      # list of all profile names
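
To make the effect of a word-gap override concrete, here is a minimal, self-contained sketch of gap-based segmentation. It does not use pdf_oxide and is not the library's algorithm: the `Glyph` tuple, `group_words()`, `adaptive_word_gap()`, and the 0.3 ratio are all hypothetical, chosen only to illustrate how passing `None` could fall back to a value derived from page statistics while an explicit value in points replaces it.

```python
from statistics import median
from typing import List, Optional, Tuple

# (text, x_start, x_end) in PDF points; a hypothetical stand-in for positioned glyphs.
Glyph = Tuple[str, float, float]


def adaptive_word_gap(glyphs: List[Glyph], ratio: float = 0.3) -> float:
    """Illustrative adaptive rule (an assumption): a fraction of the median glyph width."""
    widths = [x_end - x_start for _, x_start, x_end in glyphs]
    return ratio * median(widths)


def group_words(glyphs: List[Glyph],
                word_gap_threshold: Optional[float] = None) -> List[str]:
    """Split glyphs on one line into words where the horizontal gap exceeds the threshold."""
    if not glyphs:
        return []
    if word_gap_threshold is None:
        # None keeps the adaptive behavior, mirroring the kwarg semantics described above.
        word_gap_threshold = adaptive_word_gap(glyphs)
    words = [glyphs[0][0]]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur[1] - prev[2]  # start of current glyph minus end of previous glyph
        if gap > word_gap_threshold:
            words.append(cur[0])   # gap is wide enough: start a new word
        else:
            words[-1] += cur[0]    # gap is narrow: extend the current word
    return words


# Gaps between glyphs are 1.0, 3.0 and 1.0 points; every glyph is 6.0 points wide.
glyphs = [("A", 0.0, 6.0), ("B", 7.0, 13.0), ("C", 16.0, 22.0), ("D", 23.0, 29.0)]

adaptive = group_words(glyphs)                      # threshold derived from glyph widths
tight = group_words(glyphs, word_gap_threshold=0.5)
loose = group_words(glyphs, word_gap_threshold=4.0)
```

In this toy model a smaller threshold splits more aggressively (useful for dense forms), and a larger one merges more (useful for wide-spaced text). Whether pdf_oxide compares gaps in exactly this way is not stated in this PR.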

Testing

  • All new and existing tests pass locally
  • cargo test --lib — 4279 passed, 0 failed
  • cargo test --tests — all integration tests pass
  • cargo clippy -- -D warnings — clean
  • cargo fmt — clean

Python Bindings (if applicable)

  • Python bindings updated (both python.rs and python_main.rs)

Documentation

  • README, docs/getting-started-python.md, llms.txt updated
  • Examples added for all new APIs
  • python/pdf_oxide/__init__.py exports new types

Checklist

  • Code follows project's coding guidelines (CONTRIBUTING.md)
  • Self-reviewed
  • Doc comments on all new public items
  • No new warnings
  • PR title follows conventional commits format

tboser added 6 commits March 12, 2026 23:00
Add optional `word_gap_threshold` and `line_gap_threshold` kwargs to
`extract_words()` and `extract_text_lines()` Python methods. When None
(default), adaptive thresholds are computed from page statistics as
before. When provided (in PDF points), they override the adaptive
values for fine-grained control over segmentation.

Rust API adds `extract_words_with_thresholds()` and
`extract_text_lines_with_thresholds()` — the original methods delegate
to these with None, so all existing callers are unaffected.

Python usage:
  words = doc.extract_words(0, word_gap_threshold=2.5)
  lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)

Update README.md, docs/getting-started-python.md, and llms.txt to
document the new word_gap_threshold and line_gap_threshold optional
parameters on extract_words() and extract_text_lines().

Add two new Python types:
- LayoutParams: returned by doc.page_layout_params(page), exposes the
  computed adaptive thresholds (word_gap, line_gap) and page statistics
  (median_char_width, median_font_size, etc.) so callers can make
  informed decisions about threshold overrides.
- ExtractionProfile: pre-tuned profiles (form, academic, policy, etc.)
  accessible via static methods. Exposes TJ-pipeline thresholds for
  different document types.

Both types are registered in the Python module and exported from
python/pdf_oxide/__init__.py.

Manual-dispatch workflow that builds manylinux x86_64 and macOS arm64
wheels for Python 3.11 + 3.12, then creates a GitHub release with all
wheels attached.

@yfedoseev
Owner

Thank you @tboser for this well-designed feature! We'd love to include it but have a few requests before merging:

  1. Missing Python tests — Please add tests for the new word_gap_threshold/line_gap_threshold kwargs on extract_words() and extract_text_lines(), plus page_layout_params().
  2. Docstring bug — In python_main.rs around line 582, the docstrings for has_structure_tree and page_layout_params appear to bleed together. Please add a separator.
  3. ExtractionProfile — Currently exposed as inspect-only with no way to pass a profile to extraction methods. Consider either adding a profile= kwarg, or deferring ExtractionProfile exposure to a follow-up PR.
  4. Note: python_main.rs has been removed in main (replaced by Rylai stub generation in #250, "ci: replace mypy stubgen with Rylai for better .pyi generation (#229)"), so those changes can be dropped.

We're targeting v0.3.19 — happy to include this once the above are addressed!
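
One possible shape for the Python tests requested in item 1 is sketched below. It is only a sketch under stated assumptions: the two `check_*` helpers are hypothetical, they rely solely on the `extract_words(page, word_gap_threshold=...)` signature shown in the PR description, and they compare word counts because the return type's fields are not specified here. `StubDoc` is a stand-in so the example runs without a PDF; real tests would pass a `PdfDocument` opened on a fixture file instead.

```python
from typing import List, Optional


def check_none_matches_default(doc, page: int = 0) -> None:
    """Passing word_gap_threshold=None should behave like omitting the kwarg."""
    explicit_none = doc.extract_words(page, word_gap_threshold=None)
    default = doc.extract_words(page)
    assert len(explicit_none) == len(default)


def check_tighter_gap_never_yields_fewer_words(doc, page: int = 0) -> None:
    """Assumed invariant: a smaller gap threshold can only split more, never merge more."""
    loose = doc.extract_words(page, word_gap_threshold=10.0)
    tight = doc.extract_words(page, word_gap_threshold=0.5)
    assert len(tight) >= len(loose)


class StubDoc:
    """Hypothetical stand-in for PdfDocument with a single line of text."""

    def __init__(self, gaps: List[float], adaptive_gap: float) -> None:
        self._gaps = gaps                  # horizontal gaps between glyphs, in points
        self._adaptive_gap = adaptive_gap  # value used when no override is given

    def extract_words(self, page: int,
                      word_gap_threshold: Optional[float] = None) -> List[str]:
        threshold = (self._adaptive_gap if word_gap_threshold is None
                     else word_gap_threshold)
        # Each gap wider than the threshold starts one additional word.
        count = 1 + sum(1 for gap in self._gaps if gap > threshold)
        return [f"word{i}" for i in range(count)]


stub = StubDoc(gaps=[1.0, 3.0, 1.0], adaptive_gap=2.0)
check_none_matches_default(stub)
check_tighter_gap_never_yields_fewer_words(stub)
```

The same pattern would extend to `line_gap_threshold` on `extract_text_lines()` and to a check that `page_layout_params()` returns positive thresholds, though those details depend on the final API.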
