feat: expose word/line segmentation thresholds to Python #249
Open
tboser wants to merge 6 commits into yfedoseev:main from
Conversation
Add optional `word_gap_threshold` and `line_gap_threshold` kwargs to the `extract_words()` and `extract_text_lines()` Python methods. When `None` (default), adaptive thresholds are computed from page statistics as before. When provided (in PDF points), they override the adaptive values for fine-grained control over segmentation.

The Rust API adds `extract_words_with_thresholds()` and `extract_text_lines_with_thresholds()`. The original methods delegate to these with `None`, so all existing callers are unaffected.

Python usage:

    words = doc.extract_words(0, word_gap_threshold=2.5)
    lines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)
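To make the override semantics concrete, here is a minimal pure-Python model of gap-based word segmentation. It is a sketch, not the pdf_oxide implementation: `split_words`, `adaptive_word_gap`, the glyph tuples, and the 0.3 factor are all hypothetical. Only the `None`-means-adaptive behavior and the PDF-point unit come from the commit message above.

```python
from statistics import median
from typing import List, Optional, Tuple

# Each glyph is (text, x_start, x_end) in PDF points, sorted left to right.
Glyph = Tuple[str, float, float]


def adaptive_word_gap(glyphs: List[Glyph]) -> float:
    """Hypothetical adaptive rule: 30% of the median glyph width."""
    widths = [x_end - x_start for _, x_start, x_end in glyphs]
    return 0.3 * median(widths) if widths else 0.0


def split_words(glyphs: List[Glyph],
                word_gap_threshold: Optional[float] = None) -> List[str]:
    """Start a new word whenever the horizontal gap exceeds the threshold.

    None falls back to the adaptive value; a number overrides it.
    """
    if not glyphs:
        return []
    if word_gap_threshold is None:
        threshold = adaptive_word_gap(glyphs)
    else:
        threshold = word_gap_threshold
    words = [glyphs[0][0]]
    for prev, cur in zip(glyphs, glyphs[1:]):
        gap = cur[1] - prev[2]  # distance from previous glyph's end to this start
        if gap > threshold:
            words.append(cur[0])
        else:
            words[-1] += cur[0]
    return words


# Letter-spaced text: 5 pt glyphs, 2 pt gaps inside words, 6 pt gap between words.
glyphs = [("A", 0, 5), ("B", 7, 12), ("C", 18, 23), ("D", 25, 30)]
default_words = split_words(glyphs)                        # adaptive threshold: 1.5 pt
tuned_words = split_words(glyphs, word_gap_threshold=2.5)  # explicit override
print(default_words, tuned_words)
```

In this toy case the adaptive rule splits every letter into its own word, while the 2.5 pt override recovers the two intended words. This is the kind of edge case the explicit kwargs are meant to address.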
Update `README.md`, `docs/getting-started-python.md`, and `llms.txt` to document the new optional `word_gap_threshold` and `line_gap_threshold` parameters on `extract_words()` and `extract_text_lines()`.
Add two new Python types:

- `LayoutParams`: returned by `doc.page_layout_params(page)`, exposes the computed adaptive thresholds (`word_gap`, `line_gap`) and page statistics (`median_char_width`, `median_font_size`, etc.) so callers can make informed decisions about threshold overrides.
- `ExtractionProfile`: pre-tuned profiles (form, academic, policy, etc.) accessible via static methods. Exposes TJ-pipeline thresholds for different document types.

Both types are registered in the Python module and exported from `python/pdf_oxide/__init__.py`.
Manual-dispatch workflow that builds manylinux x86_64 and macOS arm64 wheels for Python 3.11 + 3.12, then creates a GitHub release with all wheels attached.
Owner
Thank you @tboser for this well-designed feature! We'd love to include it but have a few requests before merging:
We're targeting v0.3.19, and we're happy to include this once the above are addressed!
Description
Expose word/line segmentation thresholds and extraction configuration to Python.
Three additions:
- Threshold overrides on `extract_words()` / `extract_text_lines()`: optional `word_gap_threshold` and `line_gap_threshold` kwargs (in PDF points). `None` preserves current adaptive behavior.
- `page_layout_params(page)`: returns the computed adaptive thresholds and page statistics (`LayoutParams` object) so callers can inspect values before deciding on overrides.
- `ExtractionProfile`: exposes the 9 pre-tuned extraction profiles (conservative, aggressive, balanced, academic, policy, form, government, scanned_ocr, adaptive) to Python with all their fields.

Motivation: Different document types benefit from different segmentation thresholds. The current adaptive computation works well for typical documents but can't be tuned for edge cases (dense forms, wide-spaced invoices, multi-column layouts). The internal Rust config system already supports this richness; this PR just threads it through to Python. Relates to #211.
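The inspect-then-override workflow could look roughly like the sketch below. `LayoutParamsModel` is a stand-in dataclass, not the real `LayoutParams` type; it models only the four attribute names mentioned in this PR. The `pick_word_gap_override` helper and its 0.25 ratio are a hypothetical heuristic invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LayoutParamsModel:
    """Stand-in for LayoutParams; only fields named in this PR are modeled."""
    word_gap: float           # adaptive word-gap threshold, PDF points
    line_gap: float           # adaptive line-gap threshold, PDF points
    median_char_width: float  # page statistic, PDF points
    median_font_size: float   # page statistic, PDF points


def pick_word_gap_override(params: LayoutParamsModel,
                           min_ratio: float = 0.25) -> Optional[float]:
    """Return an override in PDF points, or None to keep the adaptive value.

    Hypothetical heuristic: if the adaptive word gap is below min_ratio of the
    median character width, raise it to that floor.
    """
    floor = min_ratio * params.median_char_width
    return floor if params.word_gap < floor else None


dense_form = LayoutParamsModel(word_gap=0.8, line_gap=3.0,
                               median_char_width=5.0, median_font_size=9.0)
typical = LayoutParamsModel(word_gap=1.5, line_gap=4.0,
                            median_char_width=5.0, median_font_size=10.0)

override_dense = pick_word_gap_override(dense_form)   # floor of 1.25 pt applies
override_typical = pick_word_gap_override(typical)    # adaptive value is kept
print(override_dense, override_typical)
```

With the real bindings, the result would presumably be passed straight through as `doc.extract_words(page, word_gap_threshold=override)`, since `None` preserves the adaptive behavior.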
Type of Change
Related Issues
Relates to #211
Changes Made
- `src/document.rs`: Added `extract_words_with_thresholds()` and `extract_text_lines_with_thresholds()`; the original methods delegate with `None`, zero breakage.
- `src/python.rs` + `src/python_main.rs`: Added optional kwargs to `extract_words()` and `extract_text_lines()`. Added `page_layout_params()` method. Added `PyLayoutParams` and `PyExtractionProfile` types.
- `python/pdf_oxide/__init__.py`: Exports `LayoutParams` and `ExtractionProfile`.
- `README.md`, `docs/getting-started-python.md`, `llms.txt`: Updated with examples.

Python API
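The delegation pattern described for `src/document.rs` can be modeled in a few lines of Python. This is an illustrative toy, not the library's code: `PageModel`, its baseline list, and the "half the smallest step" adaptive rule are assumptions. Only the two method names and the delegate-with-`None` design come from the change list above.

```python
from typing import List, Optional


class PageModel:
    """Toy page holding baseline y-coordinates (PDF points), top to bottom."""

    def __init__(self, baselines: List[float]) -> None:
        self.baselines = sorted(baselines, reverse=True)

    def _adaptive_line_gap(self) -> float:
        # Hypothetical adaptive rule: half the smallest non-zero baseline step.
        steps = [a - b for a, b in zip(self.baselines, self.baselines[1:]) if a - b > 0]
        return 0.5 * min(steps) if steps else 0.0

    def extract_text_lines_with_thresholds(
        self, line_gap_threshold: Optional[float]
    ) -> List[List[float]]:
        """Group baselines into lines; None selects the adaptive threshold."""
        if line_gap_threshold is None:
            threshold = self._adaptive_line_gap()
        else:
            threshold = line_gap_threshold
        lines: List[List[float]] = []
        for y in self.baselines:
            # Same line if the drop from the previous baseline is within threshold.
            if lines and lines[-1][-1] - y <= threshold:
                lines[-1].append(y)
            else:
                lines.append([y])
        return lines

    def extract_text_lines(self) -> List[List[float]]:
        # Original entry point: delegates with None, so existing callers
        # get the same adaptive behavior as before.
        return self.extract_text_lines_with_thresholds(None)


# A span shifted by 1 pt (for example a superscript) sits just below the first baseline.
page = PageModel([700.0, 699.0, 688.0, 676.0])
adaptive_lines = page.extract_text_lines()
merged_lines = page.extract_text_lines_with_thresholds(4.0)
print(adaptive_lines, merged_lines)
```

Keeping the original method as a thin wrapper means the new parameter never appears in existing call sites, which is what makes the change non-breaking.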
Testing
- `cargo test --lib`: 4279 passed, 0 failed
- `cargo test --tests`: all integration tests pass
- `cargo clippy -- -D warnings`: clean
- `cargo fmt`: clean

Python Bindings (if applicable)
`python.rs` and `python_main.rs`)

Documentation
Checklist