fix: prescan loses outer CTM context for deeply nested text by bsickler · Pull Request #267 · yfedoseev/pdf_oxide

bsickler · 2026-03-20T21:24:46Z

Description

The SIMD prescan optimization for large content streams (>256KB) scans backwards
up to 4KB from each BT operator to capture graphics state context. For deeply
nested text (e.g., chart axis labels inside scaled coordinate systems), the outer
scaling cm operators lie outside that 4KB window and are lost, producing extreme
coordinates (e.g., x=132,145 on a 612-wide page).
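
To make the failure mode concrete, here is a minimal sketch of CTM concatenation in PDF's row-vector convention, where each `cm` premultiplies the current matrix. The `Matrix` alias and the `concat` and `apply` helpers are hypothetical names for illustration, not pdf_oxide's API, and the numbers in the note below are made up rather than taken from the failing PDF.

```rust
/// 2D affine matrix in PDF operand order: [a, b, c, d, e, f].
type Matrix = [f64; 6];

const IDENTITY: Matrix = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0];

/// Returns `m x n`. The `cm` operator updates the CTM as `CTM' = m x CTM`.
fn concat(m: Matrix, n: Matrix) -> Matrix {
    [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]
}

/// Maps a point from user space through `m`: [x y 1] x m.
fn apply(m: Matrix, x: f64, y: f64) -> (f64, f64) {
    (m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5])
}
```

With an assumed outer `0.001 0 0 0.001 0 0 cm` and an inner translation of (100000, 50000), the local point (32145, 0) lands at roughly (132.145, 50) under `concat(inner, concat(outer, IDENTITY))`. If the outer matrix is dropped and only `concat(inner, IDENTITY)` is used, the same point lands at (132145, 50000), far outside a 612-wide page.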

This PR adds a lightweight forward CTM scan that tracks q/Q/cm/Tf across
the full stream and records the correct graphics state at each BT/Do position.
This is much cheaper than full parsing because it only recognizes four operator types
and skips all path, color, and text operators. The prescan performance optimization
is preserved while producing correct coordinates.

When the forward scan cannot be used, the code falls back to full stream parsing
for correctness.
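
The forward scan described above could look roughly like the sketch below. It is not the code in this PR: `GState`, `Snapshot`, and `forward_scan` are hypothetical names, and the tokenizer simply splits on ASCII whitespace, so it assumes a stream with no string literals, comments, or inline image data containing operator-like tokens.

```rust
/// 2D affine matrix in PDF operand order: [a, b, c, d, e, f].
type Matrix = [f64; 6];

const IDENTITY: Matrix = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0];

/// The subset of graphics state this sketch tracks.
#[derive(Clone, Debug, PartialEq)]
struct GState {
    ctm: Matrix,
    font: Option<(String, f64)>, // (resource name, size) from the last Tf
}

/// Graphics state captured at one BT or Do operator.
#[derive(Debug)]
struct Snapshot {
    offset: usize, // byte offset of the operator in the stream
    state: GState,
}

/// Returns `m x n`; `cm` updates the CTM as `CTM' = m x CTM`.
fn concat(m: Matrix, n: Matrix) -> Matrix {
    [
        m[0] * n[0] + m[1] * n[2],
        m[0] * n[1] + m[1] * n[3],
        m[2] * n[0] + m[3] * n[2],
        m[2] * n[1] + m[3] * n[3],
        m[4] * n[0] + m[5] * n[2] + n[4],
        m[4] * n[1] + m[5] * n[3] + n[5],
    ]
}

/// Parses the last six pending operands as a matrix, if all are numeric.
fn parse_matrix(ops: &[&str]) -> Option<Matrix> {
    if ops.len() < 6 {
        return None;
    }
    let mut m = [0.0; 6];
    for (slot, tok) in m.iter_mut().zip(&ops[ops.len() - 6..]) {
        *slot = tok.parse().ok()?;
    }
    Some(m)
}

/// Walks the whole stream once, acting only on q, Q, cm, Tf, BT, and Do.
fn forward_scan(stream: &str) -> Vec<Snapshot> {
    let mut state = GState { ctm: IDENTITY, font: None };
    let mut stack: Vec<GState> = Vec::new();
    let mut operands: Vec<&str> = Vec::new(); // at most the last six tokens
    let mut snapshots = Vec::new();

    for tok in stream.split_ascii_whitespace() {
        match tok {
            "q" => stack.push(state.clone()),
            "Q" => {
                if let Some(saved) = stack.pop() {
                    state = saved;
                }
            }
            "cm" => {
                if let Some(m) = parse_matrix(&operands) {
                    state.ctm = concat(m, state.ctm);
                }
            }
            "Tf" => {
                if let [.., name, size] = operands.as_slice() {
                    if let (Some(n), Ok(s)) = (name.strip_prefix('/'), size.parse::<f64>()) {
                        state.font = Some((n.to_string(), s));
                    }
                }
            }
            "BT" | "Do" => {
                let offset = tok.as_ptr() as usize - stream.as_ptr() as usize;
                snapshots.push(Snapshot { offset, state: state.clone() });
            }
            _ => {
                // Everything else is treated as a possible operand and otherwise ignored.
                if operands.len() == 6 {
                    operands.remove(0);
                }
                operands.push(tok);
                continue;
            }
        }
        operands.clear();
    }
    snapshots
}
```

Because the scan keeps a full save/restore stack, a BT that follows a `Q` sees the CTM and font of the enclosing scope, and a BT with no `Tf` of its own inherits the font set by an earlier block. A production scanner would also need to skip string and inline-image bytes, which this sketch deliberately leaves out.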

Type of Change

Bug fix
New feature
Breaking change
Documentation update

Related Issue

Fixes #265

Testing

10 prescan-related unit tests covering:
- Forward CTM captures outer scaling for text in >256KB streams
- Font state inheritance across consecutive BT blocks without Tf
- CTM correctly restored after nested q/Q scopes
- Multiple BT blocks at different nesting depths get correct CTMs
- Existing prescan tests updated for new PrescanResult enum
Verified against PDFs with chart axis labels in deeply nested coordinate systems
All 4283 library tests pass

Checklist

Tests pass
Code formatted
Documentation updated (no user-facing API changes)


bsickler · 2026-03-20T21:48:07Z

I also ran the pdf_extraction_performance benchmarks and saw no meaningful changes, although I was missing the government/cfr_excerpt.pdf example.

@hansmrtn

Merged PRs:
- #251 fix: prevent panic on multi-byte UTF-8 in log slicing (@hansmrtn)
- #273 fix: markdown mode drops spaces around styled text (@jorlow)
- #266 fix: apply Form XObject /Matrix and correct text matrix advance (@bsickler)
- #267 fix: prescan loses outer CTM context for deeply nested text (@bsickler)
- #261 feat: expose path operations in extract_paths() Python (@willywg)

WASM DevEx alignment:
- Added operations array to extractPaths() WASM output, matching Python (previously only had operations_count)

bsickler added 4 commits March 20, 2026 12:30

- 3197fef fix: fall back to full parse when prescan loses outer CTM context (yfedoseev#265)
- 063bed3 fix: fall back to full parse when prescan loses outer CTM context (yfedoseev#265)
- 590cf66 fix: resolve merge conflicts, keep updated documentation
- 112cd00 test: add forward CTM scan tests for font inheritance, scope restore, and multiple depths
- 39b7d95 docs: remove orphaned doc comment from prescan refactor

bsickler mentioned this pull request Mar 23, 2026

[Bug]: XY-Cut projection allocates 30+ GB from extreme span coordinates (prescan CTM loss) #272

Closed

yfedoseev mentioned this pull request Apr 3, 2026

feat: v0.3.19 — text extraction accuracy, column-aware reading order, and community contributions #279

Merged

5 tasks

yfedoseev closed this in #279 Apr 3, 2026

bsickler wants to merge 5 commits into yfedoseev:main from bsickler:fix/prescan-ctm-loss-extreme-coords
