fix: prescan loses outer CTM context for deeply nested text #267

Closed
bsickler wants to merge 5 commits into yfedoseev:main from bsickler:fix/prescan-ctm-loss-extreme-coords

@bsickler
Contributor

Description

The SIMD prescan optimization for large content streams (>256KB) scans backwards
up to 4KB from each BT operator to capture graphics state context. For deeply
nested text (e.g., chart axis labels inside scaled coordinate systems), the outer
scaling cm operators can sit more than 4KB before the BT operator. They fall
outside the scan window and are lost, so the text is placed at extreme coordinates
(e.g., x=132,145 on a 612-wide page).
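
To make the failure mode concrete, here is a minimal sketch of the matrix arithmetic, not code from this PR. The `Matrix` type, the `ctm_from` helper, and the 0.001 outer scale are illustrative assumptions; only the x=132,145 figure comes from the description above. It shows how the same inner x coordinate lands on the page when the outer `cm` is applied and flies off it when that `cm` is dropped.

```rust
/// PDF affine matrix in operand order [a b c d e f].
/// Hypothetical helper type for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Matrix([f64; 6]);

impl Matrix {
    const IDENTITY: Matrix = Matrix([1.0, 0.0, 0.0, 1.0, 0.0, 0.0]);

    /// Row-vector product `self x other`: apply `self` first, then `other`.
    fn then(self, other: Matrix) -> Matrix {
        let (m, n) = (self.0, other.0);
        Matrix([
            m[0] * n[0] + m[1] * n[2],
            m[0] * n[1] + m[1] * n[3],
            m[2] * n[0] + m[3] * n[2],
            m[2] * n[1] + m[3] * n[3],
            m[4] * n[0] + m[5] * n[2] + n[4],
            m[4] * n[1] + m[5] * n[3] + n[5],
        ])
    }

    /// Map a point through this matrix.
    fn apply(self, x: f64, y: f64) -> (f64, f64) {
        let m = self.0;
        (m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5])
    }
}

/// Fold a sequence of `cm` operands (outermost first) into one CTM.
/// Each `cm` premultiplies the current matrix: CTM' = M x CTM.
fn ctm_from(cms: &[Matrix]) -> Matrix {
    cms.iter().fold(Matrix::IDENTITY, |ctm, m| m.then(ctm))
}
```

With an assumed outer scale of 0.001, `ctm_from(&[outer]).apply(132_145.0, 0.0)` gives an x of roughly 132.145, which fits on a 612-wide page. With the outer matrix missing, `ctm_from(&[])` is the identity and the same point stays at x = 132145.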

This PR adds a lightweight forward CTM scan that tracks q/Q/cm/Tf across
the full stream and records the correct graphics state at each BT/Do position.
This is much cheaper than full parsing because it interprets only four
state-changing operators (q, Q, cm, Tf), treats BT and Do purely as recording
points, and skips all path, color, and text operators. The prescan performance
optimization is preserved while producing correct coordinates.
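
The forward scan described above could look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the PR's implementation: `forward_scan`, `GState`, `Snapshot`, and `Matrix` are hypothetical names, and the tokenizer just splits on whitespace, so it would misread operator-like words inside string literals, comments, or inline image data, which a real scanner has to skip.

```rust
/// PDF affine matrix [a b c d e f]; hypothetical helper type.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Matrix([f64; 6]);

impl Matrix {
    const IDENTITY: Matrix = Matrix([1.0, 0.0, 0.0, 1.0, 0.0, 0.0]);

    /// Row-vector product `self x other`: apply `self` first, then `other`.
    fn then(self, other: Matrix) -> Matrix {
        let (m, n) = (self.0, other.0);
        Matrix([
            m[0] * n[0] + m[1] * n[2],
            m[0] * n[1] + m[1] * n[3],
            m[2] * n[0] + m[3] * n[2],
            m[2] * n[1] + m[3] * n[3],
            m[4] * n[0] + m[5] * n[2] + n[4],
            m[4] * n[1] + m[5] * n[3] + n[5],
        ])
    }
}

/// The slice of graphics state the scan tracks: CTM plus current font.
#[derive(Clone, Debug, PartialEq)]
struct GState {
    ctm: Matrix,
    font: Option<(String, f64)>,
}

/// Graphics state captured at the byte offset of a BT or Do operator.
#[derive(Debug, PartialEq)]
struct Snapshot {
    offset: usize,
    state: GState,
}

/// Parse the last six operands as a matrix, if they are all numbers.
fn parse_matrix(operands: &[&str]) -> Option<Matrix> {
    if operands.len() < 6 {
        return None;
    }
    let mut m = [0.0; 6];
    for (slot, tok) in m.iter_mut().zip(&operands[operands.len() - 6..]) {
        *slot = tok.parse().ok()?;
    }
    Some(Matrix(m))
}

/// Single forward pass: interpret q, Q, cm, Tf; record state at BT and Do.
fn forward_scan(src: &str) -> Vec<Snapshot> {
    let mut state = GState { ctm: Matrix::IDENTITY, font: None };
    let mut saved: Vec<GState> = Vec::new();
    let mut operands: Vec<&str> = Vec::new();
    let mut out = Vec::new();

    for tok in src.split_ascii_whitespace() {
        // Numbers and /Names are operands; everything else ends an operation.
        if tok.starts_with('/') || tok.parse::<f64>().is_ok() {
            operands.push(tok);
            continue;
        }
        match tok {
            "q" => saved.push(state.clone()),
            "Q" => {
                // Ignore an unbalanced Q instead of failing in this sketch.
                if let Some(prev) = saved.pop() {
                    state = prev;
                }
            }
            "cm" => {
                if let Some(m) = parse_matrix(&operands) {
                    state.ctm = m.then(state.ctm); // CTM' = M x CTM
                }
            }
            "Tf" => {
                if let [.., name, size] = operands.as_slice() {
                    if let Ok(size) = size.parse::<f64>() {
                        let name = name.trim_start_matches('/').to_string();
                        state.font = Some((name, size));
                    }
                }
            }
            "BT" | "Do" => {
                // Byte offset of this token within the scanned stream.
                let offset = tok.as_ptr() as usize - src.as_ptr() as usize;
                out.push(Snapshot { offset, state: state.clone() });
            }
            _ => {} // path, color, and text operators are skipped
        }
        operands.clear();
    }
    out
}
```

Because the state is carried forward from the start of the stream, a BT block nested under several `q ... cm` scopes sees every enclosing matrix no matter how far back it was set, and a BT block with no `Tf` of its own inherits the font set earlier in the same scope.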

When the forward scan cannot be used, the code falls back to full stream parsing
for correctness.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update

Related Issue

Fixes #265

Testing

  • 10 prescan-related unit tests covering:
    • Forward CTM captures outer scaling for text in >256KB streams
    • Font state inheritance across consecutive BT blocks without Tf
    • CTM correctly restored after nested q/Q scopes
    • Multiple BT blocks at different nesting depths get correct CTMs
    • Existing prescan tests updated for new PrescanResult enum
  • Verified against PDFs with chart axis labels in deeply nested coordinate systems
  • All 4283 library tests pass

Checklist

  • Tests pass
  • Code formatted
  • Documentation updated (no user-facing API changes)

@bsickler
Contributor Author

bsickler commented Mar 20, 2026

I also ran the pdf_extraction_performance benchmarks and saw no meaningful changes, although I was missing the government/cfr_excerpt.pdf example.

yfedoseev added a commit that referenced this pull request Apr 3, 2026
Merged PRs:
- #251 fix: prevent panic on multi-byte UTF-8 in log slicing (@hansmrtn)
- #273 fix: markdown mode drops spaces around styled text (@jorlow)
- #266 fix: apply Form XObject /Matrix and correct text matrix advance (@bsickler)
- #267 fix: prescan loses outer CTM context for deeply nested text (@bsickler)
- #261 feat: expose path operations in extract_paths() Python (@willywg)

WASM DevEx alignment:
- Added operations array to extractPaths() WASM output, matching Python
  (previously only had operations_count)
Development

Successfully merging this pull request may close these issues.

[Bug]: Text spans have extreme/nonsensical coordinates for PDFs with large content streams (>256KB)