fix: apply Form XObject /Matrix and correct text matrix advance for rotated text #266

Open
bsickler wants to merge 4 commits into yfedoseev:main from bsickler:fix/apply-form-xobject-matrix-in-text-extraction

Description

Two related fixes for incorrect text span coordinates during extraction:

  1. Form XObject /Matrix support: Apply the Form XObject's /Matrix entry when
    extracting text, with implicit q/Q save/restore per ISO 32000-1 §8.10.1.
    Previously the /Matrix was ignored, causing text inside transformed Form XObjects
    to have incorrect positions.

  2. Rotated text matrix advance: Fix divide-by-zero when advancing the text matrix
    for 90°-rotated text. The advance was computed as total_width / text_matrix.d.abs(),
    which explodes when d ≈ 0 (e.g., y = 599,776,832 for rotated chart axis labels).
    Per ISO 32000-1:2008 §9.4.4, the correct update is Tm_new = [1 0 0 1 tx 0] × Tm_old,
    which requires no division. Also fixes advance_position_for_offset (TJ numeric offsets)
    which only updated text_matrix.e, ignoring f.
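
The update rule in item 2 can be sketched as follows. This is a minimal illustration, not the code in this PR: the `Matrix` struct and the `advance_text_matrix` helper are hypothetical names, and the matrix follows the PDF row-vector convention `[a b c d e f]`.

```rust
/// PDF-style affine matrix [a b c d e f] (row-vector convention).
/// Hypothetical type for illustration; the library's own type may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Matrix {
    pub a: f64,
    pub b: f64,
    pub c: f64,
    pub d: f64,
    pub e: f64,
    pub f: f64,
}

impl Matrix {
    /// Returns `self × other`: `self` is applied first, then `other`.
    pub fn multiply(&self, other: &Matrix) -> Matrix {
        Matrix {
            a: self.a * other.a + self.b * other.c,
            b: self.a * other.b + self.b * other.d,
            c: self.c * other.a + self.d * other.c,
            d: self.c * other.b + self.d * other.d,
            e: self.e * other.a + self.f * other.c + other.e,
            f: self.e * other.b + self.f * other.d + other.f,
        }
    }
}

/// Advances the text matrix by a text-space displacement `tx`:
/// Tm_new = [1 0 0 1 tx 0] × Tm_old. No division is involved,
/// so d ≈ 0 (90° rotation) is harmless.
pub fn advance_text_matrix(tm: &Matrix, tx: f64) -> Matrix {
    let translate = Matrix { a: 1.0, b: 0.0, c: 0.0, d: 1.0, e: tx, f: 0.0 };
    translate.multiply(tm)
}
```

Expanding the product gives `e_new = e + tx * a` and `f_new = f + tx * b`. For text rotated by 90° (`a = 0`, `b = 1`), the advance therefore moves along `f` only and stays finite.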

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update

Related Issue

Fixes #264

Testing

  • Integration test suite (tests/test_form_xobject_matrix.rs) with 6 tests covering:
    • Scaling matrix transforms position and font size
    • Translation-only matrix offsets position
    • Missing /Matrix defaults to identity per spec
    • XObject matrix state does not leak to parent page content
    • Nested Form XObjects with composed transforms (translate + scale)
    • 90°-rotated text does not produce extreme coordinates (divide-by-zero regression)
  • All 4280 library tests pass

Checklist

  • Tests pass
  • Code formatted
  • Documentation updated (no user-facing API changes)

bsickler and others added 4 commits March 19, 2026 22:42
Text extraction was returning coordinates in Form XObject internal space
instead of page space because process_xobject() did not read or apply
the /Matrix entry. Per ISO 32000-1 §8.10.1, invoking a Form XObject via
Do requires saving graphics state, concatenating /Matrix with the CTM,
executing the content stream, then restoring graphics state.

This adds all three missing steps (save, matrix concatenation, restore)
to match the rendering pipeline's existing behavior in page_renderer.rs.
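
The three steps can be sketched roughly as below. This is an illustration under stated assumptions, not the body of process_xobject(): `Extractor`, `invoke_form`, and `to_page_space` are hypothetical names, the graphics state is reduced to the CTM alone, and the content stream is modelled as a closure.

```rust
/// PDF-style affine matrix [a b c d e f] (row-vector convention).
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Matrix {
    pub a: f64,
    pub b: f64,
    pub c: f64,
    pub d: f64,
    pub e: f64,
    pub f: f64,
}

impl Matrix {
    pub const IDENTITY: Matrix =
        Matrix { a: 1.0, b: 0.0, c: 0.0, d: 1.0, e: 0.0, f: 0.0 };

    /// Returns `self × other`: `self` is applied first, then `other`.
    pub fn multiply(&self, other: &Matrix) -> Matrix {
        Matrix {
            a: self.a * other.a + self.b * other.c,
            b: self.a * other.b + self.b * other.d,
            c: self.c * other.a + self.d * other.c,
            d: self.c * other.b + self.d * other.d,
            e: self.e * other.a + self.f * other.c + other.e,
            f: self.e * other.b + self.f * other.d + other.f,
        }
    }
}

/// Hypothetical extractor state: the CTM plus a save/restore stack.
pub struct Extractor {
    pub ctm: Matrix,
    stack: Vec<Matrix>,
}

impl Extractor {
    pub fn new() -> Self {
        Extractor { ctm: Matrix::IDENTITY, stack: Vec::new() }
    }

    /// Models `Do` on a Form XObject: implicit q, concatenate /Matrix
    /// with the CTM, run the content, implicit Q.
    pub fn invoke_form<F: FnOnce(&mut Extractor)>(
        &mut self,
        form_matrix: Option<Matrix>,
        content: F,
    ) {
        self.stack.push(self.ctm); // save (implicit q)
        // A missing /Matrix defaults to the identity matrix.
        let m = form_matrix.unwrap_or(Matrix::IDENTITY);
        self.ctm = m.multiply(&self.ctm); // CTM' = Matrix × CTM
        content(self);
        // restore (implicit Q), so the form's transform cannot leak out
        self.ctm = self.stack.pop().expect("balanced save/restore");
    }

    /// Maps a point from the current (form) space to page space.
    pub fn to_page_space(&self, x: f64, y: f64) -> (f64, f64) {
        (
            x * self.ctm.a + y * self.ctm.c + self.ctm.e,
            x * self.ctm.b + y * self.ctm.d + self.ctm.f,
        )
    }
}
```

Because the saved CTM is restored after the closure returns, nested forms compose their transforms on the way in and unwind them on the way out.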

Closes yfedoseev#264

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The text matrix advance after showing glyphs was computed as:
  advance = total_width / text_matrix.d.abs()

For 90°-rotated text (d ≈ 0, b = 1), this divides by near-zero,
producing coordinates in the hundreds of millions (e.g., y = 599,776,832
on page 4 of 154408.pdf for "Commercial Inventories MMbls").

Per ISO 32000-1:2008 §9.4.4, the text matrix update after showing text
is: Tm_new = [1 0 0 1 tx 0] × Tm_old, where tx is the text-space
displacement. This means:
  e_new = e + tx * a
  f_new = f + tx * b

The division by d was unnecessary: total_width is already tx in text
space. Dividing by d happened to be a no-op for non-rotated text (a = d)
but exploded for 90°-rotated text, where d ≈ 0.

Also fixes advance_position_for_offset (TJ array numeric offsets) which
only updated text_matrix.e, ignoring f — causing incorrect positioning
for rotated text.
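
A TJ numeric offset can be handled with the same rule. The sketch below is an assumption-laden illustration, not the PR's advance_position_for_offset: `advance_for_tj_offset` is a hypothetical name, the text matrix is a plain `[a, b, c, d, e, f]` array, and `h_scale` is the horizontal scaling as a fraction (1.0 for 100%).

```rust
/// Applies a TJ numeric offset to a text matrix [a, b, c, d, e, f].
/// The offset is in thousandths of a text-space unit and is subtracted,
/// so tx = -(offset / 1000) * font_size * h_scale.
pub fn advance_for_tj_offset(
    tm: [f64; 6],
    offset: f64,
    font_size: f64,
    h_scale: f64,
) -> [f64; 6] {
    let tx = -offset / 1000.0 * font_size * h_scale;
    let [a, b, c, d, e, f] = tm;
    // Update both translation components, not just e, so that
    // rotated text moves along its own baseline.
    [a, b, c, d, e + tx * a, f + tx * b]
}
```

With a 90°-rotated matrix (`a = 0`, `b = 1`) the whole displacement lands in `f`, which is the component the old code left untouched.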

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

[Bug]: extract_spans() returns Form XObject text coordinates in internal space, not page space
