fix: apply Form XObject /Matrix and correct text matrix advance for rotated text #266
Open
bsickler wants to merge 4 commits into yfedoseev:main from
Text extraction was returning coordinates in Form XObject internal space instead of page space because process_xobject() did not read or apply the /Matrix entry.

Per ISO 32000-1 §8.10.1, invoking a Form XObject via Do requires saving the graphics state, concatenating /Matrix with the CTM, executing the content stream, then restoring the graphics state. This adds all three missing steps (save, matrix concatenation, restore) to match the rendering pipeline's existing behavior in page_renderer.rs.

Closes yfedoseev#264

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
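The save / concatenate / execute / restore sequence described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Matrix` and `Extractor` types and the `with_form_xobject` helper are hypothetical names, and the graphics state is reduced to the CTM alone (a real extractor would save and restore the full state). Matrices use the PDF row-vector convention, so the new CTM is /Matrix × CTM.

```rust
/// A PDF transformation matrix [a b c d e f] (row-vector convention).
/// Hypothetical type for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Matrix {
    a: f64,
    b: f64,
    c: f64,
    d: f64,
    e: f64,
    f: f64,
}

impl Matrix {
    const IDENTITY: Matrix = Matrix { a: 1.0, b: 0.0, c: 0.0, d: 1.0, e: 0.0, f: 0.0 };

    /// Returns `self × other`: points are transformed by `self` first, then `other`.
    fn then(self, o: Matrix) -> Matrix {
        Matrix {
            a: self.a * o.a + self.b * o.c,
            b: self.a * o.b + self.b * o.d,
            c: self.c * o.a + self.d * o.c,
            d: self.c * o.b + self.d * o.d,
            e: self.e * o.a + self.f * o.c + o.e,
            f: self.e * o.b + self.f * o.d + o.f,
        }
    }

    /// Transforms the point (x, y) by this matrix.
    fn apply(self, x: f64, y: f64) -> (f64, f64) {
        (
            x * self.a + y * self.c + self.e,
            x * self.b + y * self.d + self.f,
        )
    }
}

/// Simplified extractor state: only the CTM and a save stack (assumption).
struct Extractor {
    ctm: Matrix,
    stack: Vec<Matrix>,
}

impl Extractor {
    /// Runs `run` as if it were the Form XObject's content stream.
    /// A missing /Matrix entry defaults to the identity matrix.
    fn with_form_xobject<T>(
        &mut self,
        form_matrix: Option<Matrix>,
        run: impl FnOnce(&mut Self) -> T,
    ) -> T {
        self.stack.push(self.ctm); // implicit q: save state
        let m = form_matrix.unwrap_or(Matrix::IDENTITY);
        self.ctm = m.then(self.ctm); // CTM' = /Matrix × CTM
        let out = run(&mut *self); // execute the form's content stream
        self.ctm = self.stack.pop().expect("save stack is balanced"); // implicit Q
        out
    }
}
```

With a page CTM that translates by (100, 200) and a form /Matrix of [2 0 0 2 10 20], the form-space point (1, 1) maps to (112, 222) in page space, and the CTM is back to its original value once the form finishes.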
The text matrix advance after showing glyphs was computed as:

    advance = total_width / text_matrix.d.abs()

For 90°-rotated text (d ≈ 0, b = 1), this divides by near-zero, producing coordinates in the hundreds of millions (e.g., y = 599,776,832 on page 4 of 154408.pdf for "Commercial Inventories MMbls").

Per ISO 32000-1:2008 §9.4.4, the text matrix update after showing text is Tm_new = [1 0 0 1 tx 0] × Tm_old, where tx is the text-space displacement. This means:

    e_new = e + tx * a
    f_new = f + tx * b

The division by d was unnecessary: total_width is already tx in text space. The /d happened to be a no-op for non-rotated text (a = d) but exploded for any rotation.

Also fixes advance_position_for_offset (TJ array numeric offsets), which only updated text_matrix.e, ignoring f, causing incorrect positioning for rotated text.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
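As a rough illustration of the corrected update, the sketch below applies e += tx * a and f += tx * b and contrasts it with the old division by |d|. The `TextMatrix` type and the helper names `advance_text_matrix`, `tj_offset_to_tx`, and `old_advance` are hypothetical, not the identifiers used in the PR. The TJ conversion assumes the §9.4.4 displacement formula with no glyph width, character spacing, or word spacing, and horizontal scaling given as a fraction (1.0 = 100%).

```rust
/// Text matrix [a b c d e f]; hypothetical type for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
struct TextMatrix {
    a: f64,
    b: f64,
    c: f64,
    d: f64,
    e: f64,
    f: f64,
}

/// Tm_new = [1 0 0 1 tx 0] × Tm_old: translate by (tx, 0) in text space.
/// Both e and f change, so rotated text advances along its own baseline.
fn advance_text_matrix(tm: &mut TextMatrix, tx: f64) {
    tm.e += tx * tm.a;
    tm.f += tx * tm.b;
}

/// Text-space displacement for a TJ numeric offset. The offset is in
/// thousandths of a text-space unit and is subtracted, so a negative
/// offset moves the next glyph forward.
fn tj_offset_to_tx(offset: f64, font_size: f64, horizontal_scaling: f64) -> f64 {
    -(offset / 1000.0) * font_size * horizontal_scaling
}

/// The previous formula, kept only to show why it fails when d is 0.
fn old_advance(total_width: f64, tm: &TextMatrix) -> f64 {
    total_width / tm.d.abs()
}
```

For a 90°-rotated matrix [0 1 -1 0 50 100] and tx = 12, the corrected update leaves e at 50 and moves f to 112, while `old_advance` divides by zero and yields infinity. A TJ offset of -250 at font size 12 with 100% horizontal scaling gives tx = 3.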
Description
Two related fixes for incorrect text span coordinates during extraction:
Form XObject /Matrix support: Apply the Form XObject's /Matrix entry when extracting text, with implicit q/Q save/restore per ISO 32000-1 §8.10.1. Previously the /Matrix was ignored, causing text inside transformed Form XObjects to have incorrect positions.
Rotated text matrix advance: Fix divide-by-zero when advancing the text matrix for 90°-rotated text. The advance was computed as total_width / text_matrix.d.abs(), which explodes when d ≈ 0 (e.g., y = 599,776,832 for rotated chart axis labels). Per ISO 32000-1:2008 §9.4.4, the correct update is Tm_new = [1 0 0 1 tx 0] × Tm_old, which requires no division. Also fixes advance_position_for_offset (TJ numeric offsets), which only updated text_matrix.e, ignoring f.
Type of Change
Related Issue
Fixes #264
Testing
tests/test_form_xobject_matrix.rs) with 6 tests covering:
Checklist