PageLayout to_s() merges TextRuns that overlap #290

@rockorequin

I notice that some PDFs contain extra, apparently spurious text, e.g. some bank statements (presumably the bank puts it in to make the statements hard to parse).

An example is where you have a transparent text run of '6' in one text run and an amount of say '50.00' in a text run that overlaps the '6'. PDF Reader's Page text() method outputs these two as 650.00, so it incorrectly looks like the amount is $650 instead of $50. The overlap also occurs when the '6' ends in the column immediately before the '50.00'.

If I view the PDF in Evince, the spurious text is rendered transparently, so the document looks fine unless I select the text for copy and paste. In the pasted output, the two strings appear with a space between them, i.e. '6 50.00'. So it's not ideal, but at least you can recognise that the amount is $50 and not $650.

The PageLayout to_s method is doing the hard work of mapping the TextRun objects and rendering them to a string. It calls local_string_insert to insert each text at its x_pos and y_pos (x_pos and y_pos are converted into columns from the raw x and y coords).
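
To illustrate why column-based insertion produces 650.00, here is a minimal sketch. It is a simplified stand-in, not pdf-reader's actual code: `insert_at_column` is a hypothetical helper, and the column numbers are made up for the example.

```ruby
# Simplified stand-in for a column-based layout; not pdf-reader's actual code.
# Overwrites part of a row buffer with text, starting at the given column.
def insert_at_column(row, text, col)
  row[col, text.length] = text
  row
end

row = " " * 20
insert_at_column(row, "6", 9)       # transparent run ending just before the amount
insert_at_column(row, "50.00", 10)  # the real amount
merged = row.strip
puts merged
```

Because the '6' lands in the column immediately before the '50.00', nothing separates the two strings in the resulting row.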

Brainstorming, there might be a couple of ways around this:

  • I have tried moving the text runs that overlap prior to calling PageLayout's to_s method (e.g. at the end of PageLayout's initialize method) to ensure that there is at least one column between them. This fixes the issue: I get 6 50.00 instead of 650.00, so it matches how Evince works. I did it by grouping the text runs into a hash of { y column => [ ary of TextRun ] } and then sorting each ary by its start x column. Then I check for overlap by comparing the endx column of one text run against the following text run's x column. The disadvantage of doing this is that you could potentially lose text off the right-hand side of the page, because to_s checks that the text run starts within the expected number of columns on the page before inserting it. Maybe we could remove that check so the text isn't lost.

  • We could add an alternative method to page.text() that returns the TextRun objects directly, e.g. as a hash of { y_column => [ ary of TextRun ] } or as an Array of [ ary of TextRun ]. If the TextRun object had methods to return its x_col and endx_col as well as the raw x and endx, the caller could figure out for themselves where the runs are located on the page. (As a side benefit, the caller could also see the TextRun attributes like font_size and width. We could even make the TextRun store its font so the user could see which font is applicable, which might help with #272, "Parse font of given text".)
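
The first idea could look roughly like the sketch below. It rests on assumptions: `ColumnRun` and `separate_overlaps` are hypothetical names, the runs are taken to be already converted to character columns, and none of this is pdf-reader's TextRun API. The grouping step also gives a rough idea of the { y column => [ runs ] } shape mentioned in the second idea.

```ruby
# Hypothetical stand-in for a text run already mapped to character columns.
# The field names are illustrative, not pdf-reader's TextRun API.
ColumnRun = Struct.new(:row, :col, :text) do
  # Column occupied by the last character of the run.
  def end_col
    col + text.length - 1
  end
end

# Returns new runs, shifted right where needed so that at least one blank
# column separates neighbouring runs on the same row.
def separate_overlaps(runs)
  runs.group_by(&:row).flat_map do |_row, row_runs|
    previous_end = nil
    row_runs.sort_by(&:col).map do |run|
      col = run.col
      # Overlapping or directly adjacent: move to leave one blank column.
      col = previous_end + 2 if previous_end && col <= previous_end + 1
      shifted = ColumnRun.new(run.row, col, run.text)
      previous_end = shifted.end_col
      shifted
    end
  end
end

runs = [ColumnRun.new(0, 10, "50.00"), ColumnRun.new(0, 9, "6")]
fixed = separate_overlaps(runs)

line = " " * 20
fixed.each { |run| line[run.col, run.text.length] = run.text }
puts line.strip
```

With these inputs the '50.00' run moves one column to the right, so the row reads 6 50.00. A run shifted this way can end up starting beyond the page's column count, which is the lost-text risk described in the first idea.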

This might also be related to #43.

I'm using pdf-reader v2.2.0.
