normalize strings based on the char map in the current font (if present) by blackghost1987 · Pull Request #1 · markcatley/typed-pdf

blackghost1987 · 2021-05-25T10:05:11Z

This is mostly copied from the "text" example of the "pdf" crate, based on work by @s3bk
https://github.com/pdf-rs/pdf/blob/master/examples/text/src/main.rs

The FontInfo struct is created there to cache the character maps retrieved from the Font resources. Unfortunately it is not part of the library's main codebase, only the examples, so I couldn't import it and had to copy it as well. The handling of character maps doesn't look mature yet (judging by the comment on line 24), but it worked fine for my use case, so I wanted to integrate it with the nicely typed Operations from this crate.

I think the parsing of the Fonts is not really in the scope of this crate, but it would be really useful to use them if present.
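To illustrate the idea, here is a minimal sketch of decoding a string through the current font's character map when one is present. It is not the code from this PR or from the pdf crate's example: the `FontInfo` fields, the `decode_string` signature, the big-endian two-byte code assumption, and both fallbacks (Latin-1 when there is no font, U+FFFD for unmapped codes) are assumptions made for the illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the example's `FontInfo`: a cached map from
/// character codes to Unicode text. Field names are assumptions.
#[derive(Debug, Default)]
pub struct FontInfo {
    /// Character code -> Unicode text (one code may expand to several
    /// characters, e.g. a ligature).
    pub cmap: HashMap<u16, String>,
    /// Assumption: when true, codes are two bytes wide, big-endian.
    pub two_byte: bool,
}

/// Decode raw string bytes using the current font's map, if there is one.
pub fn decode_string(font: Option<&FontInfo>, data: &[u8]) -> String {
    let font = match font {
        Some(f) => f,
        // No font info: fall back to reading each byte as a Latin-1 code point.
        None => return data.iter().map(|&b| b as char).collect(),
    };

    let mut out = String::new();
    if font.two_byte {
        for pair in data.chunks(2) {
            // A trailing odd byte is treated as a one-byte code.
            let code = if pair.len() == 2 {
                u16::from_be_bytes([pair[0], pair[1]])
            } else {
                u16::from(pair[0])
            };
            push_code(font, code, &mut out);
        }
    } else {
        for &b in data {
            push_code(font, u16::from(b), &mut out);
        }
    }
    out
}

/// Append the mapped text for `code`, or U+FFFD if the map has no entry.
fn push_code(font: &FontInfo, code: u16, out: &mut String) {
    match font.cmap.get(&code) {
        Some(text) => out.push_str(text),
        None => out.push(char::REPLACEMENT_CHARACTER),
    }
}
```

A caller would keep one `FontInfo` per font resource (for example in a `HashMap` keyed by font name), look up the entry for the font selected by the most recent font-setting operation, and pass it as `Some(&info)`, or pass `None` when the font has no character map.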

s3bk · 2021-05-25T10:08:13Z

Nitpick: I'd call it decode_string instead.

s3bk · 2021-05-25T17:39:06Z

Regarding fonts:
While fonts play an important role in PDFs, there are plenty of cases that do not need them.
They are also a huge pain to get working reliably, and I don't want to put too much burden on the core PDF crate.
And finally, pdf-rs/font is just one of many font parsers and one may wish to use a different one (for example to get hinting).

I think the text extraction is far from solved and during this experimental stage, copying code around is quite normal.

s3bk · 2021-05-27T08:41:20Z

@blackghost1987 I am going to merge pdf-rs/pdf#89 soon, which may make this crate not really necessary.

My suggestion would be to create a pdf-tools or pdf-toolbox with the text extraction code and everything else that pops up but does not fit into the main crate.

blackghost1987 · 2021-05-27T08:59:45Z

Great! It would be nice to have the typed Operations in the main PDF crate, seems more logical.

My main use case is text extraction, so it would be awesome to have that as a feature in one of the crates, not just an example (even if it's not a full-fledged working solution at first; I know it's not that straightforward). It doesn't really matter which crate has the text extraction feature though, separate them as you like.

Feel free to close this one when it no longer makes sense.

s3bk · 2021-05-29T11:41:18Z

https://github.com/pdf-rs/pdf_tools/ is online. For now it uses the dev branch of the pdf repo.

normalize strings based on the char map in the current font if present

8a01763

rename to decode_string

835e8d6

s3bk approved these changes May 25, 2021

blackghost1987 wants to merge 2 commits into markcatley:main from coreconsult:normalize_string_with_font
