normalize strings based on the char map in the current font (if present) #1
blackghost1987 wants to merge 2 commits into markcatley:main from
nitpick: I'd call it
Regarding fonts: I think text extraction is far from solved, and during this experimental stage copying code around is quite normal.
@blackghost1987 I am going to merge pdf-rs/pdf#89 soon, which may make this crate not really necessary. My suggestion would be to create a pdf-tools or pdf-toolbox crate for the text extraction code and anything else that pops up but does not fit into the main crate.
Great! It would be nice to have the typed Operations in the main PDF crate; that seems more logical. My main use case is text extraction, so it would be awesome to have that as a feature in one of the crates, not just as an example (even if it's not a full-fledged working solution at first; I know it's not that straightforward). It doesn't really matter which crate has the text extraction feature, though, so separate them as you like. Feel free to close this one once it no longer makes sense.
https://github.com/pdf-rs/pdf_tools/ is online. For now using the
This is mostly copied from the "text" example of the "pdf" crate, based on work by @s3bk
https://github.com/pdf-rs/pdf/blob/master/examples/text/src/main.rs
The FontInfo struct is created there to cache the character maps retrieved from the Font resources. Unfortunately it lives only in the examples, not in the main codebase of the lib, so I couldn't import it and had to copy it as well. The handling of character maps doesn't look mature yet (judging by the comment on line 24), but it worked fine for my use case, so I wanted to integrate it with the nicely typed Operations from this crate.
I think parsing fonts is not really in the scope of this crate, but it would be really useful to use the fonts when they are present.
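To make the idea concrete, here is a minimal, self-contained sketch of the caching and normalization described above. It is not the code from the "text" example or from this PR: the `FontInfo` fields, the `FontCache` type, and every method name below are hypothetical, and it assumes each character code is exactly 2 bytes (big-endian), which real fonts do not guarantee. When the current font has no character map, the bytes are passed through as lossy UTF-8.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the example's FontInfo: an optional
/// character map from character codes to Unicode text.
#[derive(Debug, Default, Clone)]
pub struct FontInfo {
    pub cmap: Option<HashMap<u16, String>>,
}

impl FontInfo {
    /// Decode the raw bytes of a text operand.
    /// Assumption: codes are 2 bytes, big-endian; a trailing odd byte
    /// is treated as a 1-byte code.
    pub fn normalize(&self, raw: &[u8]) -> String {
        match &self.cmap {
            Some(cmap) => raw
                .chunks(2)
                .map(|pair| {
                    let code = match pair {
                        [hi, lo] => u16::from_be_bytes([*hi, *lo]),
                        [single] => u16::from(*single),
                        _ => 0, // chunks(2) never yields an empty slice
                    };
                    // Unmapped codes become U+FFFD instead of being dropped.
                    cmap.get(&code)
                        .cloned()
                        .unwrap_or_else(|| '\u{FFFD}'.to_string())
                })
                .collect(),
            // No character map present: fall back to the bytes themselves.
            None => String::from_utf8_lossy(raw).into_owned(),
        }
    }
}

/// Hypothetical cache of fonts keyed by resource name (e.g. "F1"),
/// plus the name of the font selected by the last font-setting operation.
#[derive(Default)]
pub struct FontCache {
    fonts: HashMap<String, FontInfo>,
    current: Option<String>,
}

impl FontCache {
    /// Store the character map parsed for a font resource.
    pub fn insert(&mut self, name: &str, info: FontInfo) {
        self.fonts.insert(name.to_string(), info);
    }

    /// Record which font is currently selected.
    pub fn set_current(&mut self, name: &str) {
        self.current = Some(name.to_string());
    }

    /// Normalize with the current font if it is known, else pass through.
    pub fn normalize(&self, raw: &[u8]) -> String {
        match self.current.as_ref().and_then(|n| self.fonts.get(n)) {
            Some(font) => font.normalize(raw),
            None => String::from_utf8_lossy(raw).into_owned(),
        }
    }
}
```

In this sketch a caller would fill the cache once per page from the font resources, call `set_current` whenever a font-selection operation is seen, and run every text operand through `FontCache::normalize`. Keeping the fallback inside `normalize` means the operation-handling code does not need to know whether a font was parsed successfully.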