@kermitt2
Collaborator

Chicago reference style has this awful usage of the 3-em dash to repeat one, several, or all of the authors of the previous reference. Although this practice seems to have been removed or restricted by the latest Chicago style guidelines, LaTeX for instance still works with the older guidelines, and there are tons of backfiles using this style.

This PR tries to cover 3-em dashes in references (some training data for this has been added separately) when each 3-em dash sequence refers to one author. The case where a single 3-em dash sequence refers to all the previous authors is not covered, because it seems ambiguous with the case where only the first author is repeated.
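To make the covered case concrete, here is a minimal sketch in Python. It is not the code of this PR. It assumes the author field has already been segmented into one string per author, and that each placeholder stands for the author at the same position in the previous reference. The names `PLACEHOLDER` and `expand_repeated_authors` are hypothetical, as is the choice of which character sequences count as a placeholder.

```python
import re

# Assumption: a placeholder is U+2E3B (three-em dash), three U+2014 em dashes,
# or a run of three or more ASCII hyphens, as often found in extracted text.
PLACEHOLDER = re.compile(r"(?:\u2E3B|\u2014{3}|-{3,})")


def expand_repeated_authors(current, previous):
    """Replace each placeholder in `current` with the author at the same
    position in `previous` (hypothetical helper, positional mapping assumed)."""
    resolved = []
    for position, author in enumerate(current):
        # Ignore surrounding whitespace and trailing punctuation.
        token = author.strip(" .,;")
        if PLACEHOLDER.fullmatch(token):
            if position >= len(previous):
                raise ValueError(
                    f"placeholder at position {position} has no matching previous author"
                )
            resolved.append(previous[position])
        else:
            resolved.append(author)
    return resolved


previous = ["Smith, John", "Doe, Jane"]
current = ["\u2014\u2014\u2014", "\u2014\u2014\u2014.", "Brown, Ann"]
print(expand_repeated_authors(current, previous))
```

With this positional reading, a lone placeholder resolves to the first author only. The other reading, where it stands for all the previous authors, is exactly the ambiguous case left out above.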

However, the real main problems with this crappy mechanism come with OCR, in particular older OCRized PDFs. These dashes are never correctly recognized, and reconstructing the author list becomes simply impossible.

Some examples:

Screenshot from 2023-05-13 13-34-45

Screenshot from 2023-05-13 13-35-29

With numbers:

Screenshot from 2023-05-13 13-35-47

And finally, three 3-em dashes used to repeat all the authors of the previous reference (not just the first!):

Screenshot from 2023-05-13 13-34-19

@kermitt2 marked this pull request as draft December 18, 2023 10:36