expand scope of arxiv identifier matcher, and fix some training data annotations #858

bnewbold · 2021-11-13T01:57:56Z

The current arxiv.org identifier matcher regex requires an "arXiv:" prefix. This misses some un-ambiguous old-style identifiers in short citations, like these examples;

B.A. Dobrescu, hep-ph/9510424.
K.R. Dienes, C. Kolda and J. March-Russell, hep-ph/9610479.
S.P. Martin, hep-ph/9608224.

This PR extends the regex to allow these cases, by more conservatively matching the "section" prefix, and not allowing whitespace, to reduce the risk of matching on random garbled text. The above examples are added as citation training data.

I also noticed a handful of existing training citations which were missing type="arxiv" on the <idno> tag, so I added those.

There is some possible ambiguity about arxiv vs. arXiv in the TEI data: GROBID output has the capitalized X, while the training data is lower-case. Also, while all (or almost all) the citation training data has these annotations, none of the header <idno> annotations do. Not sure if it would be helpful to add those.

This is related to internetarchive/fatcat#84 and to #275

The simple part of this is allowing 'arxiv:' in addition to 'arXiv:'. The more complex second part is to conservatively match "old" (pre-2008) style identifiers which do not have a prefix. The conservative matching is because there is less confidence that a string is actually an arxiv identifier without the prefix. Explicit collection prefixes are included (for those that existed pre-2008), internal whitespace is not allowed, and the identifier must be separated from other alphabetic strings.

…fiers These old-style arxiv identifiers have no prefix ("arxiv:"), but are unambiguously arxiv.org identifiers.

elonzh · 2021-11-13T05:39:52Z

FYI, https://github.com/mattbierbaum/arxiv-public-datasets/blob/master/arxiv_public_data/regex_arxiv.py

kermitt2 · 2021-12-11T13:44:07Z

@bnewbold Thanks a lot for the improvement ! I think there are small issues in the new regex and I propose some changes in the review after some tests, if you could have a look?

bnewbold · 2021-12-16T01:23:15Z

@kermitt2 thanks for reviewing.

I don't see any proposed changes or review notes here on github. Maybe I am missing something?

kermitt2 · 2021-12-11T13:42:25Z

grobid-core/src/main/java/org/grobid/core/utilities/TextUtilities.java

    static public final Pattern arXivPattern = Pattern
-        .compile("(arXiv\\s?(\\.org)?\\s?\\:\\s?\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(arXiv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]*\\s?/\\s?\\d{7}(v\\d+)?)");
+        .compile("(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s??\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]{3,16}\\s?/\\s?\\d{7}(v\\d+)?)|([^a-zA-Z](math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?)");



Testing a bit the regex:

there is a double ?? in the first component which is a typo I think?

better avoid having the captured parenthesis for the sub-term (math|hep|astro|cond|gr|nucl|quat|stat|ph...), we could add a non captured parenthesis

I don't understand the [^a-zA-Z] before the (math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin), it makes hep-lat/0509026 failing for example as it expects something before the arXiv identifier?

Suggested regex:

(ar[xX]iv\s?(\.org)?\s?\:\s?\d{4}\s?\.\s?\d{4,5}(v\d+)?)|(ar[xX]iv\s?(\.org)?\s?\:\s?[ a-zA-Z\-\.]{3,16}\s?/\s?\d{7}(v\d+)?)|((?:math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\-bio|q\-fin)[a-zA-Z\-\.]*/\d{7}(v\d+)?)

In java syntax:

(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]{3,16}\\s?/\\s?\\d{7}(v\\d+)?)|((?:math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?)

We then have only one captured group for one arXiv identifier.

kermitt2 · 2021-12-11T13:51:32Z

grobid-core/src/main/java/org/grobid/core/utilities/TextUtilities.java

+    //   "old style" without prefix (strict): 'hep-th/9901001v2', 'math/9901001'
    static public final Pattern arXivPattern = Pattern
-        .compile("(arXiv\\s?(\\.org)?\\s?\\:\\s?\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(arXiv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]*\\s?/\\s?\\d{7}(v\\d+)?)");
+        .compile("(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s??\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]{3,16}\\s?/\\s?\\d{7}(v\\d+)?)|([^a-zA-Z](math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?)");


According to https://github.com/mattbierbaum/arxiv-public-datasets/blob/master/arxiv_public_data/regex_arxiv.py#L12 we have much more categories possible. Like for the sub-categories, we might want to simply have a free range of characters?
e.g. instead of

(math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*

having simply:

[a-zA-Z\\-\\.]*

I am a bit worry then about the robustness of the regex for ill-formed input wrt. catastrophic backtracking as we have seen elsewhere. Maybe best alternative would be to enumerate all the possibilities?

kermitt2 · 2021-12-16T07:09:54Z

Ok sorry I think I did not submit the review, can you see my comments now?

bnewbold added 3 commits November 12, 2021 17:00

citation training data: annotate <idno> which are arxiv identifiers

8f03c49

add handful of citation training examples with old-style arxiv identi…

fb04a85

…fiers These old-style arxiv identifiers have no prefix ("arxiv:"), but are unambiguously arxiv.org identifiers.

kermitt2 added the enhancement label Nov 13, 2021

kermitt2 self-requested a review December 16, 2021 07:07

kermitt2 requested changes Dec 16, 2021

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

expand scope of arxiv identifier matcher, and fix some training data annotations #858

expand scope of arxiv identifier matcher, and fix some training data annotations #858

Uh oh!

bnewbold commented Nov 13, 2021

Uh oh!

elonzh commented Nov 13, 2021

Uh oh!

kermitt2 commented Dec 11, 2021

Uh oh!

bnewbold commented Dec 16, 2021

Uh oh!

kermitt2 Dec 11, 2021

Uh oh!

kermitt2 Dec 11, 2021

Uh oh!

kermitt2 commented Dec 16, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

expand scope of arxiv identifier matcher, and fix some training data annotations #858

Are you sure you want to change the base?

expand scope of arxiv identifier matcher, and fix some training data annotations #858

Uh oh!

Conversation

bnewbold commented Nov 13, 2021

Uh oh!

elonzh commented Nov 13, 2021

Uh oh!

kermitt2 commented Dec 11, 2021

Uh oh!

bnewbold commented Dec 16, 2021

Uh oh!

kermitt2 Dec 11, 2021

Choose a reason for hiding this comment

Uh oh!

kermitt2 Dec 11, 2021

Choose a reason for hiding this comment

Uh oh!

kermitt2 commented Dec 16, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants