Skip to content

Conversation

@bnewbold
Copy link
Contributor

The current arxiv.org identifier matcher regex requires an "arXiv:" prefix. This misses some un-ambiguous old-style identifiers in short citations, like these examples;

B.A. Dobrescu, hep-ph/9510424.
K.R. Dienes, C. Kolda and J. March-Russell, hep-ph/9610479.
S.P. Martin, hep-ph/9608224.

This PR extends the regex to allow these cases, by more conservatively matching the "section" prefix, and not allowing whitespace, to reduce the risk of matching on random garbled text. The above examples are added as citation training data.

I also noticed a handful of existing training citations which were missing type="arxiv" on the <idno> tag, so I added those.

There is some possible ambiguity about arxiv vs. arXiv in the TEI data: GROBID output has the capitalized X, while the training data is lower-case. Also, while all (or almost all) the citation training data has these annotations, none of the header <idno> annotations do. Not sure if it would be helpful to add those.

This is related to internetarchive/fatcat#84 and to #275

The simple part of this is allowing 'arxiv:' in addition to 'arXiv:'.

The more complex second part is to conservatively match "old" (pre-2008)
style identifiers which do not have a prefix. The conservative matching
is because there is less confidence that a string is actually an arxiv
identifier without the prefix. Explicit collection prefixes are included
(for those that existed pre-2008), internal whitespace is not allowed,
and the identifier must be separated from other alphabetic strings.
…fiers

These old-style arxiv identifiers have no prefix ("arxiv:"), but are
unambiguously arxiv.org identifiers.
@elonzh
Copy link
Contributor

elonzh commented Nov 13, 2021

FYI, https://github.com/mattbierbaum/arxiv-public-datasets/blob/master/arxiv_public_data/regex_arxiv.py

@kermitt2
Copy link
Collaborator

@bnewbold Thanks a lot for the improvement ! I think there are small issues in the new regex and I propose some changes in the review after some tests, if you could have a look?

@bnewbold
Copy link
Contributor Author

@kermitt2 thanks for reviewing.

I don't see any proposed changes or review notes here on github. Maybe I am missing something?

@kermitt2 kermitt2 self-requested a review December 16, 2021 07:07
static public final Pattern arXivPattern = Pattern
.compile("(arXiv\\s?(\\.org)?\\s?\\:\\s?\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(arXiv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]*\\s?/\\s?\\d{7}(v\\d+)?)");
.compile("(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s??\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]{3,16}\\s?/\\s?\\d{7}(v\\d+)?)|([^a-zA-Z](math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?)");

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Testing a bit the regex:

  • there is a double ?? in the first component which is a typo I think?
  • better avoid having the captured parenthesis for the sub-term (math|hep|astro|cond|gr|nucl|quat|stat|ph...), we could add a non captured parenthesis
    • I don't understand the [^a-zA-Z] before the (math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin), it makes hep-lat/0509026 failing for example as it expects something before the arXiv identifier?

Suggested regex:

(ar[xX]iv\s?(\.org)?\s?\:\s?\d{4}\s?\.\s?\d{4,5}(v\d+)?)|(ar[xX]iv\s?(\.org)?\s?\:\s?[ a-zA-Z\-\.]{3,16}\s?/\s?\d{7}(v\d+)?)|((?:math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\-bio|q\-fin)[a-zA-Z\-\.]*/\d{7}(v\d+)?)

In java syntax:

(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]{3,16}\\s?/\\s?\\d{7}(v\\d+)?)|((?:math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?)

We then have only one captured group for one arXiv identifier.

// "old style" without prefix (strict): 'hep-th/9901001v2', 'math/9901001'
static public final Pattern arXivPattern = Pattern
.compile("(arXiv\\s?(\\.org)?\\s?\\:\\s?\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(arXiv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]*\\s?/\\s?\\d{7}(v\\d+)?)");
.compile("(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s??\\d{4}\\s?\\.\\s?\\d{4,5}(v\\d+)?)|(ar[xX]iv\\s?(\\.org)?\\s?\\:\\s?[ a-zA-Z\\-\\.]{3,16}\\s?/\\s?\\d{7}(v\\d+)?)|([^a-zA-Z](math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*/\\d{7}(v\\d+)?)");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to https://github.com/mattbierbaum/arxiv-public-datasets/blob/master/arxiv_public_data/regex_arxiv.py#L12 we have much more categories possible. Like for the sub-categories, we might want to simply have a free range of characters?
e.g. instead of

(math|hep|astro|cond|gr|nucl|quat|stat|physics|cs|nlim|q\\-bio|q\\-fin)[a-zA-Z\\-\\.]*

having simply:

[a-zA-Z\\-\\.]*

I am a bit worry then about the robustness of the regex for ill-formed input wrt. catastrophic backtracking as we have seen elsewhere. Maybe best alternative would be to enumerate all the possibilities?

@kermitt2
Copy link
Collaborator

Ok sorry I think I did not submit the review, can you see my comments now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants