Description
Related to:
Issue
On pre-print servers, like bioRxiv.org, different versions of a paper might be declared, but they all share the same DOI. DOI is one of the factors we use to determine whether two documents with different URIs should be considered "equivalent", meaning that annotations made on one URI will also appear on the document when it is loaded at the other URI.
For document equivalence information, see the issues linked above, and also:
- https://web.hypothes.is/help/how-to-establish-or-avoid-document-equivalence-in-the-hypothesis-system/
- https://web.hypothes.is/help/how-hypothesis-interacts-with-document-metadata/
It's important to note that DOI is not the only metadata on the following pages, and this is going to be part of the SPIKE. We also respect rel=canonical when deciding whether to perform document equivalence. One thing we need to investigate in this spike is when we use DOI vs. rel=canonical, and what we do when one of them says the documents should be equivalent and the other says they should not.
Examples of document equivalence working, and not working
Example links will be to bioRxiv articles. Note that bioRxiv includes DOI metadata and rel=canonical metadata, varies URLs based on document version (v1, v2, etc.), and also differentiates abstract and full-text versions of the articles in the URL.
- Article abstract v1: https://www.biorxiv.org/content/10.1101/2023.07.02.547380v1
  - meta name="citation_doi" content="10.1101/2023.07.02.547380"
  - meta name="DC.Identifier" content="10.1101/2023.07.02.547380"
  - link rel="canonical" href="https://www.biorxiv.org/content/10.1101/2023.07.02.547380v1"
- Article full text v1: https://www.biorxiv.org/content/10.1101/2023.07.02.547380v1.full
  - meta name="citation_doi" content="10.1101/2023.07.02.547380"
  - meta name="DC.Identifier" content="10.1101/2023.07.02.547380"
  - link rel="canonical" href="https://www.biorxiv.org/content/10.1101/2023.07.02.547380v1"
- Article abstract v2: https://www.biorxiv.org/content/10.1101/2023.07.02.547380v2
  - meta name="citation_doi" content="10.1101/2023.07.02.547380"
  - meta name="DC.Identifier" content="10.1101/2023.07.02.547380"
  - link rel="canonical" href="https://www.biorxiv.org/content/10.1101/2023.07.02.547380v2"
- Article full text v2: https://www.biorxiv.org/content/10.1101/2023.07.02.547380v2.full
  - meta name="citation_doi" content="10.1101/2023.07.02.547380"
  - meta name="DC.Identifier" content="10.1101/2023.07.02.547380"
  - link rel="canonical" href="https://www.biorxiv.org/content/10.1101/2023.07.02.547380v2"
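To make the conflict in the metadata above concrete, here is a minimal Python sketch that pulls the DOI and rel=canonical values out of a page head and compares v1 against v2. It is only an illustration: the class and function names (`EquivalenceMetadataParser`, `extract_metadata`) are hypothetical, the HTML snippets are hand-built from the values listed above rather than fetched from bioRxiv, and this is not how the Hypothesis client or server actually reads metadata.

```python
from html.parser import HTMLParser


class EquivalenceMetadataParser(HTMLParser):
    """Collect DOI meta tags and the rel=canonical link from HTML (hypothetical helper)."""

    DOI_META_NAMES = ("citation_doi", "dc.identifier")

    def __init__(self):
        super().__init__()
        self.dois = set()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            name = (attrs.get("name") or "").lower()
            if name in self.DOI_META_NAMES and attrs.get("content"):
                self.dois.add(attrs["content"])
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")


def extract_metadata(html):
    """Return the DOIs and canonical URL declared in the given HTML."""
    parser = EquivalenceMetadataParser()
    parser.feed(html)
    return {"dois": parser.dois, "canonical": parser.canonical}


# Hand-built heads using the values listed in the examples above.
V1_HEAD = """
<meta name="citation_doi" content="10.1101/2023.07.02.547380" />
<meta name="DC.Identifier" content="10.1101/2023.07.02.547380" />
<link rel="canonical" href="https://www.biorxiv.org/content/10.1101/2023.07.02.547380v1" />
"""
V2_HEAD = """
<meta name="citation_doi" content="10.1101/2023.07.02.547380" />
<meta name="DC.Identifier" content="10.1101/2023.07.02.547380" />
<link rel="canonical" href="https://www.biorxiv.org/content/10.1101/2023.07.02.547380v2" />
"""

v1 = extract_metadata(V1_HEAD)
v2 = extract_metadata(V2_HEAD)

print("Same DOI:", v1["dois"] == v2["dois"])
print("Same canonical:", v1["canonical"] == v2["canonical"])
```

With these inputs the DOI sets match while the canonical URLs differ, which is exactly the disagreement between the two signals that this spike needs to resolve.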
Example 1 - Document equivalence is not working (until someone makes an annotation on these pages - please don't make an annotation on these pages!)
- https://hyp.is/go?url=https%3A%2F%2Fwww.biorxiv.org%2Fcontent%2F10.1101%2F2023.07.02.547380v2&group=q5X6RWJ6
- https://hyp.is/go?url=https%3A%2F%2Fwww.biorxiv.org%2Fcontent%2F10.1101%2F2023.07.02.547380v1&group=q5X6RWJ6
Note that these are v1 and v2 of the same article, with the same DOI. They have different canonical URLs. I suspect that the DOI was just not available in the metadata when v1 was annotated, but there's no way to know that for sure (in fact, bioRxiv says that DOI has been available on these URLs "forever", so it's possible DOI was present in the metadata and we didn't record it for some reason).
I suspect that if we annotated on the v1 article now we would perform document equivalence, but in this case that is not what we want.
In our database these are document.id = 3229694 and document.id = 3094407. You can see the difference in the information recorded with:
select *
from document
join document_uri on document.id = document_uri.document_id
where document.web_URI like 'https://www.biorxiv.org/content/10.1101/2023.07.02.547380%';
Example 2 - Document equivalence is working.
- https://hyp.is/go?url=https%3A%2F%2Fwww.biorxiv.org%2Fcontent%2F10.1101%2F473314v1&group=Pi3Pmdmm
- https://hyp.is/go?url=https%3A%2F%2Fwww.biorxiv.org%2Fcontent%2F10.1101%2F473314v2&group=Pi3Pmdmm
These are also v1 and v2 URLs, but they load the same set of annotations. In our database both of these URLs are document.id = 4170604.
Questions to be answered as part of this SPIKE
- When we have some metadata that would trigger document equivalence (DOI in the examples above) and also metadata that would suggest we should not perform document equivalence (rel=canonical in the examples above), how do we decide which metadata to use?
- What is a reasonable way for us to continue to respect document equivalence for DOI, and yet differentiate versions of documents which should not be equivalent?
- Is this something we need to build, or is there something the partner sites can do to prevent document equivalence when they don't want it performed?
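As a starting point for the first question, the sketch below only detects when the two signals disagree; it deliberately does not decide which one should win, since that is what the spike has to answer. Everything here is hypothetical: `PageMetadata`, `signals`, and `is_conflicting` are illustrative names, and treating "different canonical URLs" as a vote against equivalence is an assumption taken from the framing in this issue, not a description of current behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageMetadata:
    """Metadata for one page (hypothetical structure, not the real schema)."""
    doi: Optional[str]
    canonical: Optional[str]


def _compare(a, b):
    """Return True/False if both values are present, or None if either is missing."""
    if a is None or b is None:
        return None
    return a == b


def signals(a, b):
    """Report what each signal says about equivalence: True, False, or None (no opinion)."""
    return {
        "doi": _compare(a.doi, b.doi),
        "canonical": _compare(a.canonical, b.canonical),
    }


def is_conflicting(a, b):
    """True when DOI says 'equivalent' but rel=canonical says 'not equivalent'."""
    result = signals(a, b)
    return result["doi"] is True and result["canonical"] is False


v1 = PageMetadata(
    doi="10.1101/2023.07.02.547380",
    canonical="https://www.biorxiv.org/content/10.1101/2023.07.02.547380v1",
)
v2 = PageMetadata(
    doi="10.1101/2023.07.02.547380",
    canonical="https://www.biorxiv.org/content/10.1101/2023.07.02.547380v2",
)

print(signals(v1, v2))
print(is_conflicting(v1, v2))
```

Separating detection from resolution like this would let us count how many existing documents fall into the conflicting case before committing to a policy.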
I suppose outside of the engineering questions, there's a larger product question of how we should react to versioning when thinking about document equivalence. For example, if we have:
- SiteA.org/docV1
- SiteA.org/docV2
- SiteB.com/doc (which is the v1 version)
- SiteC.com/doc (which is the v2 version)
How can we correctly show annotations in these different scenarios? Basically, how can we detect versioning and use that as part of the document equivalence decisions?
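One possible input to version detection, for sites that encode the version in the URL the way the bioRxiv examples do, is to parse the `vN` suffix from the path. The sketch below is an assumption-laden illustration: `split_version` and the regular expression are hypothetical, the pattern is derived only from the bioRxiv URLs in this issue, and it returns nothing for URLs like SiteB.com/doc that carry no version marker, so it cannot answer the cross-site case on its own.

```python
import re
from urllib.parse import urlparse

# Matches paths such as /content/10.1101/473314v2 or /content/10.1101/473314v2.full
# (pattern inferred from the bioRxiv examples only; other sites will differ).
VERSION_SUFFIX = re.compile(
    r"^(?P<base>.+?)v(?P<version>\d+)(?P<view>\.[a-z][a-z0-9.-]*)?$"
)


def split_version(url):
    """Split a URL path into (base path, version number, view), or return None."""
    path = urlparse(url).path
    match = VERSION_SUFFIX.match(path)
    if match is None:
        return None
    view = (match["view"] or "").lstrip(".")
    return match["base"], int(match["version"]), view


print(split_version("https://www.biorxiv.org/content/10.1101/2023.07.02.547380v1.full"))
print(split_version("https://www.biorxiv.org/content/10.1101/473314v2"))
print(split_version("https://SiteB.com/doc"))
```

Under this scheme, two URLs with the same base path and different version numbers would be candidates for "same work, different version", while abstract and full-text views of one version share both base and version.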