
Add optional property xpath to @context and specify a DTS selector scheme for WADM #281

@lueck


Imagine we have a knowledge graph with statements about passages of text, and we use URIs of the document endpoint to identify these passages. E.g.:

<https://example.com/api/dts/document?resource=https://coexist.org/b/john.xml&ref=John:1:3> my:predicate my:ClassX .

There is much information enclosed in the URI, e.g. which resource the passage belongs to and which part of it is meant. But we do not want to parse the URI. We want statements that describe the RDF resource identified by the URI.

DTS does not provide the properties and classes for formalizing this information. But there is already an open standard for it: the Web Annotation Data Model (WADM). By describing the RDF resource with the WADM we also get an alignment with CIDOC-CRM, at least if we follow the proposal of the LINCS project.

In terms of the WADM, a part of a document as returned by the document endpoint is a specific resource. A derived view, as returned when the mediaType parameter is specified, is also a specific resource (cf. WADM TR, Sec. 4):

While it is possible using only the constructions described above to create Annotations that reference parts of resources by using IRIs with a fragment component, there are many situations when this is not sufficient. For example, even a simple circular region of an image, or a diagonal line across it, are not possible. Selecting an arbitrary span of text in an HTML page, perhaps the simplest annotation concept, is also not supported by fragments. Furthermore, there are non-segment use cases that require a client to retrieve a specific state or representation of the resource, to style it in a particular way, to associate a role with the resource that is specific to the Annotation's use of it, or for the Annotation to only apply when the resource is used in a particular context.

The Web Annotation Data Model uses a new type of resource to capture these Annotation-specific requirements: a SpecificResource.

How would we describe the verse John:1:3 from the book of John in https://coexist.org/b/john.xml? In WADM, such a partial resource is a specific resource with two important properties: the source (identified by the URI of the whole resource) and a selector that describes the passage by some selection mechanism.

There is no fixed set of selection mechanisms in WADM. It offers many options out of the box, e.g. the oa:XPathSelector for DOM-based documents. We can also specify a DTS selection mechanism and provide alternative selectors that both describe the same portion of the document.

Multiple Selectors can be given to describe the same Segment in different ways in order to maximize the chances that it will be discoverable later, and that the consuming user agent will be able to use at least one of the Selectors. (WADM TR, Sec. 4.2)

Here's what it could look like when we describe the partial resource in two ways:

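@prefix dts: <https://w3id.org/dts/api#> .
@prefix oa: <http://www.w3.org/ns/oa#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix my: <...> .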

<https://example.com/api/dts/document?resource=https://coexist.org/b/john.xml&ref=John:1:3>
    a oa:SpecificResource ;
    oa:hasSource <https://coexist.org/b/john.xml> ;
    oa:hasSelector [
        a oa:FragmentSelector ;
        dcterms:conformsTo <https://w3id.org/dts/api#> ;
        rdf:value "tree=wadm&ref=John:1:3"
        ] ;
    oa:hasSelector [
        a oa:XPathSelector ;
        rdf:value "/Q{http://www.tei-c.org/ns/1.0}TEI[1]/Q{http://www.tei-c.org/ns/1.0}text[1]/Q{http://www.tei-c.org/ns/1.0}body[1]/Q{http://www.tei-c.org/ns/1.0}lg[1]/Q{http://www.tei-c.org/ns/1.0}lg[1]/Q{http://www.tei-c.org/ns/1.0}l[3]" ;
        ] ;
        
    # our analytical assertions. They might be formalized a bit
    # differently, but that's not the point here.
    my:predicate my:ClassX .

Note the first selector, which describes the text passage in the style we know from the DTS specifications. That's what I would suggest. We could also go with this DTS selector alone; however, I would consider that a bit obscure and not interoperable enough.

The proposed syntax of the rdf:value property, tree=wadm&ref=John:1:3, is borrowed from RFC 5147, which is used for plain text selectors in the WADM.
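To illustrate the proposed key=value syntax, here is a minimal Python sketch that serializes and parses such a selector value. The helper names `build_dts_selector` and `parse_dts_selector` are hypothetical and not part of DTS or WADM; the sketch only assumes that the value is a sequence of `key=value` pairs joined by `&`, as in the example above.

```python
from urllib.parse import parse_qs, urlencode


def build_dts_selector(ref: str, tree: str = "") -> str:
    """Serialize tree and ref in the key=value&key=value style proposed above."""
    # safe=":" keeps the colons of DTS identifiers readable, as in the example.
    return urlencode({"tree": tree, "ref": ref}, safe=":")


def parse_dts_selector(value: str) -> dict:
    """Split a selector value back into its named parts."""
    # keep_blank_values preserves an empty tree label (assumed to mean the default tree).
    parsed = parse_qs(value, keep_blank_values=True)
    return {key: values[0] for key, values in parsed.items()}


selector = build_dts_selector("John:1:3", tree="wadm")
print(selector)
print(parse_dts_selector(selector))
```

Whether an empty tree label should be written as `tree=` or left out entirely is an open question that a selector scheme specification would have to settle.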

How can we generate such an RDF-based description of a document part? Do we need another endpoint? Good news: No. We can get it from the LOD returned by the navigation endpoint by applying a SPARQL CONSTRUCT query to it. We need some parameters for the SPARQL query, but a client already knows them from its query to the document endpoint: 1) the query URL, 2) the citation tree, and 3) the ref parameter (or start and end).
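A client could derive these three parameters directly from the document URL it has just requested. The following Python sketch shows one way to do that with the standard library. The function name `sparql_parameters` and the keys `PARAMSTART` and `PARAMEND` are assumptions for illustration; only `PARAMURI`, `PARAMTREE` and `PARAMREF` correspond to names used in this proposal.

```python
from urllib.parse import parse_qs, urlsplit


def sparql_parameters(document_url: str) -> dict:
    """Collect the SPARQL query parameters from a document endpoint URL."""
    query = parse_qs(urlsplit(document_url).query)
    params = {
        "PARAMURI": document_url,
        # A missing tree parameter is assumed to mean the default tree.
        "PARAMTREE": query.get("tree", [""])[0],
    }
    if "ref" in query:
        params["PARAMREF"] = query["ref"][0]
    else:
        # Hypothetical names for a range query with start and end.
        params["PARAMSTART"] = query["start"][0]
        params["PARAMEND"] = query["end"][0]
    return params


url = ("https://example.com/api/dts/document"
       "?resource=https://coexist.org/b/john.xml&ref=John:1:3")
print(sparql_parameters(url))
```

This treats the URL only as a record of what the client asked for; the description of the resource itself still comes from the navigation endpoint's data.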

Everything else needed can be provided in the citation tree, especially the value for the XPathSelector:

<refsDecl n="wadm" default="false">
  <citeStructure unit="book" match="//body/lg" use="@n">
    <citeData use="path(.)" property="https://w3id.org/dts/api#xpath"/>
    <citeStructure unit="chapter" match="lg" use="@n" delim=":">
      <citeData use="path(.)" property="https://w3id.org/dts/api#xpath"/>
      <citeStructure unit="verse" match="l" use="@n" delim=":">
        <citeData use="path(.)" property="https://w3id.org/dts/api#xpath"/>
      </citeStructure>
    </citeStructure>
  </citeStructure>
</refsDecl>

The members of the wadm citation tree would look like this:

  "member": [
    {
      "level": 1,
      "xpath": "/Q{http://www.tei-c.org/ns/1.0}TEI[1]/Q{http://www.tei-c.org/ns/1.0}text[1]/Q{http://www.tei-c.org/ns/1.0}body[1]/Q{http://www.tei-c.org/ns/1.0}lg[1]",
      "identifier": "John",
      "parent": null,
      "citeType": "book",
      "@type": "CitableUnit"
    },
    {
      "level": 2,
      "xpath": "/Q{http://www.tei-c.org/ns/1.0}TEI[1]/Q{http://www.tei-c.org/ns/1.0}text[1]/Q{http://www.tei-c.org/ns/1.0}body[1]/Q{http://www.tei-c.org/ns/1.0}lg[1]/Q{http://www.tei-c.org/ns/1.0}lg[1]",
      "identifier": "John:1",
      "parent": "John",
      "citeType": "chapter",
      "@type": "CitableUnit"
    },
    {
      "level": 3,
      "xpath": "/Q{http://www.tei-c.org/ns/1.0}TEI[1]/Q{http://www.tei-c.org/ns/1.0}text[1]/Q{http://www.tei-c.org/ns/1.0}body[1]/Q{http://www.tei-c.org/ns/1.0}lg[1]/Q{http://www.tei-c.org/ns/1.0}lg[1]/Q{http://www.tei-c.org/ns/1.0}l[1]",
      "identifier": "John:1:1",
      "parent": "John:1",
      "citeType": "verse",
      "@type": "CitableUnit"
    },
    {
      "level": 3,
      "xpath": "/Q{http://www.tei-c.org/ns/1.0}TEI[1]/Q{http://www.tei-c.org/ns/1.0}text[1]/Q{http://www.tei-c.org/ns/1.0}body[1]/Q{http://www.tei-c.org/ns/1.0}lg[1]/Q{http://www.tei-c.org/ns/1.0}lg[1]/Q{http://www.tei-c.org/ns/1.0}l[2]",
      "identifier": "John:1:2",
      "parent": "John:1",
      "citeType": "verse",
      "@type": "CitableUnit"
    },
    {
      "level": 3,
      "xpath": "/Q{http://www.tei-c.org/ns/1.0}TEI[1]/Q{http://www.tei-c.org/ns/1.0}text[1]/Q{http://www.tei-c.org/ns/1.0}body[1]/Q{http://www.tei-c.org/ns/1.0}lg[1]/Q{http://www.tei-c.org/ns/1.0}lg[1]/Q{http://www.tei-c.org/ns/1.0}l[3]",
      "identifier": "John:1:3",
      "parent": "John:1",
      "citeType": "verse",
      "@type": "CitableUnit"
    },
    /* ... */
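To make the role of the xpath property concrete: once a client has these members, a plain lookup by identifier is enough to assemble the description of the partial resource. The following is a minimal Python sketch of that idea, not a definitive implementation. `describe_part` is a hypothetical helper, the member list is abbreviated to the keys the sketch uses, and the output uses the key names of the WADM JSON-LD context (`source`, `selector`, `conformsTo`, `value`) rather than Turtle.

```python
import json

DTS_NS = "https://w3id.org/dts/api#"
T = "Q{http://www.tei-c.org/ns/1.0}"
CHAPTER = f"/{T}TEI[1]/{T}text[1]/{T}body[1]/{T}lg[1]/{T}lg[1]"

# Abbreviated members of the navigation response shown above.
members = [
    {"identifier": "John:1", "citeType": "chapter", "xpath": CHAPTER},
    {"identifier": "John:1:3", "citeType": "verse", "xpath": f"{CHAPTER}/{T}l[3]"},
]


def describe_part(document_url, source, members, ref, tree=""):
    """Describe a document part as a WADM SpecificResource with two selectors."""
    member = next((m for m in members if m["identifier"] == ref), None)
    if member is None:
        raise KeyError(f"no member with identifier {ref!r}")
    selectors = [
        {"type": "FragmentSelector", "conformsTo": DTS_NS,
         "value": f"tree={tree}&ref={ref}"},
    ]
    # xpath is optional, so the XPathSelector is only added when it is present.
    if "xpath" in member:
        selectors.append({"type": "XPathSelector", "value": member["xpath"]})
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": document_url,
        "type": "SpecificResource",
        "source": source,
        "selector": selectors,
    }


description = describe_part(
    "https://example.com/api/dts/document"
    "?resource=https://coexist.org/b/john.xml&ref=John:1:3",
    "https://coexist.org/b/john.xml", members, "John:1:3", tree="wadm")
print(json.dumps(description, indent=2))
```

The sketch takes the source URI as an argument; in the SPARQL approach it is read from the navigation response instead.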

Here's the SPARQL query:

# SPARQL for constructing a WADM selector for the output of the
# document endpoint queried with a ref parameter. The input graph must
# be the output of a navigation endpoint for the same citation tree of
# the same resource.
#
# Parameters to be set:
#
# ?PARAMTREE - the label of the citation tree, empty string for default
# ?PARAMREF - the member identifier passed to the document endpoint as ref parameter
# ?PARAMURI - the document query URL

PREFIX dts: <https://w3id.org/dts/api#>
PREFIX oa:  <http://www.w3.org/ns/oa#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>


CONSTRUCT {
  ?PARAMURI rdf:type oa:SpecificResource .
  ?PARAMURI oa:hasSource ?resource .
  ?PARAMURI oa:hasSelector _:xps .
  _:xps rdf:type oa:XPathSelector .
  _:xps rdf:value ?xpath .
  ?PARAMURI oa:hasSelector _:fgs .
  _:fgs rdf:type oa:FragmentSelector .
  _:fgs dcterms:conformsTo dts: .
  _:fgs rdf:value ?DTSSEL .

  # _:fgs dts:isMember _:m .
  # _:m dts:identifier ?PARAMREF .
  # _:m dts:citeType ?citeType .
  # _:m rdf:type dts:CitableUnit .
  # _:m dts:fromTree ?PARAMTREE .
  # _:m dts:level ?level .
  # _:m dts:parent ?parent .

  ?PARAMURI dts:citeType ?citeType .
}

WHERE {

  # ?PARAM* must be passed in as parameters
  BIND("wadm" as ?PARAMTREE) . # empty value means the default tree?
  BIND("John:1:3" as ?PARAMREF) .
  BIND(<https://example.com/api/dts/document?resource=https://coexist.org/b/john.xml&ref=John:1:3> as ?PARAMURI) .


  BIND(CONCAT("tree=", STR(?PARAMTREE), "&ref=", STR(?PARAMREF)) as ?DTSSEL) .

  ?resource rdf:type dts:Resource .
  ?member rdf:type dts:CitableUnit .
  ?member dts:identifier ?PARAMREF .
  ?member dts:xpath ?xpath .
  ?member dts:citeType ?citeType .
  ?member dts:level ?level .
  OPTIONAL { ?member dts:parent ?parent . } # top-level units have no parent

}

If we uncomment the commented lines, we also get the information enclosed in tree=wadm&ref=John:1:3 in a more 'atomic' way. Portions of text queried with start+end would require another SPARQL query, one that constructs an oa:RangeSelector.
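For the start+end case, the shape of the result can be sketched in Python as follows. This is an illustration under assumptions, not the SPARQL query itself: `range_selector` is a hypothetical helper, and the keys `startSelector` and `endSelector` are those of the WADM JSON-LD context. Note that, as I read the WADM, a range runs up to the beginning of the end selector's selection, whereas a DTS end is inclusive; the sketch does not resolve that difference.

```python
T = "Q{http://www.tei-c.org/ns/1.0}"
CHAPTER = f"/{T}TEI[1]/{T}text[1]/{T}body[1]/{T}lg[1]/{T}lg[1]"

# The three verse members from the navigation response, reduced to two keys.
members = [
    {"identifier": f"John:1:{n}", "xpath": f"{CHAPTER}/{T}l[{n}]"}
    for n in (1, 2, 3)
]


def range_selector(members, start, end):
    """Build a WADM RangeSelector from the identifiers of two members."""
    xpath_of = {m["identifier"]: m["xpath"] for m in members}
    return {
        "type": "RangeSelector",
        "startSelector": {"type": "XPathSelector", "value": xpath_of[start]},
        "endSelector": {"type": "XPathSelector", "value": xpath_of[end]},
    }


print(range_selector(members, "John:1:1", "John:1:3"))
```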

There are working code examples in the DTS Transformation's WIKI.

DTS and WADM share some important characteristics:

  1. continuous ranges: In DTS we get a continuous portion of the document, no matter whether we query it by ref or by start and end. As far as I can see, a WADM selector also selects a continuous range. I think that's an important constraint, and we should exploit its productive potential rather than underline its limitations.
  2. discontinuous ranges: In DTS, multiple queries have to be issued to get discontinuous, disconnected parts. In WADM, disconnected portions would be described by multiple specific resources.
  3. preimage - image: In DTS the resource, and in the WADM the source, has an identifier that is mapped to the full document in a base format. It's a preimage (Urbild) in a mathematical sense. Portions and derivations to other media types are images (Bild) in a mathematical sense.

WADM selectors can be further refined in order to select a more specific portion. That's done with oa:refinedBy, and there are several refinement mechanisms, e.g., quote, string index, or even XPath again. A specific resource described by a refined selector should have a different URI than one described by an unrefined selector.
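As a small illustration of refinement, the following Python sketch nests a second selector under `refinedBy`, the WADM JSON-LD key for oa:refinedBy. The helper `refine` is hypothetical, and the character offsets in the TextPositionSelector are made up for the example.

```python
import copy


def refine(selector: dict, refinement: dict) -> dict:
    """Return a copy of the selector that is narrowed by a second selector."""
    refined = copy.deepcopy(selector)
    refined["refinedBy"] = refinement
    return refined


verse = {
    "type": "FragmentSelector",
    "conformsTo": "https://w3id.org/dts/api#",
    "value": "tree=wadm&ref=John:1:3",
}
# Hypothetical character offsets within the verse, for illustration only.
first_chars = {"type": "TextPositionSelector", "start": 0, "end": 3}

print(refine(verse, first_chars))
```

The helper returns a new object and leaves the unrefined selector untouched, which mirrors the point that the refined and the unrefined resource are distinct and should not share a URI.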

What specification work would need to be done for such an alignment with WADM?

  1. Allow an optional xpath property on a CitableUnit object and define it in the dts namespace. Its value should be a path expression.
  2. Specify DTS as a WADM selector scheme. This can be done outside of the specifications of the endpoints and is independent work.
