Extend runCheck() to accept schema.org JSON metadata #9

Draft
lindsayplatt wants to merge 18 commits into NCEAS:develop from lindsayplatt:issue08_lplatt_runCheck_JSON

@lindsayplatt

Partially addresses #8. This is for @jeanetteclark.

Things still missing:

  1. I only added schema.org JSON selectors to the example dataset_title_length-check.xml. Selectors for the other checks should be added so that all checks in this example suite can be run on schema.org JSON.
  2. I did not add any example JSON data to inst/extdata or to the test suite.
  3. I was not sure how to handle the dialect part of the <mdq:check> documents, so the code currently skips the call to isCheckValid() during runCheck() if a JSON file is being used as the metadata. Note that I have this flagged with a TODO in the R function.

Below is the example code and attached is the example file that I have been using as I developed this approach.

hs_metadata_example_vaforest.json

# Load package by building locally on this branch
# Also need to load `jqr` unless we want to move jqr from
# `Suggests` to `Imports` in the DESCRIPTION file
library(jqr)

sysmetaXML <- system.file("extdata/example_sysmeta.xml", package = "metadig")

# I added the `schema.org` selectors to the example checks
checkXML_title <- "inst/extdata/dataset_title_length-check.xml"

# Run this check with the XML example
metadataXML <- system.file("extdata/example_EML.xml", package = "metadig")
result_xml <- runCheck(checkXML_title, metadataXML, sysmetaXML)

# Run this check with the JSON file from a published HS resource
metadataJSON <- 'hs_metadata_example_vaforest.json'
result_json <- runCheck(checkXML_title, metadataJSON, sysmetaXML)

result_xml$value
result_json$value

@lindsayplatt marked this pull request as draft February 24, 2025 22:20

@jeanetteclark left a comment

Overall this is awesome. We have some design decisions to make about where to put the jq in the checks themselves; I'll start a conversation about it. In the meantime, there is some minor stuff to fix here.

R/runCheck.R Outdated
metadataDoc <- paste(readLines(metadataFILE), collapse=' ') # `jq` functions need JSON as string
metadataDocNoNS <- metadataDoc # Namespaces don't matter for JSON, so just make a copy
# Currently only supporting schema.org
if(!grepl('schema.org', metadataDoc))

It would probably be more robust to run an actual jq expression here that checks that @context is https://schema.org, rather than searching the entire document for any mention of schema.org.

Member

Yes, for sure. Also note that we already have a defined check for this in our SHACL shapes. It is quite tricky to properly interpret the schema context URIs. There's a long discussion of this in the SOSO GitHub issues and guidance document.
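
A minimal sketch of the jq-based check suggested in this thread, assuming `jqr` is installed. The helper name `is_schema_org_context()` is hypothetical and not part of metadig. It only covers the simple cases where `@context` is a string, an object with `@vocab`, or an array of those, so it is not a substitute for the SHACL check mentioned above.

```r
library(jqr)

# Hypothetical helper (not part of metadig): TRUE when the top-level @context
# names schema.org, either directly or through @vocab.
is_schema_org_context <- function(json_string) {
  expr <- paste(
    # Wrap @context in an array so strings, objects, and arrays are handled alike
    '[."@context"] | flatten',
    # For object-valued contexts, look at @vocab (null when absent)
    '| map(if type == "object" then ."@vocab" else . end)',
    # Accept http or https, with or without a trailing slash
    '| any(.[]; (rtrimstr("/") | ltrimstr("https://") | ltrimstr("http://")) == "schema.org")'
  )
  result <- jqr::jq(json_string, expr)
  identical(as.character(result), "true")
}

# Assumption: in runCheck() this could stand in for the grepl() call, e.g.
# if (!is_schema_org_context(metadataDoc)) stop("Only schema.org JSON is supported")
```

Unlike the `grepl()` approach, a document that merely mentions schema.org somewhere in its text would not pass this check.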

</namespaces>
<namespaceAware>true</namespaceAware>
</selector>
<selector>

Usually when a dialect is added, the xpath is added to the selector element like this:

<selector>
  <name>datasetTitle</name>
  <xpath>/resource/titles/title |
         /*/title |
         /eml/dataset/title |
         /*/identificationInfo/*/citation/CI_Citation/title
  </xpath>
</selector>

But I'm not sure how we want to handle this with the jq queries. We could either shoehorn it into the existing structure or add an element to the schema. I'll need to think about this one.

Member

The syntax above is a valid XPath expression. I don't think mixing a json-path expression in would make sense; I feel that we need another field for the json-path expression. Or maybe we need to add a more extensible query block that is repeatable (e.g., <query syntax="json-path">.name</query> or something like that). And maybe it could (or should?) include a dialect attribute that says which metadata dialect these apply to (compared to what we did before to cram all of the selectors into a single xpath expression).

@jeanetteclark Mar 3, 2025

@mbjones I also posted about this in the DataONE Slack and agree that mixing expressions doesn't make sense. I like the idea of the repeatable query, especially when combined with the dialect, but that is a pretty major refactor, and we definitely have to make sure we stay backwards compatible.

Member

Yeah, fully agree we should be careful here. At a minimum we could add the new element and make it optional and repeatable, which hopefully doesn't in any way interfere with or change how the xpath element is used. It might not be too bad to refactor it all though, given that you'll need to introduce new logic in the Java code to handle JSON expressions differently from XPath, so it seems like you'll need to refactor that part of the code anyway. A nice transition approach would be to 1) make xpath and query optional; 2) modify your selector processing code to extract all xpaths and JSON paths from both fields; and then 3) launch the queries needed to get and combine the values to pass on to the check code.
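
One way the optional, repeatable element discussed in this thread might look inside a selector. This is only a sketch: the `<query>` element and its `syntax` and `dialect` attributes are not part of the current check schema, and the attribute values shown are placeholders.

```xml
<selector>
  <name>datasetTitle</name>
  <!-- Existing element, left unchanged so current checks keep working -->
  <xpath>/resource/titles/title |
         /*/title |
         /eml/dataset/title |
         /*/identificationInfo/*/citation/CI_Citation/title
  </xpath>
  <!-- Hypothetical new element: optional and repeatable, one per dialect -->
  <query syntax="jq" dialect="schema.org">.name</query>
</selector>
```

With both elements optional, existing check documents would stay valid, and the selector processing code could gather the values from whichever elements are present.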

expect_error(runSuite(suiteXML, dirXML, 7, 7))
expect_error(runSuite(c(suiteXML, dirXML), dirXML, metadataXML))
expect_error(runSuite(suiteXML, c(suiteXML, dirXML), metadataXML))
expect_error(runSuite(c(suiteXML, dirXML), dirXML, metadataFILE))

Can you add the test JSON to the test data directory and update this check so that it works correctly? I think it will require restructuring that directory, rewriting this test, or both. When I added the JSON, the test failed, I think because it is interpreting it as a check rather than as a document to be checked.

Author

I added it as an example dataset in extdata, but the issue is not that it thinks the new json file is one of the check XML files (there are still only the 4 that end up in the suite value, see below). The issue is that I have only updated the title length check XML file to have a jq command for the schema.org JSON, so all of the other checks fail on the JSON right now.

basename(suite)
[1] "dataset_title_length-check.xml"         "datatype_check.xml"                    
[3] "entity_attributes_sufficient_check.xml" "methods_present.xml"

Author

The tests won't fail for now, but once we get the jq commands added to the other check XML files, we will need to remove the code I added here: 48b6e78
