Extend runCheck() to accept schema.org JSON metadata#9
Extend runCheck() to accept schema.org JSON metadata#9lindsayplatt wants to merge 18 commits intoNCEAS:developfrom
runCheck() to accept schema.org JSON metadata#9Conversation
…t it accepts more than just XML docs now
jeanetteclark
left a comment
There was a problem hiding this comment.
Overall this is awesome. We have some design decisions about where to put the jq in the checks themselves, I'll start a conversation about it. In the meantime, there is some minor stuff to fix here for now
R/runCheck.R
Outdated
| metadataDoc <- paste(readLines(metadataFILE), collapse=' ') # `jq` functions need JSON as string | ||
| metadataDocNoNS <- metadataDoc # Namespaces don't matter for JSON, so just make a copy | ||
| # Currently only supporting schema.org | ||
| if(!grepl('schema.org', metadataDoc)) |
There was a problem hiding this comment.
it would probably be more robust to run an actual jq expression here, checking that @context is https://schema.org as opposed to looking in the entire document for mention of schema.org
There was a problem hiding this comment.
yes, for sure. Also note that we already have a defined check for this in our SHACL shapes. It is quite tricky to properly interpret the schema context URIs. There's a long discussion of this in the SOSO github issues and guidance document.
| </namespaces> | ||
| <namespaceAware>true</namespaceAware> | ||
| </selector> | ||
| <selector> |
There was a problem hiding this comment.
usually when a dialect is added, the xpath is added to the selector element like this
<selector>
<name>datasetTitle</name>
<xpath>/resource/titles/title |
/*/title |
/eml/dataset/title |
/*/identificationInfo/*/citation/CI_Citation/title
</xpath>
</selector>
but I'm not sure how we want to handle this with the jq queries. we could either shoehorn it into the existing structure or add an element to the schema. I'll need to think about this one
There was a problem hiding this comment.
The syntax above is a valid XPath expression. I don't think mixing a json-path expression in would make sense -- I feel that we need abother field for the json-path expression. Or maybe we need to add a more extensible query block that is repeatable (e..g, <query syntax="json-path">.name</query> or something like that. And maybe it could (or should?) include a dialect attribute that says which metadata dialect these apply to (compared to what we did before to cram all of the selctors into a single xpath expression).
There was a problem hiding this comment.
@mbjones I also posted about this in the DataONE slack and agree that mixing expressions doesn't make sense. I like the idea of the repeatable query, especially when combined with the dialect, but that is a pretty major refactor and we definitely have to make sure we are backwards compatible
There was a problem hiding this comment.
Yeah, fully agree we should be careful here. At a minimum we could add the new element and make it optional and repeatable, which hopefully doesn't in ay way interfere with or change how the xpath element is used. It might not be too bad to refactor it all though, given that you'll need to introduce new logic in the Java code to handle json expressions differently from xpath. so seems like you'll need to refactor that part of the code anyways. A nice transition approach would be to 1) make xpath and query optional; 2) modify your selector processing code to extract all xpaths and json paths from both fields, and then 3) launch the queries needed to get and combine the values to pass on to the check code.
tests/testthat/test_run_suite.R
Outdated
| expect_error(runSuite(suiteXML, dirXML, 7, 7)) | ||
| expect_error(runSuite(c(suiteXML, dirXML), dirXML, metadataXML)) | ||
| expect_error(runSuite(suiteXML, c(suiteXML, dirXML), metadataXML)) | ||
| expect_error(runSuite(c(suiteXML, dirXML), dirXML, metadataFILE)) |
There was a problem hiding this comment.
can you add the test json to the test data directory, and update this check so that it works correctly? I think it will either require restructuring that directory, or rewriting this test, or both. when I added the json the test failed I think because it is interpreting it as a check not a document to be checked
There was a problem hiding this comment.
I added it as an example dataset in extdata, but the issue is not that it thinks the new json file is one of the check XML files (there are still only the 4 that end up in the suite value, see below. The issue is that I have only updated the title length check XML file to have a jq command for the schema.org JSON. So all of the other checks fail on the JSON right now.
basename(suite)
[1] "dataset_title_length-check.xml" "datatype_check.xml"
[3] "entity_attributes_sufficient_check.xml" "methods_present.xml"
There was a problem hiding this comment.
The tests won't fail for now but once we get the jq commands added to the other check XML files, we will need to remove the code I added here: 48b6e78
…length check works
…le length check XML has the jq command added; skip it for now
Partially addresses #8. This is for @jeanetteclark.
Things still missing:
dataset_title_length-check.xml. Others should be added so that all checks in this example suite can be run on schema.org JSONinst/extdataor to the test suite.dialectpart of the<mdq:check>documents, so the code currently skips the call toisCheckValid()duringrunCheck()if a JSON file is being used as the metadata. Note that I have this flagged with aTODOin the R function.Below is the example code and attached is the example file that I have been using as I developed this approach.
hs_metadata_example_vaforest.json