Invalid resource maps can be submitted to Metacat without error #1981

Open
robyngit opened this issue Oct 8, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@robyngit
Member

robyngit commented Oct 8, 2024

While investigating our collection of submission errors in MetacatUI, I discovered that it's possible to submit an invalid resource map to Metacat and receive a 200 status without any error. I would expect resource maps to be validated in the same way that sysmeta and EML objects are.

Here's a reproducible example:

  1. Create an invalid resource map. In my case, I saved a resource_map.xml file with the following text:
    <?xml version="1.0" encod

  2. Create sysmeta for the object. I used this sysmeta_template.rdf.xml:

<d1_v2.0:systemMetadata xmlns:d1_v2.0="http://ns.dataone.org/service/types/v2.0"
  xmlns:d1="http://ns.dataone.org/service/types/v1">
  <serialVersion>0</serialVersion>
  <identifier>RESOURCE MAP ID HERE</identifier>
  <formatId>http://www.openarchives.org/ore/terms</formatId>
  <size>25</size>
  <checksum algorithm="MD5">9614dd15192a58ae2a91a6243e70a992</checksum>
  <submitter>http://orcid.org/0000-0002-1615-3963</submitter>
  <rightsHolder>http://orcid.org/0000-0002-1615-3963</rightsHolder>
  <accessPolicy>
    <allow>
      <subject>public</subject>
      <permission>read</permission>
    </allow>
    <allow>
      <subject>CN=arctic-data-admins,DC=dataone,DC=org</subject>
      <permission>read</permission>
      <permission>write</permission>
      <permission>changePermission</permission>
    </allow>
  </accessPolicy>
  <fileName>resource_map.xml</fileName>
</d1_v2.0:systemMetadata>

You'll want to change the submitter to your ORCID.
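
For anyone adapting the template: the `<size>` and `<checksum>` values above appear to describe the 25-byte truncated file from step 1, so they would presumably need recomputing for any other file. A minimal sketch in Python, standard library only; the helper name `sysmeta_size_and_md5` is made up for illustration and is not part of any DataONE tooling:

```python
import hashlib


def sysmeta_size_and_md5(content: bytes) -> tuple[int, str]:
    """Return (size in bytes, MD5 hex digest) for an object's raw bytes."""
    return len(content), hashlib.md5(content).hexdigest()


# The truncated file from step 1, assumed here to have no trailing newline.
resource_map_bytes = b'<?xml version="1.0" encod'

size, md5_hex = sysmeta_size_and_md5(resource_map_bytes)
print(f"<size>{size}</size>")
print(f'<checksum algorithm="MD5">{md5_hex}</checksum>')
```

In practice you would read the bytes from `resource_map.xml` instead of hard-coding them.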

  3. Generate a PID, update the sysmeta template, then upload the resource map + sysmeta to a test node:
# 1. Set your token
TOKEN="your-token-here"

# 2. Generate the pid
PID="resource_map_urn:uuid:$(uuidgen)"

# 3. Make a copy of the sysmeta with the new PID
cp sysmeta_template.rdf.xml sysmeta.rdf.xml
sed -i '' "s/RESOURCE MAP ID HERE/$PID/" sysmeta.rdf.xml

echo "\nUploading bad resource map with PID: $PID"

echo "\nResource Map:\n"
cat resource_map.xml

echo "\n\nSysmeta:\n"
cat sysmeta.rdf.xml

echo "\n\n\n OUTPUT FROM CURL COMMAND: \n"

/opt/homebrew/opt/curl/bin/curl -i \
  -X POST \
  -H "Accept: */*" \
  -H "Authorization: Bearer $TOKEN" \
  -F "pid=$PID" \
  -F "sysmeta=@sysmeta.rdf.xml;type=application/xml" \
  -F "object=@resource_map.xml;type=application/xml" \
  "https://dev.nceas.ucsb.edu/knb/d1/mn/v2/object"

echo "\n\n Done"
  4. See that the server returns an HTTP/1.1 200 200 status along with the PID for the resource map:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<identifier xmlns="http://ns.dataone.org/service/types/v1">resource_map_urn:uuid:7286F53A-D29B-4087-9CE8-DEE244EEE5F6</identifier>

The file then exists on the server but, of course, is not really a resource map.
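
As a stopgap, a client could catch this particular failure before upload. The sketch below only tests XML well-formedness with Python's standard library, which is necessary but nowhere near sufficient for a valid RDF/XML resource map. The function name `xml_parse_error` is hypothetical, and none of this reflects what Metacat or MetacatUI actually do.

```python
import xml.etree.ElementTree as ET


def xml_parse_error(content: bytes):
    """Return None if content is well-formed XML, else the parser's message."""
    try:
        ET.fromstring(content)
    except ET.ParseError as exc:
        return str(exc)
    return None


# The truncated file from the reproducible example above.
bad = b'<?xml version="1.0" encod'

# An empty but well-formed RDF/XML document, used only for comparison.
good = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
)

print("bad: ", xml_parse_error(bad))
print("good:", xml_parse_error(good))
```

The truncated file is rejected by the parser, while the empty `rdf:RDF` document passes even though it contains no resource map at all, which is why structural checks would still be needed.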


Here's the code above as downloadable files (just remember to remove the .txt).
sysmeta_template.rdf.xml.txt
create_res_map.sh.txt
resource_map.xml.txt

@robyngit robyngit added the bug Something isn't working label Oct 8, 2024
@mbjones
Member

mbjones commented Oct 8, 2024

Metacat only validates selected data formats based on their formatId. As far as I know, only XML metadata documents are validated, and then only if they have an XML schema registered with Metacat for that document format. We've talked about adding a SHACL validator for RDF resource maps, but haven't done so to date. As RDF is an open world model, and any triples you want can be added, it's hard to say what the right schema to enforce would be. I suppose enforcing the bare minimum structure would make sense, e.g., that there is an ore:ResourceMap with an ore:Aggregation, and that each member of the aggregation has a dcterms:identifier. DataONE lists its resource map requirements here: https://dataoneorg.github.io/api-documentation/design/DataPackage.html#generating-resource-maps

So from those DataONE rules linked above, the items to validate might include:

  1. Document is well-formed RDF
  2. All DataONE objects in the map MUST be expressed as a URI using DataONE’s resolving service
  3. The graph MUST contain an ore:ResourceMap and an ore:Aggregation
  4. The resource map MUST assert a triple with the ore:describes/ore:isDescribedBy relationship between the resource map and the aggregation
  5. Each DataONE object in the aggregation MUST be described with a dcterms:identifier field containing the DataONE identifier.
  6. When expressing an identifier in a URI, it must be URL encoded. When expressing it in the dcterms:identifier field, it must not be. (Of course, any XML encoding would need to be applied as well; in the example below, none is needed.)

Here's what a minimal resource map might contain if the package has one metadata object and one data object and follows these rules:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix cito: <http://purl.org/spar/cito/> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ore: <http://www.openarchives.org/ore/terms/> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix provone: <http://purl.dataone.org/provone/2015/01/15/ontology#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix dataone: <https://cn.dataone.org/cn/v2/resolve/> .

<dataone:METADATA_ID>
    dcterms:identifier "METADATA_ID"^^xsd:string ;
    cito:documents <dataone:METADATA_ID>, <dataone:DATAOBJ_ID> ;
    cito:isDocumentedBy <dataone:METADATA_ID> ;
    ore:isAggregatedBy <dataone:RESOURCE_MAP_ID#aggregation> .

<dataone:RESOURCE_MAP_ID>
    dcterms:creator [
        a dcterms:Agent ;
        foaf:name "DataONE R Client"^^xsd:string
    ] ;
    dcterms:identifier "RESOURCE_MAP_ID"^^xsd:string ;
    dcterms:modified "2024-10-08T20:24:47Z"^^xsd:dateTime ;
    ore:describes <dataone:RESOURCE_MAP_ID#aggregation> ;
    a ore:ResourceMap .

<dataone:RESOURCE_MAP_ID#aggregation>
    dc:title "DataONE Aggregation" ;
    ore:aggregates <dataone:METADATA_ID>, <dataone:DATAOBJ_ID> ;
    a ore:Aggregation .

<dataone:DATAOBJ_ID>
    dcterms:identifier "DATAOBJ_ID"^^xsd:string ;
    cito:isDocumentedBy <dataone:METADATA_ID> ;
    ore:isAggregatedBy <dataone:RESOURCE_MAP_ID#aggregation> .
flowchart TD
    A(ore:ResourceMap RESOURCE_MAP_ID) -->|ore:describes| B(ore:Aggregation)
    B --> |ore:aggregates| C(METADATA_ID)
    B --> |ore:aggregates| D(DATAOBJ_ID)
    C --> |cito:documents| C
    C --> |cito:documents| D

So, we'd need SHACL rules for those conditions listed above. Would that be sufficient? Also, how would we deal with RMs that are currently in the system but are not valid according to those rules?
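
To make the conditions concrete, here is a rough sketch of rules 2 to 6 as plain Python checks over `(subject, predicate, object)` string triples. It is not SHACL and not anything Metacat implements. It assumes an RDF parser has already produced the triples (which covers rule 1), and the names `check_resource_map` and `resolve_uri` are invented for illustration.

```python
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ORE = "http://www.openarchives.org/ore/terms/"
DCTERMS_IDENTIFIER = "http://purl.org/dc/terms/identifier"
RESOLVE_BASE = "https://cn.dataone.org/cn/v2/resolve/"


def resolve_uri(pid: str) -> str:
    """Resolve URI for a PID, with the PID URL-encoded (rules 2 and 6)."""
    return RESOLVE_BASE + quote(pid, safe="")


def check_resource_map(triples) -> list[str]:
    """Return a list of rule violations; an empty list means all checks passed."""
    triples = set(triples)
    problems = []

    def of_type(type_uri):
        return {s for s, p, o in triples if p == RDF_TYPE and o == type_uri}

    # Rule 3: the graph contains an ore:ResourceMap and an ore:Aggregation.
    maps = of_type(ORE + "ResourceMap")
    aggregations = of_type(ORE + "Aggregation")
    if not maps:
        problems.append("no ore:ResourceMap")
    if not aggregations:
        problems.append("no ore:Aggregation")

    # Rule 4: the map and the aggregation are linked in either direction.
    linked = any(
        (p == ORE + "describes" and s in maps and o in aggregations)
        or (p == ORE + "isDescribedBy" and s in aggregations and o in maps)
        for s, p, o in triples
    )
    if not linked:
        problems.append("no ore:describes/ore:isDescribedBy link")

    # Rules 2, 5, 6: each aggregated member has an unencoded dcterms:identifier
    # whose URL-encoded form yields the member's resolve URI.
    members = {
        o for s, p, o in triples
        if p == ORE + "aggregates" and s in aggregations
    }
    for member in sorted(members):
        ids = [
            o for s, p, o in triples
            if s == member and p == DCTERMS_IDENTIFIER
        ]
        if not ids:
            problems.append(f"{member}: missing dcterms:identifier")
        elif not any(member == resolve_uri(i) for i in ids):
            problems.append(f"{member}: URI does not match its identifier")
    return problems


# The minimal package from the example above, using its placeholder PIDs.
rm = resolve_uri("RESOURCE_MAP_ID")
agg = rm + "#aggregation"
example = [
    (rm, RDF_TYPE, ORE + "ResourceMap"),
    (rm, ORE + "describes", agg),
    (agg, RDF_TYPE, ORE + "Aggregation"),
    (agg, ORE + "aggregates", resolve_uri("METADATA_ID")),
    (agg, ORE + "aggregates", resolve_uri("DATAOBJ_ID")),
    (resolve_uri("METADATA_ID"), DCTERMS_IDENTIFIER, "METADATA_ID"),
    (resolve_uri("DATAOBJ_ID"), DCTERMS_IDENTIFIER, "DATAOBJ_ID"),
]
print(check_resource_map(example))  # → []
```

Returning a list of violations rather than a boolean would also make it easier to audit resource maps already in the system without rejecting them outright.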

@robyngit
Member Author

We've talked about adding a SHACL validator for RDF resource maps, but haven't done so to date.

@mbjones the DataONE docs indicate that resource maps go through a validation process:

Because DataONE indexing relies on the integrity of the resource maps it receives from the member nodes, each resource map will be validated against the set of constraints enumerated above. Resource maps that do not validate will fail synchronization, and the exception returned to the member node via the method MN_Read.synchronizationFailed.

From: Data Packaging: Resource map validation

Are the checks you described above included in this validation? Is there any way for MetacatUI to access the MN_Read.synchronizationFailed error?

@mbjones
Member

mbjones commented Oct 21, 2024

IIRC, that is part of the CN Synchronization process, and not a Metacat-based validator. So your RM gets saved to Metacat, and then at some later time (minutes, hours, days) it is synced to the CN, during which an MN_Read.synchronizationFailed might be called (but it gets fairly silently logged as an async process). Not ideal.
