Provide long-term hosting solutions for RDF/Linked Data #81

Open
chiarcos opened this issue Mar 5, 2021 · 6 comments

@chiarcos

chiarcos commented Mar 5, 2021

This concerns academic data and demonstrators in particular, and it is less about technology or formalisms than about politics.

Long-term availability is a problem for Linked Data technology in general. Infrastructure providers (e.g., libraries) can be hesitant to provide SPARQL endpoints, and long-term maintenance of such services can probably not be expected at all. Case in point: the infamous British Museum SPARQL endpoint, once hailed in Digital Humanities but now largely defunct. There are challenges here that can be addressed on a technical level, but that is not what this issue is about.

A minimal requirement to ensure long-term (re-)usability of Linked Data and the feasibility of federation (as unique selling points for RDF technology) would be to guarantee the accessibility of the data itself, ideally in a way that allows resolvable URIs.
Indeed, major infrastructures for hosting research data have emerged in different fields, e.g., Zenodo, DSpace, or CLARIN, and these could provide exactly that, at least for the academic sector. However, they currently do not seem to support RDF formats as MIME types for deposited data (for metadata, they do). See zenodo/zenodo#1515 for the discussion on Zenodo. So the MIME type that RDF data is published under is normally text/plain. This means that applications need to guess the format when they try to resolve URIs against a resource. This can work, but it is unreliable. In particular, it will fail if URIs do not include the file extension (as is recommended, because content negotiation is supposed to take care of the format, except that it is not available here), or if the data URI carries any flags after the file extension (e.g., "...?download=1").

Example:
FAILURE (using Zenodo-provided data link)
http://www.sparql.org/sparql?query=SELECT+*%0D%0AFROM+%3Chttps%3A%2F%2Fzenodo.org%2Frecord%2F4444132%2Ffiles%2Fcrmtex.owl%3Fdownload%3D1%3E%0D%0AWHERE+%7B+%3Fa+%3Fb+%3Fc+%7D+LIMIT+10&default-graph-uri=&output=xml&stylesheet=%2Fxml-to-html.xsl

SUCCESS (skipping download flags, providing full file extension)
http://www.sparql.org/sparql?query=SELECT+*%0D%0AFROM+%3Chttps%3A%2F%2Fzenodo.org%2Frecord%2F4444132%2Ffiles%2Fcrmtex.owl%3E%0D%0AWHERE+%7B+%3Fa+%3Fb+%3Fc+%7D+LIMIT+10&default-graph-uri=&output=xml&stylesheet=%2Fxml-to-html.xsl
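
To illustrate why the two URLs above can behave differently, here is a minimal sketch of extension-based format guessing. It is not the code of sparql.org or of any particular RDF library; the function names and the extension table are assumptions made for this example, and mapping .owl to RDF/XML is itself only a guess.

```python
import posixpath
from urllib.parse import urlsplit

# Illustrative extension table (an assumption for this sketch, not exhaustive).
RDF_EXTENSIONS = {
    ".ttl": "text/turtle",
    ".nt": "application/n-triples",
    ".rdf": "application/rdf+xml",
    ".owl": "application/rdf+xml",
    ".jsonld": "application/ld+json",
}


def guess_naive(url):
    """Guess from the raw URL string; a trailing query string ends up in the 'extension'."""
    _, ext = posixpath.splitext(url)
    return RDF_EXTENSIONS.get(ext.lower())


def guess_path_only(url):
    """Guess from the path component only, ignoring query string and fragment."""
    _, ext = posixpath.splitext(urlsplit(url).path)
    return RDF_EXTENSIONS.get(ext.lower())


with_flag = "https://zenodo.org/record/4444132/files/crmtex.owl?download=1"
without_flag = "https://zenodo.org/record/4444132/files/crmtex.owl"

print(guess_naive(with_flag))        # no match: the "extension" is ".owl?download=1"
print(guess_naive(without_flag))     # matches ".owl"
print(guess_path_only(with_flag))    # matches ".owl" once the query string is ignored
```

Even the more careful variant returns nothing for a URI without a file extension, which is why a declared media type is the more robust signal.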

Actually, this is very easy to fix: we just need to petition the maintainers and developers of such infrastructures, repeatedly and in large numbers, to make data declarable as text/turtle (etc.) rather than just text/plain.

@HughGlaser
Collaborator

HughGlaser commented Mar 5, 2021 via email

@chiarcos
Author

chiarcos commented Mar 5, 2021

> The best way to get people to maintain Linked Data offerings is if they get used, and for things that the provider thinks are useful, and enhance things.

The point here is that we have to distinguish the provider, the host and the consumer. For most Linked Open Data, the gain is with the consumer, sometimes with the provider, but usually not with the host.

In an academic context (where much LOD is being produced), the provider (the scientist/project) has an interest in dissemination and re-use. This can help build a career, especially if coupled with attribution, and it's a good signal to send to funders. But the provider is not the host.

The provider could be the host: he can put it on some institute's website, where it will remain for, say, 5 years, and then silently disappear. He has probably moved on since, and doesn't care either. Or, more likely, he just doesn't have the resources for a reasonable hosting solution. In any case, whatever project created that data will have run out of funding.

Still, there may be consumers interested in the data. This can be commercial, or it could be research. And indeed, they may continue to operate on local copies. Except that these copies aren't Linked Data anymore because they don't resolve.

The British Museum scenario is similar, but not quite the same. Their LOD portal has been a publicity or political thing, I guess, possibly encouraged by the push of Europeana to LOD (who still run their LOD infrastructure, because it's compatible with their vision). But it was never used internally, because museums, libraries, etc., tend to be conservative (for good reasons, but sometimes this means working with pre-1990 technology, think of MARC or PICA3). However, it was used by the community. Quite a lot, as far as I can tell, which is why there are copies around (we have one, too). But they're not Linked Data anymore, as the URIs won't resolve, and they never will be again unless the British Museum decides to do something about it. But there is no gain in that for them, so they won't. (And there is a cost associated with it, i.e., making sure that their internal catalogue and the LOD version stay synchronized. This is also why just republishing their old data with new, resolving URIs is not a good solution either: the data cannot be trusted anymore, and if I recall correctly, it lacks any versioning information, so we cannot even say for which point in time this snapshot was valid.)

This is only one scenario. There are others. In my case, I struggle to find a sustainable hosting solution for some thousand bilingual dictionaries in RDF. As far as I can tell, this is the largest collection of machine-readable open source bilingual dictionaries in existence. I could get them to Zenodo or just keep them on GitHub (where the smaller ones are right now), but not with resolving URIs and/or under the right media type. Which is why they aren't Linked Data yet. This massively reduces their potential to be re-used.

But now we actually do have hosting solutions for scientific data, except that they don't support resolvable URIs. If that doesn't change, Linked Data (in academia) is basically dead.

Which is a pity, because it's the ideal technology to achieve FAIRness (https://www.go-fair.org/fair-principles/), and this has become a major criterion in scientific grant approval. So, there is a problem (how to improve data re-use in science), there is a demand (achieving FAIRness), there is a technology (Linked Data), there even are hosting services specifically designed to support that (FAIRness), except that they don't (provide resolvable URIs, or at least not with the right MIME type).

> So we don't forget:- if the URIs are not resolvable, it isn't Linked Data. Best Hugh

Exactly my point. But the media type issue also extends to non-LOD RDF data, so this is not exclusively a LOD problem.

@afs
Contributor

afs commented Mar 5, 2021

> just text/plain

Data point: In this case it's application/octet-stream.
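
For anyone who wants to check what a repository actually declares for a deposited file, a small sketch along these lines may help. The helper names and the set of RDF media types are assumptions made for this example (the set is not exhaustive), and `fetch_content_type` needs network access, so it is defined here but not called.

```python
from urllib.request import Request, urlopen

# Common RDF media types (an illustrative, non-exhaustive set).
RDF_MEDIA_TYPES = {
    "text/turtle",
    "application/rdf+xml",
    "application/n-triples",
    "application/n-quads",
    "application/trig",
    "application/ld+json",
}


def media_type(content_type_header):
    """Strip parameters such as charset and normalise case."""
    return content_type_header.split(";", 1)[0].strip().lower()


def declares_rdf(content_type_header):
    """True if the declared media type is one of the known RDF types."""
    return media_type(content_type_header) in RDF_MEDIA_TYPES


def fetch_content_type(url, timeout=10):
    """Return the Content-Type header a server sends for url (needs network access)."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type", "")


# The two values mentioned in this thread, plus what a client would hope to see:
for declared in ("text/plain", "application/octet-stream", "text/turtle; charset=utf-8"):
    print(declared, "->", declares_rdf(declared))
```

With either text/plain or application/octet-stream, a client learns nothing about the RDF serialization and has to fall back on guessing.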

@namedgraph

@chiarcos have you checked https://w3id.org/ ? I think it could work for your vocabularies.

@pchampin

pchampin commented Jun 4, 2021

Nitpicking: w3id.org is not a host, it is a redirection service. It does not directly solve the problem @chiarcos is raising.

However, it can indeed be a part of the solution: decoupling the hosting from the naming helps make the dataset robust. If the British Museum had used w3id.org IRIs, then those IRIs could now redirect to one of the mirrors, and this would still be bona fide Linked Data. My 2¢.

@chiarcos
Author

chiarcos commented Jun 4, 2021

> Nitpicking: w3id.org ... can indeed be a part of the solution: decoupling the hosting from the naming helps make the dataset robust. If the British Museum had used w3id.org IRIs, then those IRIs could now redirect to one of the mirrors, and this would still be bona fide Linked Data. My 2¢.

True, and that's how it should work. But then, we still need either some institution (or a mechanism; not a single person) that updates the redirect if the preferred mirror fails, or we need sustainable hosting. I guess Zenodo is sustainable (for example), but URIs will only resolve in a robust fashion if the right media type is declared (or if the file extension/persistent URI indicates the format -- but this is a client-side solution, not necessarily robust, and often considered a design flaw from the data modelling side). What I want is for Zenodo (etc.) to support text/turtle (etc.). In combination with redirection (be it via w3id.org or another service), this is the gold solution.
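
As a rough illustration of what "supporting text/turtle" means on the server side, here is a simplified sketch of content negotiation over the Accept header. It is not how Zenodo or any other repository works; the function names are assumptions made for this example, and the handling is deliberately simplified (media range parameters other than q are ignored, and an empty Accept header yields no match).

```python
def parse_accept(header):
    """Parse an Accept header into (media range, q) pairs."""
    ranges = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        ranges.append((fields[0].lower(), q))
    return ranges


def quality(media_type, ranges):
    """Return the q-value of the most specific range matching media_type."""
    main_type = media_type.split("/", 1)[0]
    for pattern in (media_type, main_type + "/*", "*/*"):
        for media_range, q in ranges:
            if media_range == pattern:
                return q
    return 0.0


def negotiate(accept_header, offered):
    """Pick the offered media type the client prefers, or None if none is acceptable."""
    ranges = parse_accept(accept_header)
    best, best_q = None, 0.0
    for media_type in offered:
        q = quality(media_type, ranges)
        if q > best_q:
            best, best_q = media_type, q
    return best


offered = ["application/rdf+xml", "text/turtle"]
print(negotiate("text/turtle, application/rdf+xml;q=0.8", offered))
```

The point of the sketch is only that the format decision is made from declared media types on both sides, so the URI itself does not need to carry a file extension.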

As for automatically updating the redirect, DOI comes with the possibility to specify multiple redirection targets and with a selection mechanism among them, but as far as I know, real-world DOI resolution systems require manual selection if multiple URL targets are provided, so this is not a LOD-compliant solution either. (The resolution spec has more options, but "use second URL if the first fails" is not among them.)
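
The "use second URL if the first fails" behaviour is simple to state in code. The following is only a sketch of the selection logic, not a feature of DOI resolvers or of w3id.org; the function names and the mirror URLs are hypothetical, and `http_available` needs network access, so the usage lines below substitute a stand-in check.

```python
from urllib.error import URLError
from urllib.request import Request, urlopen


def first_available(targets, is_available):
    """Return the first target for which is_available(target) is true, else None."""
    for target in targets:
        if is_available(target):
            return target
    return None


def http_available(url, timeout=10):
    """Availability check via a HEAD request (needs network access)."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return 200 <= response.status < 300
    except (URLError, OSError):
        return False


# Hypothetical mirrors of one dataset, in order of preference.
mirrors = [
    "https://mirror-a.example/dataset.ttl",
    "https://mirror-b.example/dataset.ttl",
]

# Stand-in check that pretends the preferred mirror is down.
print(first_available(mirrors, lambda url: "mirror-b" in url))
```

A redirection service could run this kind of check before answering with a redirect, which would replace the person or institution that otherwise has to update the target by hand.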
