The NDE Dataset Knowledge Graph helps researchers, software developers and others to find relevant datasets for their projects. It consists of Dataset Summaries that provide statistical information about datasets.
This repository is the data pipeline that generates the Knowledge Graph.
To query the Knowledge Graph, use the SPARQL endpoint at https://triplestore.netwerkdigitaalerfgoed.nl/repositories/dataset-knowledge-graph.
Some example queries (make sure to select the repository dataset-knowledge-graph at the top right; a sketch of one such query follows the list):
- links from datasets to terminology sources
- property partitions per class
- percentage of URI objects vs literals
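For instance, a query that lists outgoing links from datasets to terminology sources might look like this (a minimal sketch, assuming the void:Linkset model described under Analyzers below):

PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?dataset ?source ?links WHERE {
  # Each Linkset records links from a dataset to a terminology source.
  [] a void:Linkset;
    void:subjectsTarget ?dataset;
    void:objectsTarget ?source;
    void:triples ?links.
}
ORDER BY DESC(?links)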
This datastory shows more queries against the Knowledge Graph.
The Knowledge Graph contains Dataset Summaries that answer questions such as the following (a sample query appears after the list):
- which RDF types are used in the dataset?
- for each of those types, how many resources does the dataset contain?
- which predicates are used in the dataset?
- for each of those predicates, how many subjects have it?
- similarly, how many subjects of each type have the predicate?
- which URI prefixes does the dataset link to?
- for each of those prefixes, which match known terminology sources?
- for each of those sources, how many outgoing links to them does the dataset have?
- (and more)
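For example, the first two questions can be answered with a query along these lines (a sketch; the underlying VoID model is documented under Analyzers below):

PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?class ?entities WHERE {
  # Which RDF types does this dataset use, and how many resources of each?
  <http://data.bibliotheken.nl/id/dataset/rise-alba> void:classPartition [
    void:class ?class;
    void:entities ?entities
  ].
}
ORDER BY DESC(?entities)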
The Summaries can be consulted by users such as data platform builders to help them find relevant datasets.
The Knowledge Graph is built on top of the Dataset Register, which contains dataset descriptions as supplied by their owners. These descriptions include distributions, i.e. URLs where the data can be retrieved.
To build the Summaries, the Knowledge Graph Pipeline applies SPARQL queries to RDF distributions, either directly, in the case of SPARQL endpoints, or after first loading the data, in the case of RDF data dumps. Where needed, the SPARQL results are post-processed in code.
This pipeline:
- is RDF-based, so it is limited to datasets that provide at least one valid RDF distribution;
- will skip RDF distributions that contain invalid data.
The pipeline produces a set of Dataset Summaries. VoID is used as the data model for these Summaries.
The overall size of the dataset: the number of triples and the numbers of distinct subjects, predicates, literal objects and URI objects.
<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
  void:triples 6119677;
  void:distinctSubjects 53434;
  void:properties 943;
  nde:distinctObjectsLiteral 2125;
  nde:distinctObjectsURI 32323.
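A minimal sketch of the kind of CONSTRUCT query an Analyzer could run against a distribution to produce these triples (the dataset URI is illustrative, and the pipeline's actual queries may differ):

PREFIX void: <http://rdfs.org/ns/void#>
CONSTRUCT {
  <http://data.bibliotheken.nl/id/dataset/rise-alba>
    void:triples ?triples;
    void:distinctSubjects ?subjects;
    void:properties ?properties.
}
WHERE {
  # Aggregate over all triples in the distribution.
  SELECT (COUNT(*) AS ?triples)
         (COUNT(DISTINCT ?s) AS ?subjects)
         (COUNT(DISTINCT ?p) AS ?properties)
  WHERE { ?s ?p ?o }
}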
The RDF subject classes that occur in the dataset, and for each class, the number of instances.
<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
  void:classPartition [
    void:class schema:VisualArtWork;
    void:entities 312000;
  ],
  [
    void:class schema:Person;
    void:entities 980;
  ].
The predicates that occur in the dataset, and for each predicate, the number of entities that have that predicate as well as the number of distinct objects.
<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
  void:propertyPartition [
    void:property schema:name;
    void:entities 20300; # 20,300 resources have a schema:name.
    void:distinctObjects 20000; # These resources have a total of 20,000 unique names.
  ],
  [
    void:property schema:birthDate;
    void:entities 19312;
    void:distinctObjects 19312;
  ].
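An Analyzer could derive such property partitions with an aggregating CONSTRUCT along these lines (a sketch, not the pipeline's literal query):

PREFIX void: <http://rdfs.org/ns/void#>
CONSTRUCT {
  <http://data.bibliotheken.nl/id/dataset/rise-alba> void:propertyPartition [
    void:property ?p;
    void:entities ?entities;
    void:distinctObjects ?objects
  ].
}
WHERE {
  # One partition (a fresh blank node) per predicate.
  SELECT ?p (COUNT(DISTINCT ?s) AS ?entities) (COUNT(DISTINCT ?o) AS ?objects)
  WHERE { ?s ?p ?o }
  GROUP BY ?p
}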
The predicates per subject class, and for each predicate, the number of entities that have that predicate as well as the number of distinct objects.
Nest a void:propertyPartition in a void:classPartition:
<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
  void:classPartition [
    void:class schema:Person;
    void:propertyPartition [
      void:property schema:name; # This partition is about schema:Persons with a schema:name.
      void:entities 155; # 155 persons have a name.
      void:distinctObjects 205; # These 155 persons have a total of 205 unique names, because some persons have multiple names.
    ],
    [
      void:property schema:birthDate;
      void:entities 76;
      void:distinctObjects 76;
    ]
  ],
  [
    void:class schema:VisualArtWork;
    void:propertyPartition [
      void:property schema:name;
      void:entities 1200;
      void:distinctObjects 1200;
    ],
    [
      void:property schema:image;
      void:entities 52;
      void:distinctObjects 20;
    ]
  ].
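The nested structure can be queried back out of the Knowledge Graph, for example to list property partitions per class (one of the example queries linked above):

PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?dataset ?class ?property ?entities WHERE {
  ?dataset void:classPartition [
    void:class ?class;
    void:propertyPartition [
      void:property ?property;
      void:entities ?entities
    ]
  ].
}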
Outgoing links to terminology sources in the Network of Terms, modelled as void:Linksets:
[] a void:Linkset;
  void:subjectsTarget <http://data.bibliotheken.nl/id/dataset/rise-alba>;
  void:objectsTarget <http://data.bibliotheken.nl/id/dataset/persons>;
  void:triples 434.

[] a void:Linkset;
  void:subjectsTarget <http://data.bibliotheken.nl/id/dataset/rise-alba>;
  void:objectsTarget <https://data.cultureelerfgoed.nl/term/id/cht>;
  void:triples 9402.
Matching is done against a fixed list of URI prefixes: those from the Network of Terms, supplemented by a custom list in the pipeline itself.
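Such a prefix match could be expressed in SPARQL roughly as follows (the prefixes shown are illustrative; the pipeline uses its own lists):

SELECT ?prefix (COUNT(?o) AS ?links)
WHERE {
  # Illustrative prefixes; the real lists come from the Network of Terms and the pipeline.
  VALUES ?prefix {
    "https://data.cultureelerfgoed.nl/term/id/"
    "http://vocab.getty.edu/aat/"
  }
  ?s ?p ?o.
  FILTER(isIRI(?o) && STRSTARTS(STR(?o), ?prefix))
}
GROUP BY ?prefix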
The vocabularies that the dataset’s predicates refer to:
<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
  void:vocabulary <http://schema.org>, <http://xmlns.com/foaf/0.1/>.
Licenses that apply to resources in the dataset.
<https://example.com/dataset> a void:Dataset;
  void:subset [
    dcterms:license <http://creativecommons.org/publicdomain/mark/1.0/>;
    void:triples 120;
  ],
  [
    dcterms:license <http://creativecommons.org/publicdomain/zero/1.0/>;
    void:triples 74;
  ].
All declared RDF distributions are validated:
- SPARQL endpoints are tested with a simple SELECT * { ?s ?p ?o } LIMIT 1 query;
- RDF data downloads are tested with an HTTP HEAD request.
If the distributions are valid, they are stored in void:sparqlEndpoint and/or void:dataDump triples:
<https://lod.uba.uva.nl/UB-UVA/Books>
  void:sparqlEndpoint <https://lod.uba.uva.nl/UB-UVA/Catalogue/sparql/>;
  void:dataDump <https://lod.uba.uva.nl/_api/datasets/UB-UVA/Books/download.nt.gz?>.
The Schema.org ontology supplements VoID with additional details about the distributions, retrieved from the HTTP HEAD response where available:
<https://lod.uba.uva.nl/_api/datasets/UB-UVA/Books/download.nt.gz?>
  <https://schema.org/dateModified> "2023-11-03T23:55:38.000Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>;
  <https://schema.org/contentSize> 819617127.

[] a <https://schema.org/Action>;
  <https://schema.org/target> <https://lod.uba.uva.nl/UB-UVA/Catalogue/sparql/>;
  <https://schema.org/result> <https://lod.uba.uva.nl/UB-UVA/Catalogue/sparql/>.

[] a <https://schema.org/Action>;
  <https://schema.org/target> <https://lod.uba.uva.nl/_api/datasets/UB-UVA/Books/download.nt.gz?>;
  <https://schema.org/result> <https://lod.uba.uva.nl/_api/datasets/UB-UVA/Books/download.nt.gz?>.
If a distribution is invalid, a schema:error triple indicates the HTTP status code:
[] a <https://schema.org/Action>;
  <https://schema.org/target> <https://www.openarchieven.nl/foundlinks/linkset/33ff3fa4744db564807b99dbc4a3d012.nt.gz>;
  <https://schema.org/error> <https://www.w3.org/2011/http-statusCodes#NotFound>.
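This makes broken distributions easy to find in the Knowledge Graph, for example with a query like (a sketch):

PREFIX sdo: <https://schema.org/>
SELECT ?distribution ?error WHERE {
  # Validation Actions that recorded an error for a distribution.
  [] a sdo:Action;
    sdo:target ?distribution;
    sdo:error ?error.
}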
Example resources from the dataset are stored as void:exampleResource triples:

<http://data.bibliotheken.nl/id/dataset/rise-alba> a void:Dataset;
  void:exampleResource <http://data.bibliotheken.nl/doc/alba/p418213178>,
    <http://data.bibliotheken.nl/doc/alba/p416673600>.
To run the pipeline yourself, start by cloning this repository. Then execute:
npm install
npm run dev
The Dataset Summaries output will be written to the output/ directory.
The pipeline consists of the following steps:
1. Select dataset descriptions with RDF distributions from the Dataset Register.
2. If a dataset has no SPARQL endpoint distribution, load the data from an RDF dump distribution, if available.
3. Apply Analyzers, either to the dataset provider's SPARQL endpoint or to our own endpoint where we loaded the data. Analyzers are SPARQL CONSTRUCT queries, wrapped in code where needed to extract more detailed information. They output their results as triples in the VoID vocabulary.
4. Write the analysis results to local files and a triple store.