
Understanding Vocabularies


The role of vocabularies within the semantic web ecosystem is largely the same as their role anywhere: to define a body of words that can be used to make statements about a particular subject or to describe something. In the case of semantic web vocabularies, these "words" are instead referred to as terms and they help us to make unambiguous statements while conferring benefits such as increased interoperability with other data sources.

This interoperability, strangely enough, mirrors the more general benefits that vocabularies bring when we share knowledge in the real world: software engineers wouldn't get very far sharing their thoughts if they didn't have a shared language to help them understand one another! Exactly the same principle applies when adopting linked data vocabularies - it helps ensure that when you share your data, others will understand its exact meaning and will have an easier time incorporating it into their own systems.

A vocabulary could also be something that you create and publish yourself - perhaps you are operating in a very niche domain and you want to formally define how data from that domain should be described. There are also many ready-to-use vocabularies that you can incorporate into your own to save time and effort.

Consider the following data, which defines two items in a collection using off-the-shelf vocabularies.

[
    {
        "@id": "http://www.example.com/1",
        "@type": "http://dbpedia.org/resource/Table",
        "http://schema.org/name": "Table 1"
    },
    {
        "@id": "http://www.example.com/2",
        "@type": "http://www.w3.org/ns/csvw#Table",
        "http://schema.org/name": "Table 2"
    }
]

Here, we have used the JSON-LD format to serialise the data and have brought in several terms from different vocabularies. If you're confused about the presence of web addresses here, they're IRIs and are a core part of how the semantic web ecosystem ensures that terms are universally unique (because web addresses are also universally unique). See the RDF Guide for more detail.

Let's take a closer look at what's going on in the data above.

  • We've used the @id field (from JSON-LD syntax) to indicate a universally unique identifier for each item in the collection. No other database in the world will have these IDs unless they are talking about exactly the same thing as we are here - the use of our bespoke namespace (in this case, example.com) ensures this.
  • We've used the @type field (also from JSON-LD syntax) to indicate that both of these items are types of table. We've used two universally unique identifiers from two separate vocabularies (dbpedia and CSVW) to explicitly define what we mean when we say "table". They are, in fact, two very different types of table, and by using these vocabularies we've captured that fact perfectly.
  • We've used the schema.org vocabulary's definition for name to give each item a name.

By doing this, we have completely removed any ambiguity from our data. Anyone who views it has all the information they need to know that the first item in the list refers to something you'd find in your dining room, while the second refers to something you might find in an Excel spreadsheet - the @type field conveys this meaning because it points to terms within vocabularies that define that meaning precisely. We have made our data into a more general-purpose, application-agnostic artefact, making it more interoperable and easier to build automation around.
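
As a brief aside (this is an illustrative sketch rather than part of the original example), JSON-LD also lets you declare a @context that maps short, friendly keys onto the full term IRIs, so the same data can be written more compactly without losing any of its meaning:

{
    "@context": {
        "name": "http://schema.org/name"
    },
    "@graph": [
        {
            "@id": "http://www.example.com/1",
            "@type": "http://dbpedia.org/resource/Table",
            "name": "Table 1"
        },
        {
            "@id": "http://www.example.com/2",
            "@type": "http://www.w3.org/ns/csvw#Table",
            "name": "Table 2"
        }
    ]
}

Expanding this document yields exactly the same statements as the version above - the context just spares you from repeating the full IRIs on every record.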

As an aside to all of this, our use of JSON-LD means the data is compatible with the RDF data model, making it super flexible and easy to extend, as well as meaning that no up-front or ongoing discussion or collaboration needs to happen about schemas. It's worth noting also that the use of vocabularies does not necessarily mean that you need to use the RDF data model - other formats such as the equally under-used [JSON-Schema](https://json-schema.org/) are compatible with vocabulary terms too.
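
For instance, one way you might do this - purely a sketch, not an established convention - is to use the full term IRIs as property names in an otherwise ordinary JSON Schema:

{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "properties": {
        "http://schema.org/name": { "type": "string" }
    },
    "required": ["http://schema.org/name"]
}

Consumers would still need to know that the keys are term IRIs, so this is much weaker than JSON-LD's built-in mapping, but it illustrates that the vocabulary terms themselves aren't tied to any one data model.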

Given all these great benefits and how little extra effort was required to get them, it seems obvious that this is the right way to provide data... and yet very few modern web applications actually use this approach. Normally, you instead end up with APIs providing data that looks more like this:

[
    {
        "id": "P12384864351",
        "name": "Table 1",
        "type": "Furniture Table"
    },
    {
        "id": "P87684244411",
        "name": "Table 2",
        "type": "Data Table"
    }
]

Whatever schema is going on behind this data is something that's been figured out internally by the data provider from first principles. It looks perfectly reasonable, but in reality this data is essentially stove-piped and the choice of keys is completely arbitrary: we have lost all of the benefits previously discussed and are creating a lot more work for anyone trying to write an integration (how many different ways are there of saying "Furniture Table"? I guess the consumer of our API is about to find out!).

Data providers seldom consider the ongoing needs of their users (the consumers of their data) and are often guilty of taking a wild-west, anything-goes approach to data modelling. At worst, schemas may be inconsistent and only passively defined at the API layer through a combination of endpoints and type definitions enshrined somewhere within the application layer. If you're lucky, the data provider will have put a bit more rigour around how they communicate their schema, maybe using tools such as OpenAPI/Swagger or a schema-first approach like GraphQL - but even those approaches do not guarantee consistency across datasets. Regardless of how well you define your schema, the benefits of having one only really extend to the point of data provision - after you've decoupled the data from its origin, you've also decoupled it from its meaning and all of that lovely schema information has been left behind.

Many of us have also played the role of the consumer and actually experienced the pain of trying to integrate data from one or more systems into our own. It's baffling that the problem persists under these conditions, given the average software engineer's penchant for problem solving. Perhaps it's because much of the solutioneering around these issues is always one step away from us - if we're providing the data, we see our responsibility as having ended once we've done the providing. If we're the consumer, we can't really have any influence over the provider because it seems completely audacious to start dictating to them how they should be providing their data: we just see it as something we have to put up with and the solution always seems like someone else's problem.

Hopefully this clarifies the motivation behind linked data vocabularies, so now let's look at how you actually go about using them.

Finding off-the-shelf vocabularies

Whilst it is possible to write your own vocabulary completely from scratch using a bespoke namespace, in reality you will normally find yourself standing on the shoulders of the giants that came before you, combining your own custom terms with those created by others. Many of the terms that we all use to describe things transcend domain specifics: no matter how niche the subject matter you're describing, metadata tends to remain consistent (creation dates, authors, names, descriptions and so on) and there are plenty of pre-made vocabs that have you covered when it comes to that sort of thing.
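
For example, a minimal sketch of describing a document's metadata using nothing but general-purpose schema.org terms (the example.com identifier and the values are just placeholders) might look like this:

{
    "@id": "http://www.example.com/reports/42",
    "http://schema.org/name": "Quarterly Report",
    "http://schema.org/description": "A summary of activity for the quarter.",
    "http://schema.org/dateCreated": "2024-01-15",
    "http://schema.org/author": {
        "http://schema.org/name": "Jane Doe"
    }
}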

So where do you go about finding pre-made vocabs that give you the terms you need?

There are a number of good starting points to begin your research. What you're ideally looking for is a widely used vocabulary that allows you to express your ideas, with bonus points if it happens to be a W3C recommendation. When you come to exchange your data with others, these kinds of vocab are the ones that are likely to require the least effort for others to integrate because there's a higher chance the person or system you're exchanging with also made those same choices.

A great place to start, in terms of familiarisation, is the RDFa Core Initial Context list. This is a set of vocabularies that are intended to come pre-defined within RDFa parsers in order to save the user from having to type them out every time they write a document. As a nice side effect, the initial context list also serves as a great entry point for discovering popular and recommended vocabularies. Most of the items in the list are general-purpose vocabularies rather than being particularly niche or domain-specific - for those kinds of vocabs, you will need to look elsewhere.

DBpedia is one of the largest collections of linked open data in the world, and can be an excellent source of discovery and inspiration for linked data vocabularies relating to real-world concepts. Not only does DBpedia contain millions of records' worth of data, it also delivers a crowd-curated ontology with over 750 classes and 3000 properties which it uses to make statements about the records in the dataset. If, for example, you were working with data related to chemical substances, a great starting place might be to go and see what DBpedia does! Take their page about Nitrogen dioxide - at the time of writing, you can see that it has a class of dbo:ChemicalSubstance, which defines various properties a chemical substance might have - you could use all of this in your own dataset. Even better, the DBpedia ontology is publicly editable via the DBpedia Mappings project, so if a particular class or property is missing, you can add it. Something to bear in mind is that while the DBpedia ontology is a curated dataset, the resource data itself is all scraped from Wikipedia, so you may encounter a few mistakes and misclassifications while browsing.
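
To make that concrete, here's a minimal sketch of what reusing that class in your own data could look like (the example.com identifier is a placeholder, and dbo: expands to http://dbpedia.org/ontology/):

{
    "@id": "http://www.example.com/substances/no2",
    "@type": "http://dbpedia.org/ontology/ChemicalSubstance",
    "http://schema.org/name": "Nitrogen dioxide"
}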

The Linked Open Vocabularies directory is another resource you can use to find vocabularies - it was written as part of a research project to address the very problem of vocab discoverability and contains over 800 searchable vocabularies. This resource does come with a few caveats though: the application itself, as with many applications in the linked data world, is no longer actively maintained (at the time of writing, the last update to the codebase was in 2016) and neither is the list of vocabularies, so encountering dead links and low-quality entries is commonplace.

Publishing and Managing Vocabs

If you ever get to the stage of publishing your own vocab, the W3C Semantic Web Deployment Working Group prepared a note that details best practices for publishing vocabs, including established recipes and playbooks. Note that this particular guide mainly covers the publishing of vocabs, not the writing of them, but it does cover some best practices around things like the use of namespaces.

W3C also published some best practices around the management of vocabs.

Vocabularies vs Schemas vs Ontologies

One source of confusion within the semantic web ecosystem is the distinction between an "ontology" and a "vocabulary" and furthermore how either of these relate to "schemas". There are some opinions flying around about what the differences are...

  • The notion that ontologies are "more complex" than vocabularies.
  • The notion that schemas really only relate to the "shape" of data rather than its meaning.
  • The notion that a vocabulary is just the language used to identify ideas rather than to talk about their relationships to other things.

In truth though, the distinction between these terms isn't formally defined anywhere and they are, for the most part, interchangeable.

Learn more

Like many things in the semantic web world, open data vocabularies are often documented in a very aloof manner, which can make them difficult to understand for newcomers or for people who are just browsing. This site contains a small list of vocabulary guides which briefly details some of the more popular vocabs, providing descriptions and examples.
