loom is a command-line tool that maps data from its source table structure to a common data model defined through JSON Schema. Once data has been modeled into such objects, it is stored as a set of statements in a SQL database and eventually indexed in ElasticSearch.
loom is the backend component for spindle, a web-based graph data browser/editor tool that uses the same data model.
The design goal of loom is to accept data from many different sources and in many different formats, and to integrate it into a single, coherent data model that can be used for analysis.
Imagine, for example, trying to investigate a senior politician. You might find data about that person on WikiData, by scraping a parliament's website, by extracting data from expense reports, and by looking up company ownership records in a corporate registry.
The purpose of loom is to read data from all of these sources and to translate each source record into a (partially filled) JSON object representing a person. This object may have nested items, such as information about party posts, company directorships, etc.
loom will also provide a means of de-duplicating entities, so that all the different records about the politician coming from various sources can be merged into a single, coherent and complete profile. This is made possible by splitting the information into atomic statements (often called triples or quads).
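As a rough illustration (the field names here are made up for the example, not loom's actual storage schema), a source record about the politician might be broken down into statements like this:

# A sketch of statement decomposition; names are illustrative only.
record = {
    "id": "person:jane-doe",
    "name": "Jane Doe",
    "directorships": [{"company": "company:acme"}],
}

# Each statement asserts exactly one fact: (subject, predicate, object, source).
statements = [
    ("person:jane-doe", "name", "Jane Doe", "foo_companies"),
    ("person:jane-doe", "directorOf", "company:acme", "foo_companies"),
]

Because each statement stands on its own, records about the same entity from different sources can be merged by collecting all statements that share a subject.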
Using statements, the information can also be reconstructed in different ways than it was originally submitted. For example, after importing a list of people with nested information about what companies they control, you could very easily invert that into a list of companies with nested information about the people who control them.
This aspect is used when indexing the data: loom will go through all entity types and try to generate a nested representation of each entity it finds. This means that if you import a record about a person that controls a company, you will end up with two indexed objects: the person (with a reference to the company) and the company (with a reference to the person).
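Continuing the sketch from above (again with illustrative names rather than loom's actual schema), regrouping the statements by subject yields one nested object per entity, including the inverted view:

# Build one nested object per entity by grouping statements on their subject;
# a sketch, not loom's indexing code.
statements = [
    ("person:jane-doe", "name", "Jane Doe"),
    ("person:jane-doe", "directorOf", "company:acme"),
    ("company:acme", "name", "Acme Ltd."),
]

entities = {}
for subject, predicate, obj in statements:
    entities.setdefault(subject, {}).setdefault(predicate, []).append(obj)

# Invert directorOf so the company also references the person:
for subject, predicate, obj in statements:
    if predicate == "directorOf":
        entities.setdefault(obj, {}).setdefault("directors", []).append(subject)

# entities now contains two objects: the person (with a reference to the
# company) and the company (with a reference to the person).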
The statement tables generated by loom are very similar to an RDF triple store. It may be tempting to think about using them to perform recursive graph queries (show me all the politicians from Germany that have companies which have subsidiaries in the Cayman Islands). This, however, would quickly lead to an explosion in the join cross-product and kill any SQL database, and it is not a design goal of loom to support this. We should, however, provide an export facility to generate RDF data which can be imported into a triple store and queried there.
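To make the cross-product problem concrete, here is a sketch (plain Python over an in-memory statement list, not loom code) of what such a query requires; every hop in the graph is another self-join over the same statement table:

statements = [
    # (subject, predicate, object)
    ("person:1", "country", "Germany"),
    ("person:1", "directorOf", "company:1"),
    ("company:1", "subsidiary", "company:2"),
    ("company:2", "jurisdiction", "Cayman Islands"),
]

def lookup(subject, predicate):
    # One scan of the statement table per call - the equivalent of one
    # self-join in SQL.
    return [o for s, p, o in statements if s == subject and p == predicate]

# "Politicians from Germany whose companies have Cayman subsidiaries":
# three chained lookups, i.e. a three-way self-join of the statement table.
germans = [s for s, p, o in statements if p == "country" and o == "Germany"]
for person in germans:
    for company in lookup(person, "directorOf"):
        for subsidiary in lookup(company, "subsidiary"):
            if "Cayman Islands" in lookup(subsidiary, "jurisdiction"):
                print(person, company, subsidiary)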
All of loom is controlled via two types of configuration files: a core configuration (which sets the destination database, the available data types, and the search index), and a set of source specifications (spec.yaml files, which define the source database for data mapping, how to query the source data, source details and, most importantly, a mapping between the source data structure and the JSON Schema-defined model).
An example core configuration (passed in via the --config argument) would look like this:
# A PostgreSQL database should be provided. It can be empty (statement
# tables will be created automatically). Since loom uses maintenance functions
# (COPY) to load data, you may need to provide superuser access to the target
# database.
database: postgresql://localhost/statements

# ElasticSearch indexing destination. The index does not need to exist prior to
# running loom.
elastic_host: localhost:9200
elastic_index: graph

# This is the schema registry, which will be used to determine short-hand
# aliases for specific types. All schemas listed here will be indexed to
# ElasticSearch.
schemas:
  company: https://schema.occrp.org/generic/company.json#
After setting these base parameters, a data source can be specified like this:
# The ODS database is expected to contain the source tables referenced below
# (fo_companies_director, fo_companies_company).
database: postgresql://localhost/source_database

# Source metadata, limited to three basic fields for the moment.
source:
  slug: foo_companies
  title: "Foo Country Company Registry"
  url: http://registry.gov.foo/companies

# Source data tables, and how to join them. Tables can also be assigned an
# alias to allow for recursive joins (i.e. for parent-child relationships).
tables:
  - fo_companies_company
  - fo_companies_director
joins:
  - fo_companies_company.id: fo_companies_director.company_id

# Outputs define the set of queries to be run, along with a mapping that
# is used to translate the resulting records into JSON schema objects:
outputs:
  # Every output has a name, which is used only for internal purposes:
  demo:
    schema:
      $ref: https://schema.occrp.org/generic/company.json#
    mapping:
      # Mapping for a single field in the destination schema:
      name:
        column: fo_companies_company.name
        # Some transformations can be applied:
        transforms:
          - clean
          - latinize
      company_id:
        column: fo_companies_company.id
        format: 'urn:foo:%s'
      directors:
        # Nested object, which uses the same mapping method:
        mapping:
          name:
            column: fo_companies_director.name
The content of the mapping section of this specification file is based on a separate Python library called jsonmapping. Refer to that package's documentation for information about how to perform more complex mappings and transformations on the source data.
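As a rough sketch of the effect of the demo mapping above (the exact row keys and output shape depend on the schema and on jsonmapping, so these literals are illustrative only):

# Hypothetical input row and mapped output for the demo mapping; treat both
# shapes as an illustration, not as jsonmapping's exact formats.
source_row = {
    "fo_companies_company.id": 42,
    "fo_companies_company.name": "Acme Ltd.",
    "fo_companies_director.name": "Jane Doe",
}

mapped = {
    "name": "Acme Ltd.",                  # after clean and latinize
    "company_id": "urn:foo:42",           # via format: 'urn:foo:%s'
    "directors": [{"name": "Jane Doe"}],  # nested mapping
}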
Before installing loom, make sure you have both dependencies - PostgreSQL and ElasticSearch - installed.
If you wish to use the loom command-line tool, you can install the application into a Python virtual environment like this:
$ pip install git+https://github.com/occrp/loom
To do development on the tool, you should instead check out a local copy:
$ git clone [email protected]:occrp/loom.git
$ cd loom
$ make install
$ make test
Instead of executing the Makefile commands (which create a virtual environment and install the necessary dependencies), you can also run these steps manually.
A prerequisite for using loom is that source data needs to be stored in a PostgreSQL database. The specific layout of the source data tables is not important, since the JSON object mapping will be applied.
After installing loom, a command-line tool is available. You can use the --help argument to learn more about its functioning.
A basic sequence of commands might look like this:
# translate the source data records into the loom data store as statements:
$ loom -c config.yaml map spec.yaml
# index the statements into ElasticSearch:
$ loom -c config.yaml index --source foo_companies
# delete statements from the data store:
$ loom -c config.yaml flush --source foo_companies
loom is heavily inspired by Linked Data and RDF, and there are many similar tools in that ecosystem worth checking out.
loom is free software; it is distributed under the terms of the Affero General Public License, version 3.