Skip to content

Weaving SQL databases into graph data.

License

Notifications You must be signed in to change notification settings

occrp-attic/loom

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Loom

loom is a command-line tool that maps data from its source table structure to a common data model, defined through JSON Schema. Once data has been modeled into such objects, it is stored as a set of statements in a SQL database and eventually indexed in ElasticSearch.

loom is the backend component for spindle, a web-based graph data browser/editor tool that uses the same data model.

Design

The design goal of loom is to accept data from many different sources and in many different formats, and to integrate it into a single, coherent data model that can be used for analysis.

Data integration

Imagine, for example, trying to investigate a senior politician. You might find data about that person on WikiData, by scraping a parliament's web site, extracting data from expense reports and by looking up company ownership records in a corporate registry.

The purpose of loom is to read data from all of these sources, and to translate each source record into a (partially-filled) JSON object representing a person. This object may have nested items, such as information about party posts, company directorships etc.

loom will also provide means for de-duplicating entities, so that all the different records about the politician coming from various sources will be merged into a single, coherent and complete profile. This is made possible by splitting the information into atomic statements (often called triples or quads, learn more).

Indexing

Using statements, the information can also be re-constructed in different ways than it was originally submitted. For example, after importing a list of people with nested information about what companies they control, you could very easily invert that into a list of companies with nested information about the people who control them.

This aspect is used when indexing the data: loom will go through all entity types and try and generate a nested representation of each entity it finds. This means that if you import a record about a person that controls a company, you will end up with two indexed objects: the person (with a reference to the company) and the company (with a reference to the person).

Design constraints

The statement tables generated by loom are very similar to an RDF triple store. It may be tempting to think about using them to perform recursive graph queries (Show me all the politicians from Germany that have companies which have subsidiaries in the Cayman Islands).

This, however, would quickly lead to an explosion in the join cross-product and kill any SQL database. It is not a design goal of loom to support this. We should, however, provide an export facility to generate RDF data which can be imported into a triple store and queried there.

Configuration

All of loom is controlled via two types of configuration files: a core configuration (which sets the destination database, available data types, and the search index), and a set of source specifications (spec.yaml, which define the source database for data mapping, how to query the source data, source details and, most importantly, a mapping between source data structure and the JSON schema-defined model).

An example core configuration (passed in via the --config argument) would look like this:

# A PostgreSQL databases should be provided. It can be empty (statement
# tables will be created automatically). Since loom uses maintenance functions
# (COPY) to load data, you may need to provide superuser access to the target
# database.
database: postgresql://localhost/statements

# ElasticSearch indexing destination. The index does not need to exist prior to
# running loom.
elastic_host: localhost:9200
elastic_index: graph

# This is the schema registry, which will be used to determine short-hand
# aliases for specific types. All schemas listed here will be indexed to
# ElasticSearch.
schemas:
    company: https://schema.occrp.org/generic/company.json#

After setting these base parameters, a data source can be specified like this:

# The ODS database is expected to contain the source tables referenced below
# (fo_companies_director, fo_companies_company).
database: postgresql://localhost/source_database

# Source metadata, limited to three basic fields for the moment.
source:
    slug: foo_companies
    title: "Foo Country Company Registry"
    url: http://registry.gov.foo/companies

# Source data tables, and how to join them. Tables can also be assigned an
# alias to allow for recursive joins (i.e. for parent-child relationships).
tables:
    - fo_companies_company
    - fo_companies_director
joins:
    - foo_companies_company.id: fo_companies_director.company_id

# Outputs define the set of queries to be run, alongside with a mapping that
# is used to translate the resulting records into JSON schema objects:
outputs:
    # Every output has a name, which is used only for internal purposes:
    demo:
        schema:
            $ref: https://schema.occrp.org/generic/company.json#
        mapping:
            # Mapping for a single field in the destination schema:
            name:
                column: foo_companies_company.name
                # Some transformations can be applied:
                transforms:
                    - clean
                    - latinize
            company_id:
                columns: foo_companies_company.id
                format: 'urn:foo:%s'
            directors:
                # Nested object, which uses the same mapping method:
                mapping:
                    name:
                        column: fo_companies_director.name

The content of the mapping section of this specification file is based on a separate Python library called jsonmapping. Refer to that package's documentation for information about how to perform more complex mappings and transformations on the source data.

Installation

Before installing loom, make sure you have both dependencies - PostgreSQL and ElasticSearch - installed.

If you wish to use the loom command line tool, you can install the application into a Python virtual environment like this:

$ pip install git+https://github.com/occrp/loom

To do development on the tool, you should instead check out a local copy:

$ git clone [email protected]:occrp/loom.git
$ cd loom
$ make install
$ make test

Instead of executing the Makefile commands (which create a virtual environment and install the necessary dependencies), you can also run these steps manually.

Usage

A prerequisite for using loom is that source data needs to be stored in a PostgreSQL database. The specific layout of the source data tables is not important, since the JSON object mapping will be applied.

After installing loom, a command-line tool is available. You can use the --help argument to learn more about it's functioning.

A basic sequence of commands might look like this:

# translate the source data records into the loom data store as statements:
$ loom -c config.yaml map spec.yaml
# index the statements into ElasticSearch:
$ loom -c config.yaml index --source foo_companies
# delete statements from the data store:
$ loom -c config.yaml flush --source foo_companies

Similar work and references

loom is heavily inspired by Linked Data and RDF. If you're interested in similar tools in that ecosystem, check out:

License

loom is free software; it is distributed under the terms of the Affero General Public License, version 3.

About

Weaving SQL databases into graph data.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published