The DataTracer library incorporates a REST API that enables interaction with the DataTracer Solvers over HTTP.
The current version of the REST API supports the following functionalities:
- Primary Key Detection
- Foreign Key Detection
- Column Mapping
- Update Metadata
Starting the REST API locally in development mode is as simple as executing a single command.
Once you have installed `datatracer` in your system, just open a console and run:

```bash
datatracer api
```
You will see a message similar to this in your console:
```
################################
Starting the DataTracer REST API
################################
Serving on 127.0.0.1:8000...
```
After this point, the application is ready to start receiving HTTP requests on port 8000.
The DataTracer API has a few arguments that can be tweaked by writing a YAML configuration
file and passing it to the `datatracer api` call.
The configurable options are:
- `host`: IP address at which the API will be listening.
- `port`: Port at which the API will be listening.
- `primary_key_solver`: Name of the solver to use for primary key detection.
- `foreign_key_solver`: Name of the solver to use for foreign key detection.
- `column_map_solver`: Name of the solver to use for column mapping.
- `column_map_threshold`: Minimum relevance required to include a field in the column mapping.
You will find an example of such a YAML file with the default values already set in this same directory: `config.yml`
If you want to modify any of these values, edit the file and then make sure to pass it
to the `datatracer api` command as follows:
```bash
datatracer api -c config.yml
```
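For reference, a minimal configuration could look like the sketch below. The solver names and the threshold value are illustrative placeholders, not the actual defaults; check the `config.yml` shipped in this directory for the real values.

```yaml
# Illustrative sketch of a DataTracer API configuration file.
# Solver names and threshold are placeholders; see config.yml for the defaults.
host: 127.0.0.1                            # IP address at which the API will listen
port: 8000                                 # Port at which the API will listen
primary_key_solver: a.primary.key.solver   # name of the primary key detection solver
foreign_key_solver: a.foreign.key.solver   # name of the foreign key detection solver
column_map_solver: a.column.map.solver     # name of the column mapping solver
column_map_threshold: 0.1                  # minimum relevance to include a field (example value)
```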
Throughout the endpoint descriptions we will be mentioning two concepts: the table specification and the table identifier.
A table specification consists of a dictionary that contains at least one of these two fields:
- `path`: Path to a CSV file.
- `data`: CSV data passed as a string.

And then one or more additional fields that uniquely identify this table, such as `id`, `name`, `system`...

This is an example of a table specification that uses a `path`:
```
{
    "path": "path/to/my.csv",
    "id": 1234,
    "name": "my_table",
    "system": "my_system"
}
```
And this is another example that uses `data` instead of `path`:
```
{
    "data": "a_field,another_field\n1,4\n2,5\n3,6\n",
    "id": 4567,
    "name": "my_other_table"
}
```
A table identifier consists of a dictionary that is the same as the
corresponding table specification but with the `data` field removed.
When working with multiple table specifications, the table identifier uniquely identifies one of them.
For example, these would be valid table identifiers for the table specifications shown above:
```
[
    {
        "path": "path/to/my.csv",
        "id": 1234,
        "name": "my_table",
        "system": "my_system"
    },
    {
        "id": 4567,
        "name": "my_other_table"
    }
]
```
Table specifications are used in the API inputs every time data
needs to be passed, and table identifiers are always used in the
API responses to refer to the corresponding input table, as well as in
some requests, like `column_mapping`, which need to refer to one of the
input tables somewhere else in the request.
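For instance, a table specification that embeds the CSV contents through the `data` field can be built from a local file with a few lines of Python. This is just a minimal sketch; the file path and the identifying fields are illustrative:

```python
# Minimal sketch: build a table specification that passes the CSV contents
# inline through the "data" field instead of pointing at a file with "path".
# The file path and the id/name values below are illustrative.
with open('path/to/my.csv') as csv_file:
    table_spec = {
        'data': csv_file.read(),
        'id': 1234,
        'name': 'my_table',
    }

# The matching table identifier is the same dictionary without the "data" field
table_identifier = {key: value for key, value in table_spec.items() if key != 'data'}
```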
The `primary_key_detection` endpoint uses a pre-trained DataTracer solver to solve a Primary Key detection problem.
The input to this endpoint is just a field called `tables` which contains
a list of table specifications.
The output is a field called `primary_keys` which contains the corresponding
table identifiers with the `primary_key` field added to each one of them.
Example input:

```
{
    "tables": [
        {
            "id": 1234,
            "name": "a_table",
            "path": "a/path/to/a.csv"
        },
        {
            "id": 4567,
            "name": "another_table",
            "path": "a/path/to/another.csv"
        }
    ]
}
```
Example output:

```
{
    "primary_keys": [
        {
            "id": 1234,
            "name": "a_table",
            "path": "a/path/to/a.csv",
            "primary_key": "a_field_name"
        },
        {
            "id": 4567,
            "name": "another_table",
            "path": "a/path/to/another.csv",
            "primary_key": "another_field_name"
        }
    ]
}
```
The `foreign_key_detection` endpoint uses a pre-trained DataTracer solver to solve a Foreign Key detection problem.
The input to this endpoint is just a field called `tables` which contains
a list of table specifications.
The output is a field called `foreign_keys` which contains a list of
foreign key specifications.
Each foreign key specification consists of a dictionary that contains four fields:
- `table`: A table identifier.
- `field`: The name of the field which is the foreign key in the given `table`.
- `ref_table`: The table identifier of the table which is referenced by the foreign key.
- `ref_field`: The name of the field which is referenced by the foreign key.
Example input:

```
{
    "tables": [
        {
            "id": 1234,
            "name": "a_table",
            "path": "a/path/to/a.csv"
        },
        {
            "id": 4567,
            "name": "another_table",
            "path": "a/path/to/another.csv"
        }
    ]
}
```
Example output:

```
{
    "foreign_keys": [
        {
            "table": {
                "id": 4567,
                "name": "another_table",
                "path": "a/path/to/another.csv"
            },
            "field": "another_field",
            "ref_table": {
                "id": 1234,
                "name": "a_table",
                "path": "a/path/to/a.csv"
            },
            "ref_field": "a_field_name"
        }
    ]
}
```
The `column_mapping` endpoint uses a pre-trained DataTracer solver to solve a Column Mapping problem.
The input to this endpoint contains three fields:
- `tables`: A list of table specifications.
- `target_table`: A table identifier that uniquely identifies one of the passed `tables`.
- `target_field`: The name of one of the fields in the given `target_table`.
The output contains three fields:
- `target_table`: The table identifier of the table that is being analyzed.
- `target_field`: The name of the field that is being analyzed.
- `column_mappings`: A list of column mapping specifications.
Each column mapping specification consists of a dictionary that contains two fields:

- `table`: A table identifier.
- `field`: The name of a field in that table.
Example input:

```
{
    "tables": [
        {
            "id": 1234,
            "name": "a_table",
            "path": "a/path/to/a.csv"
        },
        {
            "id": 4567,
            "name": "another_table",
            "path": "a/path/to/another.csv"
        }
    ],
    "target_table": {
        "id": 1234,
        "name": "a_table"
    },
    "target_field": "a_field"
}
```
Example output:

```
{
    "target_table": {
        "id": 1234,
        "name": "a_table",
        "path": "a/path/to/a.csv"
    },
    "target_field": "a_field",
    "column_mappings": [
        {
            "table": {
                "id": 4567,
                "name": "another_table",
                "path": "a/path/to/another.csv"
            },
            "field": "some_other_field"
        },
        {
            "table": {
                "id": 1234,
                "name": "a_table",
                "path": "a/path/to/a.csv"
            },
            "field": "some_field"
        }
    ]
}
```
The `update_metadata` endpoint updates a MetaData JSON by applying the outputs of the
`primary_key_detection`, `foreign_key_detection` and `column_mapping` endpoints.
The input to this endpoint contains two fields:
- `metadata`: Contents of a MetaData JSON. Alternatively, a string containing the path to a MetaData JSON can also be passed.
- `update`: JSON specification of what to update. The contents of this JSON can be the output of any of the other endpoints. Combining all of them in a single JSON is also supported.
The output is the contents of the updated metadata.
In this example we are updating the metadata by adding a constraint based on the column mapping output.
Example input:

```
{
    "metadata": {
        "tables": [
            {
                "id": 1234,
                "name": "a_table",
                "path": "a/path/to/a.csv",
                "fields": [
                    {
                        "name": "a_field",
                        "type": "number",
                        "subtype": "float"
                    },
                    {
                        "name": "some_field",
                        "type": "number",
                        "subtype": "float"
                    }
                ]
            },
            {
                "id": 4567,
                "name": "another_table",
                "path": "a/path/to/another.csv",
                "fields": [
                    {
                        "name": "another_field",
                        "type": "number",
                        "subtype": "float"
                    },
                    {
                        "name": "some_other_field",
                        "type": "number",
                        "subtype": "float"
                    }
                ]
            }
        ]
    },
    "update": {
        "target_table": {
            "id": 1234,
            "name": "a_table"
        },
        "target_field": "a_field",
        "column_mappings": [
            {
                "table": {
                    "id": 4567,
                    "name": "another_table",
                    "path": "a/path/to/another.csv"
                },
                "field": "some_other_field"
            },
            {
                "table": {
                    "id": 1234,
                    "name": "a_table",
                    "path": "a/path/to/a.csv"
                },
                "field": "some_field"
            }
        ]
    }
}
```
Example output:

```
{
    "tables": [
        {
            "id": 1234,
            "name": "a_table",
            "path": "a/path/to/a.csv",
            "fields": [
                {
                    "name": "a_field",
                    "type": "number",
                    "subtype": "float"
                },
                {
                    "name": "some_field",
                    "type": "number",
                    "subtype": "float"
                }
            ]
        },
        {
            "id": 4567,
            "name": "another_table",
            "path": "a/path/to/another.csv",
            "fields": [
                {
                    "name": "another_field",
                    "type": "number",
                    "subtype": "float"
                },
                {
                    "name": "some_other_field",
                    "type": "number",
                    "subtype": "float"
                }
            ]
        }
    ],
    "constraints": [
        {
            "constraint_type": "lineage",
            "fields_under_consideration": [
                {
                    "table": "a_table",
                    "field": "a_field"
                }
            ],
            "related_fields": [
                {
                    "table": "a_table",
                    "field": "some_field"
                },
                {
                    "table": "another_table",
                    "field": "some_other_field"
                }
            ]
        }
    ]
}
```
In this section we will be showing a few examples of how to interact with the DataTracer REST API using the Python `requests` library.
Before we start, let's make sure that we have started an instance of the DataTracer API:
```bash
datatracer api
```
After the API has started, let's open an interactive Python session, such as a Jupyter notebook, and start playing with it.
First we will make sure to have some data available to play with.
```python
from datatracer import get_demo_data

get_demo_data(force=True)
```
Once we have generated the demo data, let's choose a dataset and prepare a list of its tables.
```python
import os

from datatracer import load_dataset

dataset_path = 'datatracer_demo/classicmodels/'
metadata = load_dataset(dataset_path)[0]
tables = metadata.data['tables']
for table in tables:
    table["path"] = os.path.join(dataset_path, table['name'] + '.csv')
```
This will create a list of dictionaries that contain, among other things, the names of the tables of the dataset and the paths to the corresponding CSV files:
```
[{'name': 'productlines',
  'path': 'datatracer_demo/classicmodels/productlines.csv'},
 {'name': 'payments', 'path': 'datatracer_demo/classicmodels/payments.csv'},
 {'name': 'employees', 'path': 'datatracer_demo/classicmodels/employees.csv'},
 {'name': 'customers', 'path': 'datatracer_demo/classicmodels/customers.csv'},
 {'name': 'orders', 'path': 'datatracer_demo/classicmodels/orders.csv'},
 {'name': 'offices', 'path': 'datatracer_demo/classicmodels/offices.csv'},
 {'name': 'products', 'path': 'datatracer_demo/classicmodels/products.csv'},
 {'name': 'orderdetails',
  'path': 'datatracer_demo/classicmodels/orderdetails.csv'}]
```
With this list of tables we can start making requests to the DataTracer API.
Let's start by importing the `requests` library and passing the tables to the `primary_key_detection` endpoint:
```python
import requests

primary_keys = requests.post(
    'http://localhost:8000/primary_key_detection',
    json={
        'tables': tables
    }
).json()
```
This will return a list of primary keys in the format indicated above.
Similarly, we can obtain the list of foreign key candidates:
```python
foreign_keys = requests.post(
    'http://localhost:8000/foreign_key_detection',
    json={
        'tables': tables
    }
).json()
```
Finally, we can try to solve a column mapping problem to obtain the list of columns that are most likely to have participated in the generation of a column we are interested in.
For example, let's solve the column mapping problem for the `quantityInStock` column
from the `products` table:
```python
column_mappings = requests.post(
    'http://localhost:8000/column_mapping',
    json={
        'tables': tables,
        'target_table': {
            'name': 'products'
        },
        'target_field': 'quantityInStock'
    }
).json()
```
The result will be a JSON indicating the column mapping for the requested field:
```
{
    'target_table': {
        'name': 'products'
    },
    'target_field': 'quantityInStock',
    'column_mappings': [
        {
            'table': {
                'name': 'orderdetails',
                'path': 'datatracer_demo/classicmodels/orderdetails.csv'
            },
            'field': 'orderLineNumber'
        }
    ]
}
```
Once we have all of this, we can update the metadata JSON with the new contents
by passing each output to the `update_metadata` endpoint.
For example, let's add the column mappings we obtained as a lineage constraint:
```python
metadata_dict = metadata.data
metadata_dict = requests.post(
    'http://localhost:8000/update_metadata',
    json={
        'metadata': metadata_dict,
        'update': column_mappings
    }
).json()
```
After this, the `metadata_dict` variable will contain the metadata dictionary with a new
constraint entry indicating the relationships previously discovered by the `column_mapping`
endpoint.
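To double-check the result, we can inspect the new `constraints` entry or write the updated metadata back to disk. The snippet below is a minimal sketch: the `metadata_updated.json` file name is only an example, and the last call assumes that, as described above, the outputs of the three detection endpoints can be merged into a single `update` JSON.

```python
import json

# Inspect the lineage constraint that the update_metadata endpoint added
print(json.dumps(metadata_dict['constraints'], indent=2))

# Persist the updated metadata (the output file name is illustrative)
with open('metadata_updated.json', 'w') as metadata_file:
    json.dump(metadata_dict, metadata_file, indent=2)

# Sketch: since the endpoint outputs use disjoint top-level keys, the three
# results can also be combined into a single update and applied in one call
combined_update = {**primary_keys, **foreign_keys, **column_mappings}
metadata_dict = requests.post(
    'http://localhost:8000/update_metadata',
    json={
        'metadata': metadata.data,
        'update': combined_update
    }
).json()
```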