We use the Gradle shadowJar plugin to build the project:
gradle shadowJar
- Make sure to run `python word2vec_server.py` beforehand (this will use port 10000 by default).
- Ensure your database properties are set correctly in `config.properties` in the main project folder. If this file doesn't exist, create it by copying `config.properties.example`.
- Have a functioning MySQL instance with the necessary data from each dataset pre-loaded. For example, have MySQL up and running, then create a database named `mas` and follow the instructions in the MAS dataset README.
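The checks above can be scripted. The following is a minimal sketch, not part of the project: the helper names `ensure_config` and `port_is_open` are hypothetical, and it assumes the word2vec server listens on a TCP port on `localhost`.

```python
import shutil
import socket
from pathlib import Path


def ensure_config(project_dir="."):
    """Create config.properties from config.properties.example if it is missing.

    Returns True if the file was created, False if it already existed.
    """
    config = Path(project_dir) / "config.properties"
    example = Path(project_dir) / "config.properties.example"
    if config.exists():
        return False
    shutil.copyfile(example, config)
    return True


def port_is_open(host="localhost", port=10000, timeout=1.0):
    """Return True if something accepts TCP connections on host:port.

    Assumption: word2vec_server.py serves over TCP on localhost.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if ensure_config("."):
        print("Created config.properties; edit the database settings before running.")
    if not port_is_open():
        print("Nothing is listening on port 10000; start word2vec_server.py first.")
```

The copied `config.properties` still needs your own database settings; the script only saves the copy step.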
TemplarCV
- Runs a cross-validation test on a specific dataset given some parameters.
After building, we can run:
java -cp build/libs/templar-all.jar edu.umich.templar.TemplarCV <dataset> <log_level> <log_join_on>
Choices for each argument:
- `<dataset>`: `mas`, `yelp`, `imdb`
- `<log_level>`: `full`, `no_const`, `no_const_op`
- `<log_join_on>`: `true`, `false`
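To avoid typos in the arguments, the command can be assembled by a small wrapper. This is a sketch only: `build_templar_cv_command` is a hypothetical helper, and the allowed values are copied from the choices listed above (extend `DATASETS` if you add your own dataset).

```python
# Allowed values, taken from the argument choices documented in this README.
DATASETS = ("mas", "yelp", "imdb")
LOG_LEVELS = ("full", "no_const", "no_const_op")


def build_templar_cv_command(dataset, log_level, log_join_on,
                             jar="build/libs/templar-all.jar"):
    """Return the TemplarCV command line as an argument list."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    if log_level not in LOG_LEVELS:
        raise ValueError(f"unknown log level: {log_level!r}")
    return [
        "java", "-cp", jar, "edu.umich.templar.TemplarCV",
        dataset, log_level, "true" if log_join_on else "false",
    ]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(..., check=True) to execute it.
    print(" ".join(build_templar_cv_command("mas", "full", True)))
```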
Since a lot of keywords are frequently reused in each dataset, we implemented a cache to speed up testing. This can be enabled or disabled by changing the `ENABLE_CACHE` setting in `edu.umich.templar.main.settings.Params`.
These caches are saved in `data/<dataset>/<dataset>.cands.cache`, so to clear the cache, just delete these files.
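Clearing the cache can also be scripted. A minimal sketch, assuming the script is run from the main project folder so that `data/` resolves correctly; `cache_path` and `clear_cache` are hypothetical helper names.

```python
from pathlib import Path


def cache_path(dataset, data_dir="data"):
    """Location of the candidate cache for a dataset, per the path pattern above."""
    return Path(data_dir) / dataset / f"{dataset}.cands.cache"


def clear_cache(dataset, data_dir="data"):
    """Delete the cache file for a dataset. Returns True if a file was removed."""
    path = cache_path(dataset, data_dir)
    if path.exists():
        path.unlink()
        return True
    return False


if __name__ == "__main__":
    for name in ("mas", "yelp", "imdb"):
        print(name, "cleared" if clear_cache(name) else "no cache found")
```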
In order to add a new dataset, you need to:
- Load the dataset with name `<dataset>` into MySQL.
- Create the folder `data/<dataset>`. Each dataset is required to have the following files (see existing datasets for examples):
  - `<dataset>_keywords.csv`: pre-parsed keywords, metadata, and answers. Note specifically that we allow multiple correct answers, separated by semicolons, and that pairs are given in comma-separated form. This formatting matters because our accuracy evaluation is done via string comparison.
  - `<dataset>_joins.csv`: correct join paths for each query. These are in a nested, parenthetical format, where the first table alphabetically is always first, a table's children are given in parentheses after it, and multiple children are separated by commas. For example, `author(organization,writes(publication))` is a join path where `author` is the first alphabetical table name, its children are `organization` and `writes`, and `writes` has `publication` as a child. This formatting matters because our accuracy evaluation is done via string comparison.
  - `<dataset>_all.sqls`: the correct SQL labels for each NLQ, one query per line. This is fed in as our query log.
  - `<dataset>.fkpk.json`: a JSON file listing all the foreign key-primary key relationships in the schema.
  - `<dataset>.main_attrs.json`: defines the main/display/default attributes for each relation.
  - `<dataset>.proj_attrs.json`: defines the paired attributes for each relation.
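Because join paths are compared as strings, it can help to generate them rather than write them by hand. The sketch below serializes a join tree into the nested, parenthetical form described above. It is an illustration, not the project's own code: `format_join_path` is a hypothetical helper, the input is assumed to be an acyclic list of table pairs, and ordering sibling tables alphabetically is an assumption (the README only states that the first table is the alphabetically first one), so check the output against the existing datasets.

```python
from collections import defaultdict


def format_join_path(edges):
    """Serialize an undirected join tree into the nested, parenthetical form.

    edges: iterable of (table_a, table_b) pairs, assumed to form a tree.
    """
    adjacency = defaultdict(set)
    for left, right in edges:
        adjacency[left].add(right)
        adjacency[right].add(left)
    if not adjacency:
        raise ValueError("need at least one join edge")

    # The alphabetically first table is always the root.
    root = min(adjacency)

    def render(table, parent):
        # Assumption: siblings are listed in alphabetical order.
        children = sorted(t for t in adjacency[table] if t != parent)
        if not children:
            return table
        return table + "(" + ",".join(render(c, table) for c in children) + ")"

    return render(root, None)


if __name__ == "__main__":
    joins = [("writes", "author"), ("author", "organization"), ("publication", "writes")]
    print(format_join_path(joins))  # author(organization,writes(publication))
```

The edges may be given in any order and direction; the root and nesting are derived from the table names alone.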