initial public commit
samuelbroscheit committed Nov 9, 2020
1 parent fb873e8 commit 6986e3e
Showing 40 changed files with 7,843 additions and 0 deletions.
163 changes: 163 additions & 0 deletions .gitignore
# Created by .ignore support plugin (hsz.mobi)
### JetBrains template
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and WebStorm
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839

# User-specific stuff
.idea/**/workspace.xml
.idea/**/tasks.xml
.idea/**/dictionaries
.idea/**/shelf

# Sensitive or high-churn files
.idea/**/dataSources/
.idea/**/dataSources.ids
.idea/**/dataSources.local.xml
.idea/**/sqlDataSources.xml
.idea/**/dynamic.xml
.idea/**/uiDesigner.xml

# Gradle
.idea/**/gradle.xml
.idea/**/libraries

# CMake
cmake-build-debug/
cmake-build-release/

# Mongo Explorer plugin
.idea/**/mongoSettings.xml

# File-based project format
*.iws

# IntelliJ
out/

# mpeltonen/sbt-idea plugin
.idea_modules/

# JIRA plugin
atlassian-ide-plugin.xml

# Cursive Clojure plugin
.idea/replstate.xml

# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties

# Editor-based Rest Client
.idea/httpRequests
### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

.idea/
*.avro

ignore
193 changes: 193 additions & 0 deletions README.md
# Can We Predict New Facts with Open Knowledge Graph Embeddings? A Benchmark for Open Link Prediction

This repository contains the code for the ACL 2020 paper [**"Can We Predict New Facts with Open Knowledge Graph Embeddings? A Benchmark for Open Link Prediction"**](https://www.aclweb.org/anthology/2020.acl-main.209/). The code is provided as a documentation for the paper and also for follow-up research.

# <p align="center"> <img src="docs/lp_vs_olp.png" alt="link prediction vs open link prediction" width="70%"> </p>

The content of this page covers the following topics:

1. [Preparation and Installation](#preparation-and-installation)
2. [Training Open Knowledge Graph Embedding Model on OLPBENCH](#training)
3. [Issues and possible improvements](#issues-and-possible-improvements)

## Preparation and Installation

- The project is installed as follows:

```
git clone https://github.com/samuelbroscheit/open_link_prediction_benchmark.git
cd open_link_prediction_benchmark
pip install -r requirements.txt
```
- Add the project paths to the environment:
```
source setup_paths
```
- Download OLPBENCH
Download the full dataset (compressed: ~2.4 GB, uncompressed: ~7.9 GB)
```
cd data
wget http://data.dws.informatik.uni-mannheim.de/olpbench/olpbench.tar.gz
tar xzf olpbench.tar.gz
cd ..
```
- Download OPIEC
**Only** needed if you want to recreate OLPBENCH from scratch!
Download the OPIEC-Clean dataset (compressed: ~35 GB, uncompressed: ~292.4 GB)
```
cd data
wget http://data.dws.informatik.uni-mannheim.de/opiec/OPIEC-Clean.zip
unzip OPIEC-Clean.zip
cd ..
```
Then download and start an Elasticsearch server that listens on localhost:9200 . This is usually as easy as downloading the most recent version, unzipping it, and changing the default configuration to
```
node.local: true # disable network
```
and then starting the server with `./bin/elasticsearch`. Then run the preprocessing with
```
python scripts/create_data.py -c config/preprocessing/prototype.yaml
```
There are two prepared configurations:
- [config/preprocessing/prototype.yaml](config/preprocessing/prototype.yaml), a configuration for prototyping
- [config/preprocessing/acl2020.yaml](config/preprocessing/acl2020.yaml), the configuration with the settings from the ACL 2020 study
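Before kicking off the preprocessing, it can help to verify that the Elasticsearch server is actually answering. A minimal connectivity check (a convenience sketch, not part of the repository; the URL matches the localhost:9200 default mentioned above):

```python
import urllib.request
import urllib.error

def server_is_up(url="http://localhost:9200", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # connection refused, DNS failure, or timeout
        return False
```

If `server_is_up()` returns `False`, check that Elasticsearch was started and that it is bound to the expected host and port.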
## Training
Once preparation and installation are finished, you can train a model on OLPBENCH.
1. [Run training](#run-training)
2. [Prepared configurations](#prepared-configurations)
3. [Available options](#available-options)
### Run training
Run the training with:
```
python scripts/train.py [TRAIN_CONFIG_YAML] [OPTIONS]
```
TRAIN_CONFIG_YAML is a YAML config file. The possible options are documented in
[openkge/default.yaml](openkge/default.yaml)
All top-level options can also be set on the command line.
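As a sketch of how such top-level overrides typically work (a hypothetical helper, not the repository's actual option handling), the defaults loaded from the YAML file can be registered as command-line flags and overlaid with whatever the user passes:

```python
import argparse

def load_config_with_overrides(defaults, cli_args):
    """Overlay --key value pairs from the command line onto a dict of
    default options (top-level keys only)."""
    parser = argparse.ArgumentParser()
    for key, value in defaults.items():
        # register each top-level option so it can be overridden;
        # the default's type determines how the string is parsed
        parser.add_argument(f"--{key}", type=type(value), default=value)
    args = parser.parse_args(cli_args)
    return vars(args)

# e.g. defaults as they might be loaded from openkge/default.yaml
defaults = {"batch_size": 512, "lr": 0.1}
config = load_config_with_overrides(defaults, ["--batch_size", "1024"])
# config["batch_size"] is now 1024, config["lr"] keeps its default 0.1
```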
### Run evaluation
Run evaluation on test data with:
```
python scripts/train.py --resume data/experiments/.../checkpoint.pth.tar --evaluate True --evaluate_on_validation False
```
- _--resume_ expects the path to a checkpoint file
- _--evaluate_on_validation False_ sets the evaluation to run on test data
### Prepared configurations
In the config folder you will find the following configurations:
- [config/acl2020-openlink/wikiopenlink-thorough-complex-lstm.yaml](config/acl2020-openlink/wikiopenlink-thorough-complex-lstm.yaml) is a configuration to train an OpenKGE model on the open link prediction benchmark data.
### Models
###### Lookup based models (standard KGE)
- LookupTucker3RelationModel
- LookupDistmultRelationModel
- LookupComplexRelationModel
###### Token based models
*Compute the entity and relation embeddings by pooling token embeddings*
- UnigramPoolingComplexRelationModel
*Compute the entity and relation embeddings with a sliding window CNN*
- BigramPoolingComplexRelationModel
*Compute the entity and relation embeddings with an LSTM*
- LSTMDistmultRelationModel
- LSTMComplexRelationModel
- LSTMTucker3RelationModel
###### Diagnostic models
- DataBiasOnlyEntityModel
- DataBiasOnlyRelationModel
For model options, see the `__init__` of the respective class. Additional combinations of score and embedding functions can easily be created:
```
class BigramPoolingDistmultRelationModel(DistmultRelationScorer, BigramPoolingRelationEmbedder):

def __init__(self, **kwargs):
super().__init__(**kwargs)
```
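This pattern relies on Python's cooperative multiple inheritance: each base class forwards `**kwargs` through `super().__init__`, so both the scorer's and the embedder's initializers run in MRO order. A toy illustration with hypothetical stand-in classes (not the repository's actual API):

```python
class DistmultScorer:
    def __init__(self, **kwargs):
        super().__init__(**kwargs)  # continue along the MRO
        self.score_fn = "distmult"

class BigramEmbedder:
    def __init__(self, **kwargs):
        super().__init__(**kwargs)  # continue along the MRO
        self.embedder = "bigram-pooling"

class BigramDistmultModel(DistmultScorer, BigramEmbedder):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

# MRO: BigramDistmultModel -> DistmultScorer -> BigramEmbedder -> object,
# so a single super().__init__ chain initializes both mixins.
m = BigramDistmultModel()
```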
## Citation
If you find this code useful for your research, please cite:
```
@inproceedings{broscheit-etal-2020-predict,
title = "Can We Predict New Facts with Open Knowledge Graph Embeddings? A Benchmark for Open Link Prediction",
author = "Broscheit, Samuel and
Gashteovski, Kiril and
Wang, Yanjie and
Gemulla, Rainer",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.209",
doi = "10.18653/v1/2020.acl-main.209",
pages = "2296--2308",
}
```