initial public commit
samuelbroscheit committed Nov 9, 2020
1 parent fb873e8 commit 6986e3e
Showing 40 changed files with 7,843 additions and 0 deletions.
163 changes: 163 additions & 0 deletions .gitignore
# Created by .ignore support plugin (hsz.mobi)
### JetBrains template
# Covers JetBrains IDEs: IntelliJ, RubyMine, PhpStorm, AppCode, PyCharm, CLion, Android Studio and WebStorm
# Reference: https://intellij-support.jetbrains.com/hc/en-us/articles/206544839

# User-specific stuff
.idea/**/workspace.xml
.idea/**/tasks.xml
.idea/**/dictionaries
.idea/**/shelf

# Sensitive or high-churn files
.idea/**/dataSources/
.idea/**/dataSources.ids
.idea/**/dataSources.local.xml
.idea/**/sqlDataSources.xml
.idea/**/dynamic.xml
.idea/**/uiDesigner.xml

# Gradle
.idea/**/gradle.xml
.idea/**/libraries

# CMake
cmake-build-debug/
cmake-build-release/

# Mongo Explorer plugin
.idea/**/mongoSettings.xml

# File-based project format
*.iws

# IntelliJ
out/

# mpeltonen/sbt-idea plugin
.idea_modules/

# JIRA plugin
atlassian-ide-plugin.xml

# Cursive Clojure plugin
.idea/replstate.xml

# Crashlytics plugin (for Android Studio and IntelliJ)
com_crashlytics_export_strings.xml
crashlytics.properties
crashlytics-build.properties
fabric.properties

# Editor-based Rest Client
.idea/httpRequests
### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

.idea/
*.avro

ignore
193 changes: 193 additions & 0 deletions README.md
# Can We Predict New Facts with Open Knowledge Graph Embeddings? A Benchmark for Open Link Prediction

This repository contains the code for the ACL 2020 paper [**"Can We Predict New Facts with Open Knowledge Graph Embeddings? A Benchmark for Open Link Prediction"**](https://www.aclweb.org/anthology/2020.acl-main.209/). The code is provided as a documentation for the paper and also for follow-up research.

# <p align="center"> <img src="docs/lp_vs_olp.png" alt="link prediction vs open link prediction" width="70%"> </p>

The content of this page covers the following topics:

1. [Preparation and Installation](#preparation-and-installation)
2. [Training Open Knowledge Graph Embedding Model on OLPBENCH](#training)
3. [Issues and possible improvements](#issues-and-possible-improvements)

## Preparation and Installation

- The project is installed as follows:

```
git clone https://github.com/samuelbroscheit/open_link_prediction_benchmark.git
cd open_link_prediction_benchmark
pip install -r requirements.txt
```
- Add the project paths to the environment:
```
source setup_paths
```
- Download OLPBENCH
Download the full dataset (compressed: ~2.4 GB, uncompressed: ~7.9 GB)
```
cd data
wget http://data.dws.informatik.uni-mannheim.de/olpbench/olpbench.tar.gz
tar xzf olpbench.tar.gz
cd ..
```
- Download OPIEC
**Only** needed if you want to recreate OLPBENCH from scratch!
Download the OPIEC-Clean dataset (compressed: ~35 GB, uncompressed: ~292.4 GB)
```
cd data
wget http://data.dws.informatik.uni-mannheim.de/opiec/OPIEC-Clean.zip
unzip OPIEC-Clean.zip
cd ..
```
Then download and start an Elasticsearch server that listens on localhost:9200 . This is usually as easy as downloading the most recent version, unzipping it, and changing the default configuration to
```
node.local: true # disable network
```
and then starting the server with `./bin/elasticsearch`. Then run the preprocessing with
```
python scripts/create_data.py -c config/preprocessing/prototype.yaml
```
There are two prepared configurations:
- [config/preprocessing/prototype.yaml](config/preprocessing/prototype.yaml), a configuration for prototyping
- [config/preprocessing/acl2020.yaml](config/preprocessing/acl2020.yaml), the configuration with the settings from the ACL 2020 study
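Before kicking off the preprocessing, it can help to verify that the Elasticsearch server is actually answering. A minimal connectivity check (a convenience sketch, not part of the repository; the URL matches the localhost:9200 default mentioned above):

```python
import urllib.request
import urllib.error

def server_is_up(url="http://localhost:9200", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # connection refused, DNS failure, or timeout
        return False
```

If `server_is_up()` returns `False`, check that Elasticsearch was started and that it is bound to the expected host and port.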
## Training
Once preparation and installation are finished, you can train a model on OLPBENCH.
1. [Run training](#run-training)
2. [Prepared configurations](#prepared-configurations)
3. [Available options](#available-options)
### Run training
Run the training with:
```
python scripts/train.py [TRAIN_CONFIG_YAML] [OPTIONS]
```
TRAIN_CONFIG_YAML is a YAML config file. The possible options are documented in
[openkge/default.yaml](openkge/default.yaml)
All top-level options can also be set on the command line.
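As a sketch of how such top-level overrides typically work (a hypothetical helper, not the repository's actual option handling), the defaults loaded from the YAML file can be registered as command-line flags and overlaid with whatever the user passes:

```python
import argparse

def load_config_with_overrides(defaults, cli_args):
    """Overlay --key value pairs from the command line onto a dict of
    default options (top-level keys only)."""
    parser = argparse.ArgumentParser()
    for key, value in defaults.items():
        # register each top-level option so it can be overridden;
        # the default's type determines how the string is parsed
        parser.add_argument(f"--{key}", type=type(value), default=value)
    args = parser.parse_args(cli_args)
    return vars(args)

# e.g. defaults as they might be loaded from openkge/default.yaml
defaults = {"batch_size": 512, "lr": 0.1}
config = load_config_with_overrides(defaults, ["--batch_size", "1024"])
# config["batch_size"] is now 1024, config["lr"] keeps its default 0.1
```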
### Run evaluation
Run evaluation on test data with:
```
python scripts/train.py --resume data/experiments/.../checkpoint.pth.tar --evaluate True --evaluate_on_validation False
```
- _--resume_ expects the path to a checkpoint file
- _--evaluate_on_validation False_ sets the evaluation to run on test data
### Prepared configurations
In the config folder you will find the following configurations:
- [config/acl2020-openlink/wikiopenlink-thorough-complex-lstm.yaml](config/acl2020-openlink/wikiopenlink-thorough-complex-lstm.yaml) is a configuration to train an OpenKGE model on the open link prediction benchmark data.
### Models
###### Lookup based models (standard KGE)
- LookupTucker3RelationModel
- LookupDistmultRelationModel
- LookupComplexRelationModel
###### Token based models
*Compute the entity and relation embeddings by pooling token embeddings*
- UnigramPoolingComplexRelationModel
*Compute the entity and relation embeddings with a sliding window CNN*
- BigramPoolingComplexRelationModel
*Compute the entity and relation embeddings with an LSTM*
- LSTMDistmultRelationModel
- LSTMComplexRelationModel
- LSTMTucker3RelationModel
###### Diagnostic models
- DataBiasOnlyEntityModel
- DataBiasOnlyRelationModel
For model options, see the `__init__` of the respective class. Additional combinations of score and embedding functions can easily be created:
```
class BigramPoolingDistmultRelationModel(DistmultRelationScorer, BigramPoolingRelationEmbedder):

def __init__(self, **kwargs):
super().__init__(**kwargs)
```
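This pattern relies on Python's cooperative multiple inheritance: each base class forwards `**kwargs` through `super().__init__`, so both the scorer's and the embedder's initializers run in MRO order. A toy illustration with hypothetical stand-in classes (not the repository's actual API):

```python
class DistmultScorer:
    def __init__(self, **kwargs):
        super().__init__(**kwargs)  # continue along the MRO
        self.score_fn = "distmult"

class BigramEmbedder:
    def __init__(self, **kwargs):
        super().__init__(**kwargs)  # continue along the MRO
        self.embedder = "bigram-pooling"

class BigramDistmultModel(DistmultScorer, BigramEmbedder):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

# MRO: BigramDistmultModel -> DistmultScorer -> BigramEmbedder -> object,
# so a single super().__init__ chain initializes both mixins.
m = BigramDistmultModel()
```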
## Citation
If you find this code useful for your research, please cite:
```
@inproceedings{broscheit-etal-2020-predict,
title = "Can We Predict New Facts with Open Knowledge Graph Embeddings? A Benchmark for Open Link Prediction",
author = "Broscheit, Samuel and
Gashteovski, Kiril and
Wang, Yanjie and
Gemulla, Rainer",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.209",
doi = "10.18653/v1/2020.acl-main.209",
pages = "2296--2308",
}
```