Address two issues: Add FAQ. Add info, code and more data to run code on FB15k237.
samuelbroscheit committed Dec 15, 2021
1 parent 1ce37a4 commit 8f5bcc5
Showing 8 changed files with 89 additions and 15 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -160,4 +160,7 @@ venv.bak/
.idea/
*.avro

-ignore
+ignore
+/data/local/
+/data/fb15k237/mapped_to_ids/
+/data/experiments/
33 changes: 26 additions & 7 deletions README.md
@@ -74,7 +74,7 @@ All top level options can also be set on the command line and override the yaml
If you run training on a dataset the first time some indexes will be created and cached. For OLPBENCH this can take around 30 minutes and up to 10-20 GB of main memory! After the cached files are created the startup takes under 1 minute.
-###### Prepared configurations
+##### Prepared configurations
A token-based model for the OLPBench benchmark.
@@ -110,27 +110,27 @@ _--evaluate_on_validation False_ sets the evaluation to run on test data
See [openkge/model.py](openkge/model.py)
-###### Lookup based models (standard KGE)
+##### Lookup based models (standard KGE)
- LookupTucker3RelationModel
- LookupDistmultRelationModel
- LookupComplexRelationModel
-###### Token based model to compute the entity and relation embeddings by pooling token embeddings
+##### Token based model to compute the entity and relation embeddings by pooling token embeddings
- UnigramPoolingComplexRelationModel
-###### Token based model to compute the entity and relation embeddings with a sliding window CNN
+##### Token based model to compute the entity and relation embeddings with a sliding window CNN
- BigramPoolingComplexRelationModel
-###### Token based model to compute the entity and relation embeddings with an LSTM
+##### Token based model to compute the entity and relation embeddings with an LSTM
- LSTMDistmultRelationModel
- LSTMComplexRelationModel
- LSTMTucker3RelationModel
-###### Diagnostic models
+##### Diagnostic models
- DataBiasOnlyEntityModel
- DataBiasOnlyRelationModel
@@ -173,7 +173,7 @@ and then start the server with ./bin/elasticsearch. Then run the preprocessing
```
python scripts/create_data.py -c config/preprocessing/prototype.yaml
```
-###### Prepared configurations to create OLPBENCH from scratch
+##### Prepared configurations to create OLPBENCH from scratch
- [config/preprocessing/prototype.yaml](config/preprocessing/prototype.yaml) a configuration for prototyping the pipeline
@@ -182,9 +182,28 @@ python scripts/create_data.py -c config/preprocessing/prototype.yaml
## Use this code for experiments on FB15k237
##### Prepare data
```
cd data/fb15k237
python prepare_fb237.py
```
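The script reads `train.txt`, `valid.txt` and `test.txt` from `data/fb15k237` and writes everything the prepared configurations below expect into `data/fb15k237/mapped_to_ids/`. A sketch of the resulting layout, based on what `prepare_fb237.py` saves (the exact vocabulary file names are produced by `IndexMapper.save_vocab` and `save_id_to_tokens_map` and may differ):

```
data/fb15k237/mapped_to_ids/
  train.txt     # triples with entities/relations mapped to ids
  valid.txt
  test.txt
  entity...     # entity vocabulary and id-to-token-ids map
  relation...   # relation vocabulary and id-to-token-ids map
```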
##### Prepared configurations
A token-based model:
- [config/fb15k237/fb15k237-complex-lstm.yaml](config/fb15k237/fb15k237-complex-lstm.yaml) is a configuration to train an OpenKGE model on FB15k237 using token descriptions of the data.
- [config/fb15k237/fb15k237-complex-kge.yaml](config/fb15k237/fb15k237-complex-kge.yaml) is a configuration to train an OpenKGE model on FB15k237 using standard KGE lookup embeddings of the data.
## FAQ
##### What is the meaning of the prefixes for some relation tokens?
This is additional information about how a triple was extracted. For example, *has:impl_poss-clause* was not extracted from a sentence that explicitly said "New York has a mayor ...", but from an implicit possessive relation, similar to the construction in "New York's mayor ...". OLPBENCH is based on OPIEC (https://openreview.net/forum?id=HJxeGb5pTm), which was created with the system MINIE (see the implicit extractions in https://aclanthology.org/D17-1278.pdf), which in turn uses the patterns described in FINET (https://aclanthology.org/D15-1103.pdf). Check out the last two papers to learn more about the implicit extractions that can occur in this dataset. Some of those patterns can be noisy and might therefore require special treatment, which is why this information is kept in the data. For instance, for the evaluation data we chose the heuristic to sample only from relations with three or more words, which automatically excluded some implicit extractions from evaluation. If a model cannot handle this additional information, a simple approach is to just ignore everything after the colon. In our work we treated *has:impl_poss-clause* as a different token from *has*.
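If a model should not distinguish annotated tokens from plain ones, stripping the annotation is a one-liner. A minimal sketch in Python (the helper name is ours, not part of the codebase):

```
def strip_extraction_annotation(token):
    # "has:impl_poss-clause" -> "has"; tokens without a colon are unchanged
    return token.split(':', 1)[0]

assert strip_extraction_annotation('has:impl_poss-clause') == 'has'
assert strip_extraction_annotation('mayor') == 'mayor'
```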
## Citation
4 changes: 2 additions & 2 deletions config/fb15k237/fb15k237-complex-kge.yaml
@@ -7,7 +7,7 @@ no_cuda: false
# seed for the RNG
seed: 0
# save output to base directory
-results_dir: null
+results_dir: data/experiments/fb237/
# if left empty (null, None) then a name is automatically generated
experiment_dir: null

@@ -149,7 +149,7 @@ patience_metric_max_treshold: null
#
# max_size_prefix_label: 64

-dataset_dir: data/fb15k237
+dataset_dir: data/fb15k237/mapped_to_ids

dataset_class: OneToNMentionRelationDataset

4 changes: 2 additions & 2 deletions config/fb15k237/fb15k237-complex-lstm.yaml
@@ -7,7 +7,7 @@ no_cuda: false
# seed for the RNG
seed: 0
# save output to base directory
-results_dir: data/experiments/fb15k237/
+results_dir: data/experiments/fb237/
# if left empty (null, None) then a name is automatically generated
experiment_dir: null

@@ -151,7 +151,7 @@ patience_metric_max_treshold: null
#
# max_size_prefix_label: 64

-dataset_dir: data/fb15k237
+dataset_dir: data/fb15k237/mapped_to_ids

dataset_class: OneToNMentionRelationDataset

4 changes: 2 additions & 2 deletions config/fb15k237/fb15k237-complex-unigrampool.yaml
@@ -7,7 +7,7 @@ no_cuda: false
# seed for the RNG
seed: 0
# save output to base directory
-results_dir: data/experiments/fb15k237/
+results_dir: data/experiments/fb237/
# if left empty (null, None) then a name is automatically generated
experiment_dir: null

@@ -151,7 +151,7 @@ patience_metric_max_treshold: null
#
# max_size_prefix_label: 64

-dataset_dir: data/fb15k237
+dataset_dir: data/fb15k237/mapped_to_ids

dataset_class: OneToNMentionRelationDataset

Binary file added data/fb15k237/mid2name.tsv.gz
52 changes: 52 additions & 0 deletions data/fb15k237/prepare_fb237.py
@@ -0,0 +1,52 @@
import gzip
import os

from openkge.index_mapper import IndexMapper
from utils.map_dataset_to_ids import convert_datasets, save_to_file, save_id_to_tokens_map

if __name__ == "__main__":

    # mid2name.tsv.gz maps a Freebase MID to its name, one entity per line:
    # the first whitespace-separated field is the MID, the rest are name tokens.
    with gzip.open('mid2name.tsv.gz', 'rb') as f:
        mid2name = {
            mid_name[0].decode(): [m.decode() for m in mid_name[1:]]
            for mid_name in [line.split() for line in f.readlines()]
        }

    # Entities are segmented into their name tokens; a MID without an entry
    # in mid2name falls back to the MID itself as a single token.
    entity_index_mapper = IndexMapper(
        segment=True,
        segment_func=lambda line: mid2name.get(line, [line])
    )

    # Relation ids such as /people/person/nationality are segmented by
    # splitting on '/', '.' and '_'.
    relation_index_mapper = IndexMapper(
        segment=True,
        segment_func=lambda line: line.replace('/', ' / ').replace('.', ' . ').replace('_', ' ').split(),
    )

    with open('train.txt') as train, open('valid.txt') as valid, open('test.txt') as test:
        train_converted, \
        valid_converted, \
        test_converted, \
        entity_id_token_ids_map, \
        relation_id_token_ids_map = convert_datasets(
            train=train.readlines(),
            valid=valid.readlines(),
            test=test.readlines(),
            subj_index_mapper=entity_index_mapper,
            obj_index_mapper=entity_index_mapper,
            rel_index_mapper=relation_index_mapper,
            triple_format_parser=lambda x: x.strip().split(),
            segment=True,
        )

    if not os.path.exists('mapped_to_ids'):
        os.makedirs('mapped_to_ids')

    # Persist the vocabularies and the id -> token-ids maps ...
    entity_index_mapper.save_vocab(os.path.join('mapped_to_ids', 'entity'))
    relation_index_mapper.save_vocab(os.path.join('mapped_to_ids', 'relation'))

    save_id_to_tokens_map('mapped_to_ids', 'entity', entity_id_token_ids_map)
    save_id_to_tokens_map('mapped_to_ids', 'relation', relation_id_token_ids_map)

    # ... and the converted triple files next to them.
    save_to_file('mapped_to_ids', 'train.txt', train_converted)
    save_to_file('mapped_to_ids', 'valid.txt', valid_converted)
    save_to_file('mapped_to_ids', 'test.txt', test_converted)

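As an aside, the two `segment_func` lambdas above are easy to sanity-check in isolation. A small sketch (the MID-to-name entry is illustrative; real entries come from `mid2name.tsv.gz`):

```
# Relation segmentation: Freebase paths are split on '/', '.' and '_'.
rel = '/people/person/nationality'
print(rel.replace('/', ' / ').replace('.', ' . ').replace('_', ' ').split())
# ['/', 'people', '/', 'person', '/', 'nationality']

# Entity segmentation: a MID maps to its name tokens, unknown MIDs to themselves.
mid2name = {'/m/02mjmr': ['Barack', 'Obama']}    # illustrative entry
print(mid2name.get('/m/02mjmr', ['/m/02mjmr']))  # ['Barack', 'Obama']
print(mid2name.get('/m/xyz', ['/m/xyz']))        # ['/m/xyz']
```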
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,4 +1,4 @@
-torch>=1.5.1
+torch>=1.10.0
elasticsearch
avro
numpy
