Address two issues: Add FAQ. Add info, code and more data to run code on FB15k237.
samuelbroscheit committed Dec 15, 2021
1 parent 1ce37a4 commit 8f5bcc5
Showing 8 changed files with 89 additions and 15 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -160,4 +160,7 @@ venv.bak/
.idea/
*.avro

-ignore
+ignore
+/data/local/
+/data/fb15k237/mapped_to_ids/
+/data/experiments/
33 changes: 26 additions & 7 deletions README.md
@@ -74,7 +74,7 @@ All top level options can also be set on the command line and override the yaml
If you run training on a dataset the first time some indexes will be created and cached. For OLPBENCH this can take around 30 minutes and up to 10-20 GB of main memory! After the cached files are created the startup takes under 1 minute.
-###### Prepared configurations
+##### Prepared configurations
A token-based model for the OLPBench benchmark.
@@ -110,27 +110,27 @@ _--evaluate_on_validation False_ sets the evaluation to run on test data
See [openkge/model.py](openkge/model.py)
-###### Lookup based models (standard KGE)
+##### Lookup based models (standard KGE)
- LookupTucker3RelationModel
- LookupDistmultRelationModel
- LookupComplexRelationModel
-###### Token based model to compute the entity and relation embeddings by pooling token embeddings
+##### Token based model to compute the entity and relation embeddings by pooling token embeddings
- UnigramPoolingComplexRelationModel
-###### Token based model to compute the entity and relation embeddings with a sliding window CNN
+##### Token based model to compute the entity and relation embeddings with a sliding window CNN
- BigramPoolingComplexRelationModel
-###### Token based model to compute the entity and relation embeddings with an LSTM
+##### Token based model to compute the entity and relation embeddings with an LSTM
- LSTMDistmultRelationModel
- LSTMComplexRelationModel
- LSTMTucker3RelationModel
-###### Diagnostic models
+##### Diagnostic models
- DataBiasOnlyEntityModel
- DataBiasOnlyRelationModel
@@ -173,7 +173,7 @@ and then start the server with ./bin/elasticsearch. Then run the preprocessing
```
python scripts/create_data.py -c config/preprocessing/prototype.yaml
```
-###### Prepared configurations to create OLPBENCH from scratch
+##### Prepared configurations to create OLPBENCH from scratch
- [config/preprocessing/prototype.yaml](config/preprocessing/prototype.yaml) a configuration for prototyping the pipeline
@@ -182,9 +182,28 @@ python scripts/create_data.py -c config/preprocessing/prototype.yaml
## Use this code for experiments on FB15k237
##### Prepare data
```
cd data/fb15k237
python prepare_fb237.py
```
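The script reads `train.txt`, `valid.txt` and `test.txt` from `data/fb15k237` and writes everything the prepared configurations below expect into `data/fb15k237/mapped_to_ids/`. A sketch of the resulting layout, based on what `prepare_fb237.py` saves (the exact vocabulary file names are produced by `IndexMapper.save_vocab` and `save_id_to_tokens_map` and may differ):

```
data/fb15k237/mapped_to_ids/
  train.txt     # triples with entities/relations mapped to ids
  valid.txt
  test.txt
  entity...     # entity vocabulary and id-to-token-ids map
  relation...   # relation vocabulary and id-to-token-ids map
```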
##### Prepared configurations
A token-based model:
- [config/fb15k237/fb15k237-complex-lstm.yaml](config/fb15k237/fb15k237-complex-lstm.yaml) is a configuration to train an OpenKGE model on FB15k237 using token descriptions of the data.
- [config/fb15k237/fb15k237-complex-kge.yaml](config/fb15k237/fb15k237-complex-kge.yaml) is a configuration to train an OpenKGE model on FB15k237 using standard KGE lookup embeddings of the data.
## FAQ
##### What is the meaning of the prefixes for some relation tokens?
This is additional information about how a triple was extracted. For example, *has:impl_poss-clause* was not extracted from a sentence that explicitly said "New York has a mayor ...", but from an implicit possessive relation, similar to the construction in "New York's mayor ...". OLPBENCH is based on OPIEC (https://openreview.net/forum?id=HJxeGb5pTm), which was created with the system MINIE (see the implicit extractions in https://aclanthology.org/D17-1278.pdf), which in turn uses the patterns described in FINET (https://aclanthology.org/D15-1103.pdf). Check out the last two papers to learn more about the implicit extractions that can occur in this dataset. Some of those patterns can be noisy and might therefore require special treatment, which is why this information is kept in the data. For instance, for the evaluation data we chose the heuristic to sample only from relations with three or more words, which automatically excluded some implicit extractions from evaluation. If a model cannot handle this additional information, a simple approach is to just ignore everything after the colon. In our work we treated *has:impl_poss-clause* as a different token from *has*.
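If a model should not distinguish annotated tokens from plain ones, stripping the annotation is a one-liner. A minimal sketch in Python (the helper name is ours, not part of the codebase):

```
def strip_extraction_annotation(token):
    # "has:impl_poss-clause" -> "has"; tokens without a colon are unchanged
    return token.split(':', 1)[0]

assert strip_extraction_annotation('has:impl_poss-clause') == 'has'
assert strip_extraction_annotation('mayor') == 'mayor'
```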
## Citation
4 changes: 2 additions & 2 deletions config/fb15k237/fb15k237-complex-kge.yaml
@@ -7,7 +7,7 @@ no_cuda: false
# seed for the RNG
seed: 0
# save output to base directory
-results_dir: null
+results_dir: data/experiments/fb237/
# if left empty (null, None) then a name is automatically generated
experiment_dir: null

@@ -149,7 +149,7 @@ patience_metric_max_treshold: null
#
# max_size_prefix_label: 64

-dataset_dir: data/fb15k237
+dataset_dir: data/fb15k237/mapped_to_ids

dataset_class: OneToNMentionRelationDataset

4 changes: 2 additions & 2 deletions config/fb15k237/fb15k237-complex-lstm.yaml
@@ -7,7 +7,7 @@ no_cuda: false
# seed for the RNG
seed: 0
# save output to base directory
-results_dir: data/experiments/fb15k237/
+results_dir: data/experiments/fb237/
# if left empty (null, None) then a name is automatically generated
experiment_dir: null

@@ -151,7 +151,7 @@ patience_metric_max_treshold: null
#
# max_size_prefix_label: 64

-dataset_dir: data/fb15k237
+dataset_dir: data/fb15k237/mapped_to_ids

dataset_class: OneToNMentionRelationDataset

4 changes: 2 additions & 2 deletions config/fb15k237/fb15k237-complex-unigrampool.yaml
@@ -7,7 +7,7 @@ no_cuda: false
# seed for the RNG
seed: 0
# save output to base directory
-results_dir: data/experiments/fb15k237/
+results_dir: data/experiments/fb237/
# if left empty (null, None) then a name is automatically generated
experiment_dir: null

@@ -151,7 +151,7 @@ patience_metric_max_treshold: null
#
# max_size_prefix_label: 64

-dataset_dir: data/fb15k237
+dataset_dir: data/fb15k237/mapped_to_ids

dataset_class: OneToNMentionRelationDataset

Binary file added data/fb15k237/mid2name.tsv.gz
52 changes: 52 additions & 0 deletions data/fb15k237/prepare_fb237.py
@@ -0,0 +1,52 @@
import gzip
import os

from openkge.index_mapper import IndexMapper
from utils.map_dataset_to_ids import convert_datasets, save_to_file, save_id_to_tokens_map

if __name__ == "__main__":

    # mid2name.tsv.gz maps a Freebase MID to its name, one entity per line:
    # the first whitespace-separated field is the MID, the rest are name tokens.
    with gzip.open('mid2name.tsv.gz', 'rb') as f:
        mid2name = {
            mid_name[0].decode(): [m.decode() for m in mid_name[1:]]
            for mid_name in [line.split() for line in f.readlines()]
        }

    # Entities are segmented into their name tokens; a MID without an entry
    # in mid2name falls back to the MID itself as a single token.
    entity_index_mapper = IndexMapper(
        segment=True,
        segment_func=lambda line: mid2name.get(line, [line])
    )

    # Relation ids such as /people/person/nationality are segmented by
    # splitting on '/', '.' and '_'.
    relation_index_mapper = IndexMapper(
        segment=True,
        segment_func=lambda line: line.replace('/', ' / ').replace('.', ' . ').replace('_', ' ').split(),
    )

    with open('train.txt') as train, open('valid.txt') as valid, open('test.txt') as test:
        train_converted, \
        valid_converted, \
        test_converted, \
        entity_id_token_ids_map, \
        relation_id_token_ids_map = convert_datasets(
            train=train.readlines(),
            valid=valid.readlines(),
            test=test.readlines(),
            subj_index_mapper=entity_index_mapper,
            obj_index_mapper=entity_index_mapper,
            rel_index_mapper=relation_index_mapper,
            triple_format_parser=lambda x: x.strip().split(),
            segment=True,
        )

    if not os.path.exists('mapped_to_ids'):
        os.makedirs('mapped_to_ids')

    # Persist the vocabularies and the id -> token-ids maps ...
    entity_index_mapper.save_vocab(os.path.join('mapped_to_ids', 'entity'))
    relation_index_mapper.save_vocab(os.path.join('mapped_to_ids', 'relation'))

    save_id_to_tokens_map('mapped_to_ids', 'entity', entity_id_token_ids_map)
    save_id_to_tokens_map('mapped_to_ids', 'relation', relation_id_token_ids_map)

    # ... and the converted triple files next to them.
    save_to_file('mapped_to_ids', 'train.txt', train_converted)
    save_to_file('mapped_to_ids', 'valid.txt', valid_converted)
    save_to_file('mapped_to_ids', 'test.txt', test_converted)

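As an aside, the two `segment_func` lambdas above are easy to sanity-check in isolation. A small sketch (the MID-to-name entry is illustrative; real entries come from `mid2name.tsv.gz`):

```
# Relation segmentation: Freebase paths are split on '/', '.' and '_'.
rel = '/people/person/nationality'
print(rel.replace('/', ' / ').replace('.', ' . ').replace('_', ' ').split())
# ['/', 'people', '/', 'person', '/', 'nationality']

# Entity segmentation: a MID maps to its name tokens, unknown MIDs to themselves.
mid2name = {'/m/02mjmr': ['Barack', 'Obama']}    # illustrative entry
print(mid2name.get('/m/02mjmr', ['/m/02mjmr']))  # ['Barack', 'Obama']
print(mid2name.get('/m/xyz', ['/m/xyz']))        # ['/m/xyz']
```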
2 changes: 1 addition & 1 deletion requirements.txt
@@ -1,4 +1,4 @@
-torch>=1.5.1
+torch>=1.10.0
elasticsearch
avro
numpy
