Converting Colab notebook results to CoNLL format #74


Open
linguist89 opened this issue Dec 16, 2020 · 11 comments


@linguist89

I've been running the notebook and getting the results to work fine, but I want to convert the results into the CoNLL format so that I can compare documents from the CRAFT corpus using the LEA metric. Is there any way to convert the output file (i.e. sample.out.txt) to CoNLL format?

@mandarjoshi90
Owner

mandarjoshi90 commented Dec 16, 2020

I haven't used the notebook, so I might be missing something. If I understand this right, you're trying to convert the jsonlines output of predict.py. Perhaps you could directly use evaluate.py? It will create a temp file in CoNLL format (look through the log of evaluate.py for a file in /tmp), which is then processed by the Perl scorer.

@linguist89
Author

Thanks for the response. I've been trying that a few different ways, but it doesn't save any temp files. I'm running the evaluate script like this, with $CHOSEN_MODEL being bert_base:

```
!python evaluate.py $CHOSEN_MODEL
```

and the paths in the environment.conf file are as follows:

```
train_path = ${data_dir}/train.english.128.jsonlines
eval_path = ${data_dir}/dev.english.128.jsonlines
conll_eval_path = ${data_dir}/dev.english.v4_gold_conll
```

I have set eval_mode to false in evaluate.py because I don't need to compare anything against CoNLL; I just need that temp file (the CoNLL-format one you mentioned). It does not produce any file, though; it just prints the results (which are 0%, but that's expected). I have saved the output of predict.py as dev.english.128.jsonlines, so the predictions are loaded as the dev set. Is this the correct way to use it?
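For reference, this is roughly how I'm sanity-checking the predict.py output before using it as the dev set (my own sketch; the doc_key and predicted_clusters field names are what I see in my jsonlines file, adjust if yours differs):

```python
import json

# Inspect the predict.py jsonlines output before using it as the dev set.
# Field names (doc_key, predicted_clusters) are taken from my output file;
# adjust them if your jsonlines looks different.
def inspect_predictions(path):
    keys = []
    with open(path) as f:
        for line in f:
            example = json.loads(line)
            # doc_key has to match the key used in the gold CoNLL file
            keys.append((example.get("doc_key"),
                         len(example.get("predicted_clusters", []))))
    return keys
```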

@mandarjoshi90
Owner

Yeah that's correct. The specific lines you need should be something like this:

```
Use standard file APIs to check for files with this prefix.
Loaded 343 eval examples.
Predicted conll file: /tmp/tmpi6yrp0jo
Official result for muc
version: 8.01 /data/BERT-coref/coref/conll-2012/scorer/v8.01/lib/CorScorer.pm

====== TOTALS =======
...
```

IIRC, this should be the relevant code: https://github.com/mandarjoshi90/coref/blob/master/conll.py#L92

@linguist89
Author

I get that output and the file in the tmp directory, but the file is always empty, so I'm not sure what's going on.
Also, with the conll.py file, I can't really make heads or tails of how the input data should be structured. Is there a way to figure that out? I've tried to work through the code but can't get results from it. Any advice?

@mandarjoshi90
Owner

I see. I'm not quite sure why you're getting the output in the jsonlines file but not the tmp file. I can't think of anything that's obviously wrong. Have you tried stepping through this function? At the very least, the variables predictions and subtoken_map should be populated, and if so, that would indicate a problem further down the pipeline.

https://github.com/mandarjoshi90/coref/blob/master/conll.py#L17
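Something like this hypothetical helper might narrow it down (a sketch, not code from the repo): it checks that both dicts are keyed by the same doc_key strings that appear in the gold file, and that the predictions aren't empty.

```python
# Hypothetical debugging helper (not from the repo): before output_conll
# runs, predictions and subtoken_maps should both be keyed by the same
# doc_key strings that appear in the gold CoNLL file.
def check_alignment(predictions, subtoken_maps, gold_doc_keys):
    return {
        "missing_in_predictions": [k for k in gold_doc_keys
                                   if k not in predictions],
        "empty_predictions": [k for k, v in predictions.items() if not v],
        "missing_subtoken_maps": [k for k in gold_doc_keys
                                  if k not in subtoken_maps],
    }
```

If any of the three lists comes back non-empty, that would indicate a mismatch between the jsonlines doc_keys and the gold file rather than a problem in output_conll itself.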

@handesirikci

@linguist89 I’m also dealing with the same problem as you. Did you find any solution?

@linguist89
Author

@handesirikci I haven't revisited the problem in a while because my time has been taken up by a different project. I'm going to have to get back to it soon, though. In the meantime, here is something that could be tried:

  1. Using the Colab Notebook's code for converting the BERT output into human-readable clusters, I think you could add those clusters to a CoNLL-parsed (without coref) version of the output.
  2. People format CoNLL in many different ways (i.e. the number of columns is not consistent), but from what I've seen, the coref annotation is always the last column.

This might be an approach you could try. I will be revisiting this problem in a week or so, but if you manage to figure it out before then, please post it here.
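For point 2 above, here is a rough sketch (my own, not code from any repo) of how cluster spans map onto that final coref column: "(id" opens a mention, "id)" closes one, "(id)" marks a single-token mention, and "-" means no mention on that token.

```python
# My own sketch of the last CoNLL column: "(id" opens a mention, "id)"
# closes one, "(id)" marks a single-token mention, and "-" means none.
# clusters is a list of clusters, each a list of (start, end) token spans.
def coref_column(num_tokens, clusters):
    labels = [[] for _ in range(num_tokens)]
    for cluster_id, spans in enumerate(clusters):
        for start, end in spans:
            if start == end:
                labels[start].append("({})".format(cluster_id))
            else:
                labels[start].append("({}".format(cluster_id))
                labels[end].append("{})".format(cluster_id))
    return ["|".join(cell) if cell else "-" for cell in labels]
```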

@handesirikci

@linguist89 We finally found a way to get the results printed to the tmp directory and also obtained the evaluation results. We found thirty tokenized, gold-annotated files in this repo. We gave the already-tokenized data to the model as input and put the gold-annotated version of the article in the file named "dev.english.v4_gold_conll". But you have to change the first column of the gold-annotated file from the doc id to the genre name, which is "nw" in our case.

@linguist89
Author

@handesirikci Thanks for the update. I've done everything you specified, but I get the following error:

```
Traceback (most recent call last):
  File "evaluate.py", line 26, in
    model.evaluate(session, official_stdout=True, eval_mode=True)
  File "/content/coref/independent.py", line 564, in evaluate
    conll_results = conll.evaluate_conll(self.config["conll_eval_path"], coref_predictions, self.subtoken_maps, official_stdout)
  File "/content/coref/conll.py", line 95, in evaluate_conll
    output_conll(gold_file, prediction_file, predictions, subtoken_maps)
  File "/content/coref/conll.py", line 46, in output_conll
    start_map, end_map, word_map = prediction_map[doc_key]
KeyError: 'nw_0'
```

I changed the name of the first column, but it's giving me this error. Did you just change it to "nw" or were there other characters as well?

@handesirikci

@linguist89 If you got this error message, you have to change the genre name to "nw_0". Hope it works!
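The rename is roughly this (a hypothetical rename_first_column helper, just to illustrate; not code we actually ran):

```python
# Hypothetical helper illustrating the rename: set the first column of every
# token line in the gold CoNLL file to the doc_key ("nw_0") that the
# predictions dict is keyed on. Comment lines (#begin/#end) and blank lines
# are left unchanged; the document id in the #begin header may also need to
# match, depending on your setup.
def rename_first_column(lines, new_key="nw_0"):
    out = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            out.append(line)
        else:
            cols = stripped.split()
            cols[0] = new_key
            out.append(" ".join(cols))
    return out
```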

@sayalighodekar

> I've been running the notebook and getting the results to work fine, but I want to convert the results into the CoNLL format so that I can compare documents from the CRAFT corpus using the LEA metric. Is there any way to convert the output file (i.e. sample.out.txt) to CoNLL format?

https://github.com/boberle/corefconversion
Hope this repo helps.
