Proposal: Evaluator and/or Benchmark repositories #33

Open
rmarronnier opened this issue Sep 16, 2019 · 5 comments
Labels
enhancement New feature or request

Comments

@rmarronnier
Member

Preface

Evaluating the accuracy of the output of an NLP component is a science in itself.

When a new NLP algorithm, method or tool is published, it is always accompanied by benchmarks against existing systems.

Those benchmarks are produced using standard evaluation techniques and datasets.

These evaluation techniques are not always automatic.

Human judgment is sometimes necessary; in that case, there's nothing Cadmium can do to help.

However, a number of tools already exist, depending on the NLP task to be tested:

  • Precision, recall and F1 score are useful statistical metrics when evaluating classification, POS tagging, sentiment analysis, etc. (see the sketch after this list).
  • METEOR or BLEU if Cadmium ever does machine translation.
  • ROUGE for summarization evaluation.
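
As a concrete illustration, here is a minimal sketch of per-label precision, recall and F1 in plain Crystal. No Cadmium API is assumed; the method name and the toy data are purely illustrative.

```crystal
# Minimal sketch: per-label precision, recall and F1 from gold vs. predicted labels.
# Plain Crystal, no Cadmium API assumed; names and data are illustrative only.
def evaluate_label(gold : Array(String), pred : Array(String), label : String)
  tp = fp = fn = 0
  gold.zip(pred) do |g, p|
    if p == label
      g == label ? (tp += 1) : (fp += 1)
    elsif g == label
      fn += 1
    end
  end
  precision = (tp + fp).zero? ? 0.0 : tp / (tp + fp).to_f
  recall    = (tp + fn).zero? ? 0.0 : tp / (tp + fn).to_f
  f1 = (precision + recall).zero? ? 0.0 : 2 * precision * recall / (precision + recall)
  {precision: precision, recall: recall, f1: f1}
end

gold = ["NOUN", "VERB", "NOUN", "ADJ"]
pred = ["NOUN", "NOUN", "NOUN", "ADJ"]
pp evaluate_label(gold, pred, "NOUN") # precision ≈ 0.67, recall = 1.0, f1 = 0.8
```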

We can add to those tools standard datasets and corpora that are already gold-labelled and human-checked.

These are just examples found after a cursory search. The list is longer, and the tools are improving quickly.

Details

The main idea of this proposal is to:

  • Create a cadmiumcr/evaluator repository.
    This module will provide the tools listed above, as well as methods to conveniently download the large gold-labelled datasets.

  • Create a cadmiumcr/benchmark repository.
    This repository will be more of a custom set of Crystal scripts that use Cadmium::Evaluator to run benchmarks against the vanilla Cadmium tools (classifiers, POS tagging, language identification, etc.) and display the results next to those of competing tools (see the sketch after this list).
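
For example, one of those benchmark scripts could look roughly like the hypothetical sketch below. The corpus shape, `tagging_accuracy` and the `Cadmium::POSTagger` call in the usage comment are assumptions, not an existing API.

```crystal
# Hypothetical sketch of a cadmiumcr/benchmark script: token-level tagging
# accuracy of any tagger (passed as a block) against gold-labelled sentences.
# The Sentence shape and the Cadmium call below are assumptions, not a real API.
alias Sentence = {tokens: Array(String), tags: Array(String)}

def tagging_accuracy(gold : Array(Sentence), &tagger : Array(String) -> Array(String)) : Float64
  correct = total = 0
  gold.each do |sentence|
    predicted = tagger.call(sentence[:tokens])
    sentence[:tags].zip(predicted) do |g, p|
      correct += 1 if g == p
      total += 1
    end
  end
  total.zero? ? 0.0 : correct / total.to_f
end

# Usage, assuming a tagger exposing a `tag(tokens) : Array(String)` method:
# accuracy = tagging_accuracy(gold_corpus) { |tokens| Cadmium::POSTagger.new.tag(tokens) }
# puts "Cadmium POS tagger token accuracy: #{(accuracy * 100).round(2)}%"
```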

The point is to give a glimpse of Cadmium's possibilities and to routinely check the accuracy of our tools (which crystal spec is not intended to do).

This proposal is mainly a braindump, as I don't intend to start working on this in the short term (I have to finish my POS tagger first!).

@watzon
Member

watzon commented Sep 16, 2019

I love the idea. Maybe we can use GitHub Actions to automate the benchmarks and evaluators whenever a repo gets a push to master.

@rmarronnier
Member Author

Yeah! If it can download a few hundred MB of datasets, then I can't see why not!
It would be fantastic if GitHub Actions could generate a JSON file to be used by d3.js on the website, or at least produce a nice SVG with the help of Graphviz... OK, I'm a dreamer 😄
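
For what it's worth, the JSON side would be trivial with Crystal's standard library. A quick sketch, where the result shape and the numbers are made up:

```crystal
# Sketch: dump benchmark results to JSON so the website (d3.js) can pick them up.
# Only the standard library is used; the result shape and numbers are made up.
require "json"

results = [
  {tool: "cadmium_classifier", task: "sentiment analysis", f1: 0.81},
  {tool: "some_competing_tool", task: "sentiment analysis", f1: 0.84},
]

File.write("benchmarks.json", results.to_json)
```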

@watzon
Member

watzon commented Sep 16, 2019

There's no reason it couldn't. All you need is a Docker container that can do it.

@rmarronnier
Member Author

Yeah, you're right. One thing to keep in mind: each job in a workflow can run for up to 6 hours of execution time (see the GitHub Actions usage limits).

@watzon
Member

watzon commented Sep 16, 2019

6 hours is insane. I doubt we'll even come close.

@watzon watzon transferred this issue from another repository Nov 7, 2019
@watzon watzon added the enhancement, invalid and in progress labels and removed the invalid and in progress labels Nov 7, 2019