Releases · huggingface/evaluate
v0.4.2
What's Changed
- Update the documentation and citation of mauve by @krishnap25 in #416
- Remove unused dependency by @daskol in #507
- Add confusion matrix by @osanseviero in #528
- Update python to 3.8 by @qubvel in #571
- Fix FileFreeLock by @lhoestq in #578
- Fix example doc in load function by @alexrs in #575
- Speeding up mean_iou metric computation by @qubvel in #569
New Contributors
- @rtrompier made their first contribution in #510
- @daskol made their first contribution in #507
- @qubvel made their first contribution in #571
- @alexrs made their first contribution in #575
Full Changelog: v0.4.1...v0.4.2
v0.4.1
What's Changed
- Add code example to docstrings by @stevhliu in #374
- [Minor fix] Typo by @cakiki in #403
- [Docs] fixed a typo in bertscore readme by @hazrulakmal in #386
- Add max_length kwarg to docstring of Perplexity measurement by @kdutia in #411
- Fix minor typo in a_quick_tour.mdx by @tupini07 in #417
- Fix Docs base_evaluator.mdx by @jorahn in #418
- Update Gradio description to clarify text-based input by @BramVanroy in #427
- fix `add` method by @hazrulakmal in #424
- Fix broken link in docs/a_quick_tour.mdx by @tupini07 in #419
- resolve #379 audio classification evaluator + docs by @Plutone11011 in #405
- fixed kwargs not being passed in combine by @Plutone11011 in #425
- add r^2 metric by @TKaanKoc in #407
- Update spaces gradio version to 3.19.1 by @BramVanroy in #426
- replace evaluate DownloadConfig with datasets by @lvwerra in #447
- Render Text2TextGenerationEvaluators' docstring examples by @mariosasko in #463
- Trigger CI on ci-* branches by @Wauplin in #467
- Update comet by @ricardorei in #443
- Fix `datasets` import in Meteor metric by @mariosasko in #490
- fix scikit-learn package name suggestion by @bzz in #498
- Release: 0.4.1 by @lhoestq in #505
New Contributors
- @cakiki made their first contribution in #403
- @hazrulakmal made their first contribution in #386
- @kdutia made their first contribution in #411
- @tupini07 made their first contribution in #417
- @jorahn made their first contribution in #418
- @Plutone11011 made their first contribution in #405
- @TKaanKoc made their first contribution in #407
- @mariosasko made their first contribution in #463
- @Wauplin made their first contribution in #467
- @ricardorei made their first contribution in #443
- @bzz made their first contribution in #498
- @lhoestq made their first contribution in #505
Full Changelog: v0.4.0...v0.4.1
v0.4.0
What's Changed
- add trainer integration docs by @lvwerra in #325
- Stop using model-defined truncation in perplexity calculation by @mathemakitten in #333
- Don't use eval for Evaluator instances in the doc by @fxmarty in #341
- fix caching by @lvwerra in #336
- Fix #327 set default row of gradio webui to 1 and drop empty/blank row by @Raibows in #335
- Update pr docs actions by @mishig25 in #344
- Fix `scikit-learn` install in spaces by @lvwerra in #345
- added MASE, sMAPE and MAPE metrics by @kashif in #330
- fix sklearn dependency in mape, mase and smape by @lvwerra in #346
- Update link text by @stevhliu in #360
- Corrected range of MAE by @clefourrier in #359
- Revert "Update pr docs actions" by @mishig25 in #363
- Evaluation suite by @mathemakitten in #337
- Matthews correlation coefficient by @sanderland in #362
- fix tf version by @lvwerra in #372
- Add TextGeneration Evaluator by @NimaBoscarino in #350
- Fix typo in rouge types by @davebulaval in #364
- Add `Evaluate` usage for `scikit-learn` by @awinml in #368
- Adding metric visualization by @sashavor in #342
- Add NIST metric by @BramVanroy in #250
- add GitHub Actions CI by @lvwerra in #375
- Add Evaluate Usage for Keras and Tensorflow by @arjunpatel7 in #370
- fix version by @lvwerra in #380
- CharacTER: MT metric by @BramVanroy in #286
- CharCut: another character-based MT evaluation metric by @BramVanroy in #290
- asr model evaluator addition + doc by @bayartsogt-ya in #378
- Docs for EvaluationSuite by @mathemakitten in #340
- Update the documentation of Mauve by @krishnap25 in #377
- fix-ci-badge by @lvwerra in #385
New Contributors
- @Raibows made their first contribution in #335
- @kashif made their first contribution in #330
- @clefourrier made their first contribution in #359
- @davebulaval made their first contribution in #364
- @awinml made their first contribution in #368
- @arjunpatel7 made their first contribution in #370
- @bayartsogt-ya made their first contribution in #378
- @krishnap25 made their first contribution in #377
Full Changelog: v0.3.0...v0.4.0
v0.3.0
What's Changed
- add multilabel f1 eval usage by @fcakyon in #221
- Force get_supported_tasks() to return a list instead of dict keys by @mathemakitten in #227
- Unpin rouge_score by @albertvillanova in #220
- Remove import statement in Measurement Card by @meg-huggingface in #231
- make rouge support multi-ref by @lvwerra in #229
- Fix enforce string by @lvwerra in #230
- Fix examples in perplexity measurement docs by @mathemakitten in #238
- Add Wilcoxon's signed rank test by @douwekiela in #237
- Add support for two input columns for TextClassificationEvaluator by @fxmarty in #205
- fix bug in TEMPLATE_REQUIRE: add comma by @BramVanroy in #248
- Minor quicktour doc suggestions by @stevhliu in #236
- Clarify error message for ChrF no. references by @BramVanroy in #247
- only track unique missing dependencies by @BramVanroy in #246
- Update evaluate in spaces by @lvwerra in #228
- add `commit_hash` to args by @lvwerra in #253
- Change perplexity to be calculated with base e by @mathemakitten in #242
- Rebase for previous PR by @mathemakitten in #254
- Fix docstrings with new perplexities with base e by @mathemakitten in #255
- add a tokenizer option to rouge by @lvwerra in #258
- Adding list_duplicates=True to example. by @meg-huggingface in #263
- Minor change in describing what this does. by @meg-huggingface in #267
- Mapping example output to returned output. by @meg-huggingface in #268
- Changes "duplicates_list" to "duplicates_dict" (since it's dict) by @meg-huggingface in #265
- Changes "duplicates_list" to "duplicates_dict" in the example. by @meg-huggingface in #264
- Add slow flag to two column parity test by @lvwerra in #273
- Remove `handle_impossible_answer` from the default `PIPELINE_KWARGS` in the question answering evaluator by @fxmarty in #272
- Toxicity Measurement by @sashavor in #262
- Automatically choose dataset split if none provided by @mathemakitten in #232
- Fix YAML in Toxicity by @lvwerra in #278
- Added metric Brier Score by @kadirnar in #275
- Check for mismatch in device setup in evaluator by @mathemakitten in #287
- Fix transformers import in the evaluator by @mathemakitten in #291
- Add support for name field when loading data by @mathemakitten in #283
- Adding regard measurement by @sashavor in #271
- Raise exception instead of assert in BertScore by @BramVanroy in #292
- fix regard yaml by @lvwerra in #295
- Add CONTRIBUTING.md by @mathemakitten in #293
- Refactor kwargs and configs by @lvwerra in #188
- Revert "Refactor kwargs and configs" by @lvwerra in #299
- Add missing `split` and `subset` kwargs into other evaluators by @mathemakitten in #301
- Adding HONEST score by @sashavor in #279
- fix wrong sorting in check by @sanderland in #305
- Fix HONEST yaml by @lvwerra in #303
- Refactor current_features to selected_feature_format by @mathemakitten in #306
- replace datasets list with local list of tasks by @lvwerra in #309
- Adding torch to the requirements by @sashavor in #311
- Honest space fix by @sashavor in #312
- Use HTML relative paths for tiles by @lewtun in #318
- Test for valid YAML files by @mathemakitten in #308
- add versioning to the `HubEvaluationModuleFactory` by @lvwerra in #314
- Add text2text evaluator by @lvwerra in #261
- try main if tag does not work by @lvwerra in #322
New Contributors
- @fcakyon made their first contribution in #221
- @meg-huggingface made their first contribution in #231
- @stevhliu made their first contribution in #236
- @kadirnar made their first contribution in #275
- @sanderland made their first contribution in #305
Full Changelog: v0.2.2...v0.3.0
v0.2.2
v0.2.1
What's Changed
- Add measurements to quality and style checks by @lvwerra in #203
- Add comparisons and measurements to code quality tests by @lvwerra in #204
- Remove mention to datasets from docs by @albertvillanova in #207
- Adding label distribution measurement by @sashavor in #202
- Fix spaces tagging by @lvwerra in #217
- set datasets to >=2.0.0 by @lvwerra in #216
Full Changelog: v0.2.0...v0.2.1
v0.2.0
What's New
`evaluator`
The `evaluator` has been extended to three new tasks:
- "image-classification"
- "token-classification"
- "question-answering"
`combine`
With `combine` one can bundle several metrics into a single object that can be evaluated in one call and also used in combination with the `evaluator`.
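For example, a minimal sketch of bundling several classification metrics:

```python
import evaluate

# Bundle several metrics into a single object and compute them in one call.
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])

results = clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])
print(results)  # one dictionary with a key per bundled metric
```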
What's Changed
- Fix typo in WER docs by @pn11 in #147
- Fix rouge outputs by @lvwerra in #158
- add tutorial for custom pipeline by @lvwerra in #154
- refactor `evaluator` tests by @lvwerra in #155
- rename `input_texts` to `predictions` in perplexity by @lvwerra in #157
- Add link to GitHub author by @lewtun in #166
- Add `combine` to compose multiple evaluations by @lvwerra in #150
- test string casting only on first element by @lvwerra in #159
- remove unused fixtures from unittests by @lvwerra in #170
- Add a test to check that Evaluator evaluations match transformers examples by @fxmarty in #163
- Add smaller model for `TextClassificationEvaluator` test by @fxmarty in #172
- Add tags to spaces by @lvwerra in #162
- Rename evaluation modules by @lvwerra in #160
- Update push_evaluations_to_hub.py by @lvwerra in #174
- update evaluate dependency for spaces by @lvwerra in #175
- Add `ImageClassificationEvaluator` by @fxmarty in #173
- attempting to let meteor handle multiple references per prediction by @sashavor in #164
- fixed duplicate calculation of spearmanr function in metrics wrapper. by @benlipkin in #176
- forbid hyphens in template for module names by @lvwerra in #177
- switch from Github to Hub module factory for canonical modules by @lvwerra in #180
- Fix bertscore idf by @lvwerra in #183
- refactor evaluator base and task classes by @lvwerra in #185
- Avoid importing tensorflow when importing evaluate by @NouamaneTazi in #135
- Add QuestionAnsweringEvaluator by @fxmarty in #179
- Evaluator perf by @ola13 in #178
- Fix QuestionAnsweringEvaluator for squad v2, fix examples by @fxmarty in #190
- Rename perf metric evaluator by @lvwerra in #191
- Fix typos in QA Evaluator by @lewtun in #192
- Evaluator device placement by @lvwerra in #193
- Change test command in installation.mdx to use exact_match by @mathemakitten in #194
- Add `TokenClassificationEvaluator` by @fxmarty in #167
- Pin rouge_score by @albertvillanova in #197
- add poseval by @lvwerra in #195
- Combine docs by @lvwerra in #201
- Evaluator column loading by @lvwerra in #200
- Evaluator documentation by @lvwerra in #199
New Contributors
- @pn11 made their first contribution in #147
- @fxmarty made their first contribution in #163
- @benlipkin made their first contribution in #176
- @NouamaneTazi made their first contribution in #135
- @mathemakitten made their first contribution in #194
Full Changelog: v0.1.2...v0.2.0
v0.1.2
What's Changed
- Fix trec sacrebleu by @lvwerra in #130
- Add distilled version Cometihno by @BramVanroy in #131
- fix: add yaml extension to github action for release by @lvwerra in #133
- fix docs badge by @lvwerra in #134
- fix cookiecutter path to repository by @lvwerra in #139
- docs: make metric cards more prominent by @lvwerra in #132
- Update README.md by @sashavor in #145
- Fix datasets download imports by @albertvillanova in #143
New Contributors
- @BramVanroy made their first contribution in #131
- @albertvillanova made their first contribution in #143
Full Changelog: v0.1.1...v0.1.2
v0.1.1
What's Changed
- Fix broken links by @mishig25 in #92
- Fix readme by @lvwerra in #98
- Fixing broken evaluate-measurement hub link by @panwarnaveen9 in #102
- fix typo in autodoc by @manueldeprada in #101
- fix typo by @manueldeprada in #100
- FIX `pip install evaluate[evaluator]` by @philschmid in #103
- fix description field in metric template readme by @lvwerra in #122
- Add automatic pypi release for evaluate by @osanseviero in #121
- Fix typos in Evaluator docstrings by @lewtun in #124
- Fix spaces description in metadata by @lvwerra in #123
- fix revision string if it is a python version by @lvwerra in #129
- Use accuracy as default metric for text classification Evaluator by @lewtun in #128
- bump `evaluate` dependency in spaces by @lvwerra in #88
New Contributors
- @panwarnaveen9 made their first contribution in #102
- @manueldeprada made their first contribution in #101
- @philschmid made their first contribution in #103
- @osanseviero made their first contribution in #121
- @lewtun made their first contribution in #124
Full Changelog: v0.1.0...v0.1.1
Initial release of `evaluate`
Release notes
These are the release notes of the initial release of the Evaluate library.
Goals
Goals of the Evaluate library:
- reproducibility: reporting and reproducing results is easy
- ease-of-use: access to a wide range of evaluation tools with a unified interface
- diversity: provide a wide range of evaluation tools with metrics, comparisons, and measurements
- multimodal: models and datasets of many modalities can be evaluated
- community-driven: anybody can add custom evaluations by hosting them on the Hugging Face Hub
Release overview:
`evaluate.load()`: The `load()` function is the main entry point into `evaluate` and allows loading evaluation modules from a local folder, the evaluate repository, or the Hugging Face Hub. It downloads, caches, and loads the evaluation module and returns an `evaluate.EvaluationModule`.
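For instance, a minimal sketch of loading a metric and computing a score:

```python
import evaluate

# Load the accuracy metric from the Hub (or a local module) and compute a score.
accuracy = evaluate.load("accuracy")
print(accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0]))
# {'accuracy': 0.75}
```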
`evaluate.save()`: With `save()` a user can save evaluation results in a JSON file. In addition to the results from `evaluate.EvaluationModule`, it can save additional parameters and automatically saves the timestamp, git commit hash, library version, as well as the Python path. One can either provide a directory for the results, in which case file names are created automatically, or an explicit file name for the result.
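A small sketch of saving results to a directory; the `experiment` key is an arbitrary extra parameter used for illustration:

```python
import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0])

# Writes a timestamped JSON file into ./results/ containing the scores,
# the extra parameters, and the environment metadata described above.
evaluate.save("./results/", experiment="run-1", **result)
```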
`evaluate.push_to_hub()`: The `push_to_hub` function allows pushing the results of a model evaluation to the model card on the Hugging Face Hub. The model, dataset, and metric are specified such that they can be linked on the Hub.
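A hedged sketch of reporting a result to a model card; the model and dataset identifiers and the score below are placeholders, and the exact set of accepted arguments should be checked against the API reference:

```python
import evaluate

evaluate.push_to_hub(
    model_id="my-username/my-model",   # placeholder model repository
    task_type="text-classification",
    dataset_type="imdb",               # placeholder dataset
    dataset_name="IMDb",
    dataset_split="test",
    metric_type="accuracy",
    metric_name="Accuracy",
    metric_value=0.91,                 # placeholder score
)
```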
`evaluate.EvaluationModule`: The `EvaluationModule` class is the base class for all evaluation modules. There are three module types: metrics (to evaluate models), comparisons (to compare models), and measurements (to analyze datasets). The inputs can either be added with `add` (single input) and `add_batch` (batch of inputs) followed by a final `compute` call to compute the scores, or all inputs can be passed to `compute` directly. Under the hood, Apache Arrow stores and loads the input data to compute the scores.
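The two input styles look roughly like this (a sketch using the accuracy metric):

```python
import evaluate

accuracy = evaluate.load("accuracy")

# Incremental: add single inputs, then compute once at the end.
for ref, pred in zip([0, 1, 1, 0], [0, 1, 0, 0]):
    accuracy.add(references=ref, predictions=pred)
print(accuracy.compute())

# Batched: add batches of inputs before the final compute call.
accuracy.add_batch(references=[0, 1], predictions=[0, 1])
accuracy.add_batch(references=[1, 0], predictions=[0, 0])
print(accuracy.compute())

# Or pass everything to compute directly.
print(accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0]))
```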
`evaluate.EvaluationModuleInfo`: The `EvaluationModuleInfo` class is used to store the module's attributes:
- `description`: A short description of the evaluation module.
- `citation`: A BibTeX string for citation when available.
- `features`: A `Features` object defining the input format. The inputs provided to `add`, `add_batch`, and `compute` are tested against these types and an error is thrown in case of a mismatch.
- `inputs_description`: This is equivalent to the module's docstring.
- `homepage`: The homepage of the module.
- `license`: The license of the module.
- `codebase_urls`: Links to the code behind the module.
- `reference_urls`: Additional reference URLs.
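These attributes can be inspected on a loaded module, for example:

```python
import evaluate

accuracy = evaluate.load("accuracy")

# The info attributes are exposed on the loaded module.
print(accuracy.description)
print(accuracy.citation)
print(accuracy.features)
```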
`evaluate.evaluator`: The `evaluator` provides automated evaluation and only requires a model, a dataset, and a metric, in contrast to the metrics in `EvaluationModule`, which require model predictions. It has three main components: a model wrapped in a pipeline, a dataset, and a metric, and it returns the computed evaluation scores. Besides the three main components, it may also require two mappings to align the columns in the dataset and the pipeline labels with the dataset's labels. This is an experimental feature -- currently, only text classification is supported.
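A minimal sketch of the experimental text classification evaluator; the checkpoint and dataset slice are placeholders, and `label_mapping` is the mapping between pipeline labels and dataset labels mentioned above:

```python
from datasets import load_dataset
from evaluate import evaluator

task_evaluator = evaluator("text-classification")

results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder checkpoint
    data=load_dataset("imdb", split="test[:100]"),                        # placeholder dataset slice
    metric="accuracy",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # align pipeline labels with dataset labels
)
print(results)
```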
`evaluate-cli`: The community can add custom metrics by adding the necessary module script to a Space on the Hugging Face Hub. The `evaluate-cli` is a tool that simplifies this process by creating the Space, populating a template, and pushing it to the Hub. It also provides instructions to customize the template and integrate custom logic.
Main contributors:
@lvwerra, @sashavor, @NimaBoscarino, @ola13, @osanseviero, @lhoestq, @lewtun, @douwekiela