Releases: aphp/edsnlp

v0.13.0

22 Jul 16:26

Changelog

Added

  • data.set_processing(...) now exposes an autocast parameter to disable or tweak the automatic casting of tensors
    during processing (see the sketch after this list). Autocasting should result in a slight speedup, but may lead to numerical instability.
  • Use torch.inference_mode to disable view tracking and version counter bumps during inference.
  • Added a new NER pipeline for suicide attempt detection
  • Added date cues (regular expression matches that contributed to a date being detected) under the extension ent._.date_cues
  • Added table processing in eds.measurement
  • Added 'all' as a possible input in the eds.measurement measurements config
  • Added new units in eds.measurement
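
A minimal sketch of the new autocast option, following the document's own from_xxx placeholder for the data source; whether a specific dtype can be passed in addition to a boolean is not confirmed here:

import edsnlp

nlp = ...  # a pipeline containing deep-learning components
data = edsnlp.data.from_xxx(...)
data = data.map_pipeline(nlp)
# autocast=False disables the automatic casting of tensors during processing
data = data.set_processing(autocast=False)
data.to_pandas()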

Changed

  • Default to mixed precision inference

Fixed

  • edsnlp.load("your/huggingface-model", install_dependencies=True) now correctly resolves the Python pip executable
    (especially on Colab) to auto-install the model dependencies
  • We now better handle empty documents in the eds.transformer, eds.text_cnn and eds.ner_crf components
  • Support mixed precision in eds.text_cnn and eds.ner_crf components
  • Support transformers versions that predate quantization support (< 4.30)
  • Verify that all batches are non-empty
  • Fix span_context_getter for context_words = 0, context_sents > 2 and support asymmetric contexts
  • Don't split sentences on rare unicode symbols
  • Better detection of abbreviations like E.coli, which is now tokenized as [E., coli] and not [E, ., coli]

Full Changelog: v0.12.3...v0.13.0

v0.12.3

17 Jun 09:49

Fix model loading messages

v0.12.2

16 Jun 23:37

Changelog

Changed

Packages:

  • Pip-installable models are now built with hatch instead of poetry, which allows us to expose artifacts (weights)
    at the root of the sdist package (uploadable to HF) and move them inside the package upon installation to avoid conflicts.
  • Dependencies are no longer inferred with dill-magic (this didn't work well before anyway)
  • Option to perform substitutions in the model's README.md file (e.g., for the model's name, metrics, ...)
  • Hugging Face models are now installed as pip editable installations, which is faster since the weights are not copied around

Full Changelog: v0.12.1...v0.12.2

v0.12.1

05 Jun 12:36

Changelog

Added

  • Added binary distribution for linux aarch64 (Streamlit's environment)
  • Added a new separator option in eds.table and a new input check

Fixed

  • Make catalogue & entry points compatible with Python 3.7 through 3.12
  • Check that the data has a doc before trying to use the document's note_datetime

Full Changelog: v0.12.0...v0.12.1

v0.12.0

21 May 23:27

Changelog

Added

  • The eds.transformer component now accepts prompts (passed to its preprocess method, see breaking change below) to add before each window of text to embed.
  • LazyCollection.map / map_batches now support generator functions as arguments.
  • Window stride can now be disabled (i.e., stride = window) during training in the eds.transformer component by setting training_stride = False
  • Added a new eds.ner_overlap_scorer to evaluate matches between two lists of entities, counting a true positive when the Dice overlap is above a given threshold
  • edsnlp.load now accepts EDS-NLP models from the huggingface hub 🤗 (see the sketch after this list)!
  • New python -m edsnlp.package command to package a model for the huggingface hub or pypi-like registries
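
A minimal sketch of loading a model from the hub; the repository id below is a placeholder, not an actual published model:

import edsnlp

# "your-username/your-model" stands in for a model packaged with the new
# `python -m edsnlp.package` command and pushed to the Hugging Face hub
nlp = edsnlp.load("your-username/your-model")
doc = nlp("Le patient est admis pour une pneumopathie.")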

Changed

  • Trainable embedding components now all use foldedtensor to return embeddings, instead of returning a tensor of floats and a mask tensor.
  • 💥 TorchComponent __call__ no longer applies the end-to-end method, and instead calls the forward method directly, like all torch modules.
  • The trainable eds.span_qualifier component has been renamed to eds.span_classifier to reflect its general purpose (it doesn't only predict qualifiers, but any attribute of a span using its context or not).
  • omop converter now takes the note_datetime field into account by default when building a document
  • span._.date.to_datetime() and span._.date.to_duration() now automatically take the note_datetime into account
  • nlp.vocab is no longer serialized when saving a model, as it may contain sensitive information and can be recomputed during inference anyway
  • 💥 Major breaking change in trainable components, moving towards a more "task-centric" design:
    • the eds.transformer component is no longer responsible for deciding which spans of text ("contexts") should be embedded. These contexts are now passed via the preprocess method, which now accepts more arguments than just the docs to process.
    • similarly, the eds.span_pooler is no longer responsible for deciding which spans to pool, and instead pools all spans passed to it in the preprocess method.

Consequently, the eds.transformer and eds.span_pooler no longer accept their span_getter argument, and the eds.ner_crf, eds.span_classifier, eds.span_linker and eds.span_qualifier components now accept a context_getter argument instead, as well as a span_getter argument for the latter two. This refactoring can be summarized as follows:

- eds.transformer.span_getter
+ eds.ner_crf.context_getter
+ eds.span_classifier.context_getter
+ eds.span_linker.context_getter

- eds.span_pooler.span_getter
+ eds.span_qualifier.span_getter
+ eds.span_linker.span_getter

and as an example for the eds.span_linker component:

nlp.add_pipe(
    eds.span_linker(
        metric="cosine",
        probability_mode="sigmoid",
+       span_getter="ents",
+       # context_getter="ents",  -> by default, same as span_getter
        embedding=eds.span_pooler(
            hidden_size=128,
-           span_getter="ents",
            embedding=eds.transformer(
-               span_getter="ents",
                model="prajjwal1/bert-tiny",
                window=128,
                stride=96,
            ),
        ),
    ),
    name="linker",
)

Fixed

  • edsnlp.data.read_json now correctly reads the files from the directory passed as an argument, and not from the parent directory.
  • Overwrite spacy's Doc, Span and Token pickling utils to allow recursively storing Doc, Span and Token objects in the extension values (in particular, span._.date.doc)
  • Removed pendulum dependency, solving various pickling, multiprocessing and missing attributes errors

Full Changelog: v0.11.2...v0.12.0

v0.11.2

10 Apr 14:55

Changelog

Fixed

  • Fix incorrect file system detection in edsnlp.utils.file_system.normalize_fs_path
  • Improved performance of edsnlp.data methods over a filesystem (fs parameter)

Full Changelog: v0.11.1...v0.11.2

v0.11.1

02 Apr 07:54

Changelog

Added

  • Automatic estimation of the CPU count when using multiprocessing
  • optim.initialize() method to create optim state before the first backward pass

Changed

  • nlp.post_init will not tee lazy collections anymore (use edsnlp.utils.collections.multi_tee yourself if needed)

Fixed

  • Corrected inconsistencies in eds.span_linker

Full Changelog: v0.11.0...v0.11.1

v0.11.0

29 Mar 17:38

Changelog

Added

  • Support for a filesystem parameter in every edsnlp.data.read_* and edsnlp.data.write_* functions

  • Pipes of a pipeline are now easily accessible with nlp.pipes.xxx instead of nlp.get_pipe("xxx")

  • Support builtin Span attributes in converters span_attributes parameter, e.g.

    import edsnlp
    
    nlp = ...
    nlp.add_pipe("eds.sentences")
    
    data = edsnlp.data.from_xxx(...)
    data = data.map_pipeline(nlp)
    data.to_pandas(converters={"ents": {"span_attributes": ["sent.text", "start", "end"]}})
  • Support assigning Brat AnnotatorNotes as span attributes: edsnlp.data.read_standoff(..., notes_as_span_attribute="cui")

  • Support for mapping full batches in edsnlp.processing pipelines with map_batches lazy collection method:

    import edsnlp
    
    data = edsnlp.data.from_xxx(...)
    data = data.map_batches(lambda batch: do_something(batch))
    data.to_pandas()
  • New data.map_gpu method to map a deep learning operation on some data and take advantage of edsnlp multi-gpu inference capabilities

  • Added average precision computation in edsnlp span_classification scorer

  • You can now add pipes to your pipeline by instantiating them directly, which comes with many advantages, such as auto-completion, introspection and type checking!

    import edsnlp, edsnlp.pipes as eds
    
    nlp = edsnlp.blank("eds")
    nlp.add_pipe(eds.sentences())
    # instead of nlp.add_pipe("eds.sentences")

    The previous way of adding pipes is still supported.

  • New eds.span_linker deep-learning component to match entities with their concepts in a knowledge base, in synonym-similarity or concept-similarity mode.

Changed

  • nlp.preprocess_many now uses lazy collections to enable parallel processing
  • ⚠️ Breaking change. Improved and simplified eds.span_qualifier: we didn't support combination groups before, so this feature was scrapped for now. We now also support splitting values of a single qualifier between different span labels.
  • Optimized edsnlp.data batching, especially for large batch sizes (removed a quadratic loop)
  • ⚠️ Breaking change. By default, the name of components added to a pipeline is now the default name defined in their class __init__ signature. For most components of EDS-NLP, this will change the name from "eds.xxx" to "xxx".

Fixed

  • Flatten list outputs (such as "ents" converter) when iterating: nlp.map(data).to_iterable("ents") is now a list of entities, and not a list of lists of entities
  • Allow span pooler to choose between multiple base embedding spans (as likely produced by eds.transformer) by sorting them by Dice overlap score.
  • EDS-NLP does not raise an error anymore when saving a model to an already existing, but empty directory

Full Changelog: v0.10.7...v0.11.0

v0.10.7

12 Mar 21:33

Changelog

Added

  • Support an empty converter (now the default) in edsnlp.data writers, i.e., documents are no longer converted by default
  • Add support for polars data import / export (see the sketch after this list)
  • Allow kwargs in eds.transformer to pass to the transformer model
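
A minimal sketch of the polars support; the from_polars / to_polars names follow the library's from_xxx / to_xxx pattern, and the "omop" converter and column names are assumptions here:

import edsnlp
import polars as pl

nlp = ...  # an existing pipeline

# hypothetical input: one note per row
df = pl.DataFrame({"note_id": [1], "note_text": ["Patient admis pour une angine."]})

data = edsnlp.data.from_polars(df, converter="omop")
data = data.map_pipeline(nlp)
out = data.to_polars(converter="omop")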

Changed

  • Saving pipelines no longer saves the disabled status of the pipes (i.e., all pipes are considered "enabled" when saved). This feature was not used and was causing issues when saving a model wrapped in an nlp.select_pipes context.

Fixed

  • Allow missing meta.json, tokenizer and vocab paths when loading saved models
  • Save torch buffers when dumping machine learning models to disk (previous versions only saved the model parameters)
  • Fix automatic batch_size estimation in eds.transformer when max_tokens_per_device is set to auto and multiple GPUs are used
  • Fix JSONL file parsing

Full Changelog: v0.10.6...v0.10.7

v0.10.6

24 Feb 23:34

What's Changed

Added

  • Added batch_by, split_into_batches_after, sort_chunks, chunk_size and disable_implicit_parallelism parameters to the processing (simple and multiprocessing)
    backends to improve performance and memory usage (see the sketch after this list). Sorting chunks can yield up to a 2x speedup in some cases.
  • The deep learning cache mechanism now supports multitask models with weight sharing in multiprocessing mode.
  • Added max_tokens_per_device="auto" parameter to eds.transformer to estimate memory usage and automatically split the input into chunks that fit into the GPU.
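
A minimal sketch of the new performance parameters, assuming they are passed through the lazy collection's set_processing method like the other backend options; the values shown are only illustrative:

import edsnlp

nlp = ...  # a pipeline containing deep-learning components
data = edsnlp.data.from_xxx(...)
data = data.map_pipeline(nlp)
data = data.set_processing(
    backend="multiprocessing",
    sort_chunks=True,   # sort documents inside each chunk to reduce padding
    chunk_size=1024,    # number of documents per chunk
    batch_by="words",   # assumption: batch by word count rather than document count
)
docs = list(data)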

Changed

  • Improved speed and memory usage of the eds.text_cnn pipe by running the CNN on a non-padded version of its input: expect a speedup of up to 1.3x in real-world use cases.
  • Deprecate the converters' bool_attributes parameter (especially for BRAT/Standoff data)
    in favor of a general default_attributes mapping. This new mapping describes how to
    set attributes on spans for which no attribute value was found in the input format.
    This is especially useful for negation or other frequent attribute values (e.g., "negated"
    is often False, "temporal" is often "present") that annotators may not want to
    annotate every time (see the sketch after this list).
  • The default eds.ner_crf window is now set to 40 and the stride to 20, as this doesn't
    affect throughput (compared to the previous window of 20) and improves accuracy.
  • New default overlap_policy='merge' option and parameter renaming in
    eds.span_context_getter (which replaces eds.span_sentence_getter)
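
A minimal sketch of the default_attributes mapping, assuming it can be passed directly to the standoff reader; the path is a placeholder and the attribute values mirror the examples above:

import edsnlp

data = edsnlp.data.read_standoff(
    "path/to/brat/dataset",  # placeholder path
    # spans with no annotated value for these attributes get the defaults below
    default_attributes={"negated": False, "temporal": "present"},
)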

Fixed

  • Improved error handling in multiprocessing backend (e.g., no more deadlock)
  • Various improvements to the data processing related documentation pages
  • Begin-of-sentence / end-of-sentence transitions of the eds.ner_crf component are now
    disabled when windows are used (i.e., neither window=1, equivalent to a softmax, nor
    window=0, equivalent to the default full-sequence Viterbi decoding)
  • The eds tokenizer now inherits from spacy.Tokenizer to avoid typing errors
  • Only match the 'ne' negation pattern when it is not part of another word, to avoid false positive cases like u[ne] cure de 10 jours
  • Disabled pipes are now correctly ignored in the Pipeline.preprocess method
  • Add "eventuel*" patterns to eds.hyphothesis

Full Changelog: v0.10.5...v0.10.6