Releases: aphp/edsnlp
v0.13.0
Changelog
Added
- `data.set_processing(...)` now exposes an `autocast` parameter to disable or tweak the automatic casting of tensors during processing. Autocasting should result in a slight speedup, but may lead to numerical instability.
- Use `torch.inference_mode` to disable view tracking and version counter bumps during inference
- Added a new NER pipeline for suicide attempt detection
- Added date cues (regular expression matches that contributed to a date being detected) under the extension `ent._.date_cues`
- Added table processing in `eds.measurement`
- Added 'all' as a possible input in the `eds.measurement` measurements config
- Added new units in `eds.measurement`
Changed
- Default to mixed precision inference
Fixed
- `edsnlp.load("your/huggingface-model", install_dependencies=True)` now correctly resolves the Python pip executable (especially on Colab) to auto-install the model dependencies
- We now better handle empty documents in the `eds.transformer`, `eds.text_cnn` and `eds.ner_crf` components
- Support mixed precision in the `eds.text_cnn` and `eds.ner_crf` components
- Support pre-quantization (<4.30) transformers versions
- Verify that all batches are non-empty
- Fix `span_context_getter` for `context_words` = 0, `context_sents` > 2, and support asymmetric contexts
- Don't split sentences on rare unicode symbols
- Better detect abbreviations, like `E.coli`, now split as [`E.`, `coli`] and not [`E`, `.`, `coli`]
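The abbreviation fix can be illustrated with a small standalone sketch (this is not EDS-NLP's actual tokenizer; the regex is a hypothetical approximation of the intended behaviour): a single capital letter followed by a period and a lowercase word is treated as an abbreviation, so the period stays attached to the letter.

```python
import re

# Hypothetical approximation: a single capital letter followed by "." and a
# lowercase word is an abbreviation, so "E." stays one token before "coli",
# instead of the period being split off on its own.
TOKEN_RE = re.compile(r"[A-Z]\.(?=[a-z])|\w+|[^\w\s]")

def tokenize(text):
    """Split text into tokens, keeping abbreviation periods attached."""
    return TOKEN_RE.findall(text)

print(tokenize("E.coli"))   # ['E.', 'coli']
print(tokenize("E coli."))  # ['E', 'coli', '.']
```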
What's Changed
- Various ml fixes by @percevalw in #303
- TS by @aricohen93 in #269
- date cues by @cvinot in #265
- Fix fast inference by @percevalw in #305
- Fix typo in diabetes patterns by @isabelbt in #306
- Fix span context getter by @aricohen93 in #307
- Fix sentences by @percevalw in #310
- chore: bump version to 0.13.0 by @percevalw in #312
Full Changelog: v0.12.3...v0.13.0
v0.12.3
v0.12.2
Changelog
Changed
Packages:
- Pip-installable models are now built with `hatch` instead of poetry, which allows us to expose `artifacts` (weights) at the root of the sdist package (uploadable to HF) and move them inside the package upon installation to avoid conflicts
- Dependencies are no longer inferred with dill-magic (this didn't work well before anyway)
- Option to perform substitutions in the model's README.md file (e.g., for the model's name, metrics, ...)
- Huggingface models are now installed with pip editable installations, which is faster since it doesn't copy around the weights
What's Changed
- Better packages by @percevalw in #302
Full Changelog: v0.12.1...v0.12.2
v0.12.1
Changelog
Added
- Added binary distribution for linux aarch64 (Streamlit's environment)
- Added new separator option in eds.table and new input check
Fixed
- Make catalogue & entrypoints compatible with py37-py312
- Check that a data item has a doc before trying to use the document's `note_datetime`
Pull Requests
- Fix catalogue entrypoints by @percevalw in #297
- Adding sep_pattern in eds.tables docstring by @svittoz in #286
- chore: bump version to 0.12.1 by @percevalw in #300
Full Changelog: v0.12.0...v0.12.1
v0.12.0
Changelog
Added
- The `eds.transformer` component now accepts `prompts` (passed to its `preprocess` method, see breaking change below) to add before each window of text to embed
- `LazyCollection.map` / `map_batches` now support generator functions as arguments
- Window stride can now be disabled (i.e., stride = window) during training in the `eds.transformer` component with `training_stride = False`
- Added a new `eds.ner_overlap_scorer` to evaluate matches between two lists of entities, counting true when the Dice overlap is above a given threshold
- `edsnlp.load` now accepts EDS-NLP models from the huggingface hub 🤗 !
- New `python -m edsnlp.package` command to package a model for the huggingface hub or pypi-like registries
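The matching criterion mentioned for `eds.ner_overlap_scorer` can be illustrated with a standalone sketch (not the component's actual implementation; `dice_overlap` and `count_matches` are hypothetical helpers): two token ranges match when their Dice coefficient exceeds the threshold.

```python
def dice_overlap(a, b):
    """Dice coefficient between two half-open token ranges (start, end)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return 2 * inter / ((a[1] - a[0]) + (b[1] - b[0]))

def count_matches(preds, golds, threshold=0.5):
    """Count predicted spans whose Dice overlap with some gold span exceeds the threshold."""
    return sum(
        any(dice_overlap(p, g) > threshold for g in golds) for p in preds
    )

print(dice_overlap((0, 4), (2, 6)))  # 0.5
print(count_matches([(0, 4), (10, 12)], [(1, 5)], threshold=0.4))  # 1
```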
Changed
- Trainable embedding components now all use `foldedtensor` to return embeddings, instead of returning a tensor of floats and a mask tensor
- 💥 `TorchComponent.__call__` no longer applies the end-to-end method, and instead calls the `forward` method directly, like all torch modules
- The trainable `eds.span_qualifier` component has been renamed to `eds.span_classifier` to reflect its general purpose (it doesn't only predict qualifiers, but any attribute of a span, using its context or not)
- The `omop` converter now takes the `note_datetime` field into account by default when building a document
- `span._.date.to_datetime()` and `span._.date.to_duration()` now automatically take the `note_datetime` into account
- `nlp.vocab` is no longer serialized when saving a model, as it may contain sensitive information and can be recomputed during inference anyway
- 💥 Major breaking change in trainable components, moving towards a more "task-centric" design:
  - the `eds.transformer` component is no longer responsible for deciding which spans of text ("contexts") should be embedded. These contexts are now passed via the `preprocess` method, which now accepts more arguments than just the docs to process.
  - similarly, the `eds.span_pooler` is no longer responsible for deciding which spans to pool, and instead pools all spans passed to it in the `preprocess` method.
  Consequently, the `eds.transformer` and `eds.span_pooler` components no longer accept their `span_getter` argument, and the `eds.ner_crf`, `eds.span_classifier`, `eds.span_linker` and `eds.span_qualifier` components now accept a `context_getter` argument instead, as well as a `span_getter` argument for the latter two. This refactoring can be summarized as follows:

  ```diff
  - eds.transformer.span_getter
  + eds.ner_crf.context_getter
  + eds.span_classifier.context_getter
  + eds.span_linker.context_getter
  - eds.span_pooler.span_getter
  + eds.span_qualifier.span_getter
  + eds.span_linker.span_getter
  ```

  and as an example for the `eds.span_linker` component:

  ```diff
  nlp.add_pipe(
      eds.span_linker(
          metric="cosine",
          probability_mode="sigmoid",
  +       span_getter="ents",
  +       # context_getter="ents", -> by default, same as span_getter
          embedding=eds.span_pooler(
              hidden_size=128,
  -           span_getter="ents",
              embedding=eds.transformer(
  -               span_getter="ents",
                  model="prajjwal1/bert-tiny",
                  window=128,
                  stride=96,
              ),
          ),
      ),
      name="linker",
  )
  ```
Fixed
- `edsnlp.data.read_json` now correctly reads the files from the directory passed as an argument, and not from the parent directory
- Overwrite spacy's Doc, Span and Token pickling utils to allow recursively storing Doc, Span and Token objects in the extension values (in particular, `span._.date.doc`)
- Removed the pendulum dependency, solving various pickling, multiprocessing and missing-attribute errors
Pull Requests
- Drop codecov by @percevalw in #292
- Fix dates by @percevalw in #288
- Loading models from the hf hub by @percevalw in #293
- Fix: only reinstall hf model when cache files are changed by @percevalw in #295
- feat: expose package script to cli by @percevalw in #294
- chore: bump version to 0.12.0 by @percevalw in #296
Full Changelog: v0.11.2...v0.12.0
v0.11.2
Changelog
Fixed
- Fix `edsnlp.utils.file_system.normalize_fs_path` file system detection not working correctly
- Improved performance of `edsnlp.data` methods over a filesystem (`fs` parameter)
Pull Requests
- Fix normalize fs path by @svittoz in #283
- Faster fs io by @percevalw in #285
Full Changelog: v0.11.1...v0.11.2
v0.11.1
Changelog
Added
- Automatic estimation of cpu count when using multiprocessing
- `optim.initialize()` method to create the optimizer state before the first backward pass
Changed
- `nlp.post_init` will not tee lazy collections anymore (use `edsnlp.utils.collections.multi_tee` yourself if needed)
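For readers unfamiliar with teeing, the idea can be sketched with the standard library (the actual `edsnlp.utils.collections.multi_tee` may differ, e.g. in how it handles already-reusable sequences): a one-shot iterable is duplicated into several independent iterators.

```python
from itertools import tee

def multi_tee(iterable, n=2):
    """Duplicate a possibly one-shot iterable into n independent iterators."""
    return tee(iterable, n)

gen = (x * x for x in range(4))  # a one-shot generator
a, b = multi_tee(gen)
print(list(a))  # [0, 1, 4, 9]
print(list(b))  # [0, 1, 4, 9]  -- still available, unlike the raw generator
```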
Fixed
- Corrected inconsistencies in `eds.span_linker`
Pull Requests
- Fix span linking by @percevalw in #282
Full Changelog: v0.11.0...v0.11.1
v0.11.0
Changelog
Added
- Support for a `filesystem` parameter in every `edsnlp.data.read_*` and `edsnlp.data.write_*` function

- Pipes of a pipeline are now easily accessible with `nlp.pipes.xxx` instead of `nlp.get_pipe("xxx")`

- Support builtin Span attributes in the converters' `span_attributes` parameter, e.g.

  ```python
  import edsnlp

  nlp = ...
  nlp.add_pipe("eds.sentences")

  data = edsnlp.data.from_xxx(...)
  data = data.map_pipeline(nlp)
  data.to_pandas(converters={"ents": {"span_attributes": ["sent.text", "start", "end"]}})
  ```

- Support assigning Brat AnnotatorNotes as span attributes: `edsnlp.data.read_standoff(..., notes_as_span_attribute="cui")`

- Support for mapping full batches in `edsnlp.processing` pipelines with the `map_batches` lazy collection method:

  ```python
  import edsnlp

  data = edsnlp.data.from_xxx(...)
  data = data.map_batches(lambda batch: do_something(batch))
  data.to_pandas()
  ```

- New `data.map_gpu` method to map a deep learning operation on some data and take advantage of edsnlp multi-gpu inference capabilities

- Added average precision computation in the edsnlp span_classification scorer

- You can now add pipes to your pipeline by instantiating them directly, which comes with many advantages, such as auto-completion, introspection and type checking!

  ```python
  import edsnlp, edsnlp.pipes as eds

  nlp = edsnlp.blank("eds")
  nlp.add_pipe(eds.sentences())
  # instead of nlp.add_pipe("eds.sentences")
  ```

  The previous way of adding pipes is still supported.

- New `eds.span_linker` deep-learning component to match entities with their concepts in a knowledge base, in synonym-similarity or concept-similarity mode
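The batch-level mapping idea behind `map_batches` can be sketched independently of EDS-NLP (a hypothetical stdlib-only helper, not the lazy-collection implementation): items are grouped into fixed-size batches, a function is applied to each whole batch, and results are flattened back.

```python
from itertools import islice

def map_batches(items, fn, batch_size=2):
    """Apply fn to whole batches of items and flatten the results."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield from fn(batch)

# A batch-level operation: normalize each value against its batch maximum.
out = list(map_batches([1, 2, 3, 4, 5], lambda b: [x / max(b) for x in b]))
print(out)  # [0.5, 1.0, 0.75, 1.0, 1.0]
```

Operating on whole batches (rather than one item at a time) is what makes batch-dependent operations like the normalization above possible.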
Changed
- `nlp.preprocess_many` now uses lazy collections to enable parallel processing
- ⚠️ Breaking change. Improved and simplified `eds.span_qualifier`: we didn't support combination groups before, so this feature was scrapped for now. We now also support splitting values of a single qualifier between different span labels.
- Optimized edsnlp.data batching, especially for large batch sizes (removed a quadratic loop)
- ⚠️ Breaking change. By default, the name of components added to a pipeline is now the default name defined in their class `__init__` signature. For most components of EDS-NLP, this will change the name from "eds.xxx" to "xxx".
Fixed
- Flatten list outputs (such as the "ents" converter) when iterating: `nlp.map(data).to_iterable("ents")` is now a list of entities, and not a list of lists of entities
- Allow the span pooler to choose between multiple base embedding spans (as likely produced by `eds.transformer`) by sorting them by Dice overlap score
- EDS-NLP no longer raises an error when saving a model to an already existing, but empty directory
Pull Requests
- Support for a filesystem param in all edsnlp.data readers/writers by @percevalw in #274
- Data fixes by @percevalw in #275
- Refacto span classification by @percevalw in #276
- Entity linking by @percevalw in #280
- chore: bump version to 0.11.0 by @percevalw in #281
Full Changelog: v0.10.7...v0.11.0
v0.10.7
Changelog
Added
- Support empty `converter` (the default now) in `edsnlp.data` writers (do not convert by default)
- Add support for polars data import / export
- Allow kwargs in `eds.transformer` to pass to the transformer model
Changed
- Saving pipelines no longer saves the `disabled` status of the pipes (i.e., all pipes are considered "enabled" when saved). This feature was not used and was causing issues when saving a model wrapped in a `nlp.select_pipes` context.
Fixed
- Allow missing `meta.json`, `tokenizer` and `vocab` paths when loading saved models
- Save torch buffers when dumping machine learning models to disk (previous versions only saved the model parameters)
- Fix automatic `batch_size` estimation in `eds.transformer` when `max_tokens_per_device` is set to `auto` and multiple GPUs are used
- Fix JSONL file parsing
Pull Requests
- Polars by @percevalw in #270
- Various fixes by @percevalw in #271
- chore: bump version to 0.10.7 by @percevalw in #272
Full Changelog: v0.10.6...v0.10.7
v0.10.6
What's Changed
Added
- Added `batch_by`, `split_into_batches_after`, `sort_chunks`, `chunk_size` and `disable_implicit_parallelism` parameters to the processing backends (`simple` and `multiprocessing`) to improve performance and memory usage. Sorting chunks can improve throughput by up to 2x in some cases.
- The deep learning cache mechanism now supports multitask models with weight sharing in multiprocessing mode
- Added a `max_tokens_per_device="auto"` parameter to `eds.transformer` to estimate memory usage and automatically split the input into chunks that fit into the GPU
Changed
- Improved speed and memory usage of the `eds.text_cnn` pipe by running the CNN on a non-padded version of its input: expect a speedup of up to 1.3x in real-world use cases
- Deprecated the converters' `bool_attributes` parameter (especially for BRAT/Standoff data) in favor of the more general `default_attributes`. This new mapping describes how to set attributes on spans for which no attribute value was found in the input format. This is especially useful for negation, or frequent attribute values (e.g. "negated" is often False, "temporal" is often "present"), that annotators may not want to annotate every time.
- The default `eds.ner_crf` window is now set to 40 and its stride to 20, as this doesn't affect throughput (compared to the previous window of 20) and improves accuracy
- New default `overlap_policy='merge'` option and parameter renaming in `eds.span_context_getter` (which replaces `eds.span_sentence_getter`)
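The `default_attributes` behaviour can be sketched as a simple fallback merge (a standalone illustration with a hypothetical helper, not the converter's actual code): values set by the annotator win, and unannotated attributes fall back to the defaults.

```python
def apply_default_attributes(span_attrs, defaults):
    """Fill in attributes that the annotator did not set explicitly."""
    return {**defaults, **span_attrs}

defaults = {"negated": False, "temporal": "present"}

print(apply_default_attributes({"negated": True}, defaults))
# {'negated': True, 'temporal': 'present'}
print(apply_default_attributes({}, defaults))
# {'negated': False, 'temporal': 'present'}
```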
Fixed
- Improved error handling in the `multiprocessing` backend (e.g., no more deadlocks)
- Various improvements to the data-processing documentation pages
- Begin-of-sentence / end-of-sentence transitions of the `eds.ner_crf` component are now disabled when windows are used (i.e., neither with `window=1`, equivalent to a softmax, nor with `window=0`, equivalent to the default full-sequence Viterbi decoding)
- The `eds` tokenizer now inherits from `spacy.Tokenizer` to avoid typing errors
- Only match the 'ne' negation pattern when it is not part of another word, to avoid false positives in cases like `u[ne] cure de 10 jours`
- Disabled pipes are now correctly ignored in the `Pipeline.preprocess` method
- Add "eventuel*" patterns to `eds.hypothesis`
Pull Requests
- Multi head ml by @percevalw in #257
- Default span attributes on data loading by @percevalw in #258
- Disable NER CRF BOS/EOS transitions when CRF windows are enabled by @percevalw in #259
- Fix "eds" tokenizer base by @percevalw in #260
- fix: only match 'ne' negation pattern when not part of another word by @percevalw in #261
- Update patterns for hypothesis détection by @LaRiffle in #266
- Add overlap_policy='merge' option to make_sentence_span_getter by @percevalw in #262
- Fix select pipes by @percevalw in #267
- chore: bump version to 0.10.6 by @percevalw in #268
Full Changelog: v0.10.5...v0.10.6