Releases: aphp/edsnlp

v0.13.0

22 Jul 16:26

Changelog

Added

  • data.set_processing(...) now exposes an autocast parameter to disable or tweak the automatic casting of tensors
    during processing (see the sketch after this list). Autocasting should result in a slight speedup, but may lead to numerical instability.
  • Use torch.inference_mode to disable view tracking and version counter bumps during inference.
  • Added a new NER pipeline for suicide attempt detection
  • Added date cues (regular expression matches that contributed to a date being detected) under the extension ent._.date_cues
  • Added table processing in eds.measurement
  • Added 'all' as a possible input in the eds.measurement measurements config
  • Added new units in eds.measurement
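
A minimal sketch of the new autocast option, following the document's own from_xxx placeholder for the data source; whether a specific dtype can be passed in addition to a boolean is not confirmed here:

import edsnlp

nlp = ...  # a pipeline containing deep-learning components
data = edsnlp.data.from_xxx(...)
data = data.map_pipeline(nlp)
# autocast=False disables the automatic casting of tensors during processing
data = data.set_processing(autocast=False)
data.to_pandas()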

Changed

  • Default to mixed precision inference

Fixed

  • edsnlp.load("your/huggingface-model", install_dependencies=True) now correctly resolves the Python pip executable
    (especially on Colab) to auto-install the model dependencies
  • We now better handle empty documents in the eds.transformer, eds.text_cnn and eds.ner_crf components
  • Support mixed precision in eds.text_cnn and eds.ner_crf components
  • Support transformers versions that predate quantization support (< 4.30)
  • Verify that all batches are non-empty
  • Fix span_context_getter for context_words = 0, context_sents > 2 and support asymmetric contexts
  • Don't split sentences on rare unicode symbols
  • Better detection of abbreviations like E.coli, which is now tokenized as [E., coli] and not [E, ., coli]

Full Changelog: v0.12.3...v0.13.0

v0.12.3

17 Jun 09:49

Fix model loading messages

v0.12.2

16 Jun 23:37

Changelog

Changed

Packages:

  • Pip-installable models are now built with hatch instead of poetry, which allows us to expose artifacts (weights)
    at the root of the sdist package (uploadable to HF) and move them inside the package upon installation to avoid conflicts.
  • Dependencies are no longer inferred with dill-magic (this didn't work well before anyway)
  • Option to perform substitutions in the model's README.md file (e.g., for the model's name, metrics, ...)
  • Hugging Face models are now installed as pip editable installations, which is faster since the weights are not copied around

Full Changelog: v0.12.1...v0.12.2

v0.12.1

05 Jun 12:36

Changelog

Added

  • Added binary distribution for linux aarch64 (Streamlit's environment)
  • Added a new separator option in eds.table and a new input check

Fixed

  • Make catalogue & entry points compatible with Python 3.7 through 3.12
  • Check that the data has a doc before trying to use the document's note_datetime

Full Changelog: v0.12.0...v0.12.1

v0.12.0

21 May 23:27

Changelog

Added

  • The eds.transformer component now accepts prompts (passed to its preprocess method, see breaking change below) to add before each window of text to embed.
  • LazyCollection.map / map_batches now support generator functions as arguments.
  • Window stride can now be disabled (i.e., stride = window) during training in the eds.transformer component by setting training_stride = False
  • Added a new eds.ner_overlap_scorer to evaluate matches between two lists of entities, counting a true positive when the Dice overlap is above a given threshold
  • edsnlp.load now accepts EDS-NLP models from the huggingface hub 🤗 (see the sketch after this list)!
  • New python -m edsnlp.package command to package a model for the huggingface hub or pypi-like registries
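
A minimal sketch of loading a model from the hub; the repository id below is a placeholder, not an actual published model:

import edsnlp

# "your-username/your-model" stands in for a model packaged with the new
# `python -m edsnlp.package` command and pushed to the Hugging Face hub
nlp = edsnlp.load("your-username/your-model")
doc = nlp("Le patient est admis pour une pneumopathie.")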

Changed

  • Trainable embedding components now all use foldedtensor to return embeddings, instead of returning a tensor of floats and a mask tensor.
  • 💥 TorchComponent __call__ no longer applies the end-to-end method, and instead calls the forward method directly, like all torch modules.
  • The trainable eds.span_qualifier component has been renamed to eds.span_classifier to reflect its general purpose (it doesn't only predict qualifiers, but any attribute of a span using its context or not).
  • omop converter now takes the note_datetime field into account by default when building a document
  • span._.date.to_datetime() and span._.date.to_duration() now automatically take the note_datetime into account
  • nlp.vocab is no longer serialized when saving a model, as it may contain sensitive information and can be recomputed during inference anyway
  • 💥 Major breaking change in trainable components, moving towards a more "task-centric" design:
    • the eds.transformer component is no longer responsible for deciding which spans of text ("contexts") should be embedded. These contexts are now passed via the preprocess method, which now accepts more arguments than just the docs to process.
    • similarly, the eds.span_pooler is no longer responsible for deciding which spans to pool, and instead pools all spans passed to it in the preprocess method.

Consequently, the eds.transformer and eds.span_pooler no longer accept their span_getter argument, and the eds.ner_crf, eds.span_classifier, eds.span_linker and eds.span_qualifier components now accept a context_getter argument instead, as well as a span_getter argument for the latter two. This refactoring can be summarized as follows:

- eds.transformer.span_getter
+ eds.ner_crf.context_getter
+ eds.span_classifier.context_getter
+ eds.span_linker.context_getter

- eds.span_pooler.span_getter
+ eds.span_qualifier.span_getter
+ eds.span_linker.span_getter

and as an example for the eds.span_linker component:

nlp.add_pipe(
    eds.span_linker(
        metric="cosine",
        probability_mode="sigmoid",
+       span_getter="ents",
+       # context_getter="ents",  -> by default, same as span_getter
        embedding=eds.span_pooler(
            hidden_size=128,
-           span_getter="ents",
            embedding=eds.transformer(
-               span_getter="ents",
                model="prajjwal1/bert-tiny",
                window=128,
                stride=96,
            ),
        ),
    ),
    name="linker",
)

Fixed

  • edsnlp.data.read_json now correctly reads the files from the directory passed as an argument, and not from the parent directory.
  • Overwrite spacy's Doc, Span and Token pickling utils to allow recursively storing Doc, Span and Token objects in the extension values (in particular, span._.date.doc)
  • Removed pendulum dependency, solving various pickling, multiprocessing and missing attributes errors

Full Changelog: v0.11.2...v0.12.0

v0.11.2

10 Apr 14:55

Changelog

Fixed

  • Fix incorrect file system detection in edsnlp.utils.file_system.normalize_fs_path
  • Improved performance of edsnlp.data methods over a filesystem (fs parameter)

Full Changelog: v0.11.1...v0.11.2

v0.11.1

02 Apr 07:54

Changelog

Added

  • Automatic estimation of the CPU count when using multiprocessing
  • optim.initialize() method to create optim state before the first backward pass

Changed

  • nlp.post_init will not tee lazy collections anymore (use edsnlp.utils.collections.multi_tee yourself if needed)

Fixed

  • Corrected inconsistencies in eds.span_linker

Full Changelog: v0.11.0...v0.11.1

v0.11.0

29 Mar 17:38

Changelog

Added

  • Support for a filesystem parameter in every edsnlp.data.read_* and edsnlp.data.write_* functions

  • Pipes of a pipeline are now easily accessible with nlp.pipes.xxx instead of nlp.get_pipe("xxx")

  • Support builtin Span attributes in converters span_attributes parameter, e.g.

    import edsnlp
    
    nlp = ...
    nlp.add_pipe("eds.sentences")
    
    data = edsnlp.data.from_xxx(...)
    data = data.map_pipeline(nlp)
    data.to_pandas(converters={"ents": {"span_attributes": ["sent.text", "start", "end"]}})
  • Support assigning Brat AnnotatorNotes as span attributes: edsnlp.data.read_standoff(..., notes_as_span_attribute="cui")

  • Support for mapping full batches in edsnlp.processing pipelines with map_batches lazy collection method:

    import edsnlp
    
    data = edsnlp.data.from_xxx(...)
    data = data.map_batches(lambda batch: do_something(batch))
    data.to_pandas()
  • New data.map_gpu method to map a deep learning operation on some data and take advantage of edsnlp multi-gpu inference capabilities

  • Added average precision computation in edsnlp span_classification scorer

  • You can now add pipes to your pipeline by instantiating them directly, which comes with many advantages, such as auto-completion, introspection and type checking!

    import edsnlp, edsnlp.pipes as eds
    
    nlp = edsnlp.blank("eds")
    nlp.add_pipe(eds.sentences())
    # instead of nlp.add_pipe("eds.sentences")

    The previous way of adding pipes is still supported.

  • New eds.span_linker deep-learning component to match entities with their concepts in a knowledge base, in synonym-similarity or concept-similarity mode.

Changed

  • nlp.preprocess_many now uses lazy collections to enable parallel processing
  • ⚠️ Breaking change. Improved and simplified eds.span_qualifier: we didn't support combination groups before, so this feature was scrapped for now. We now also support splitting values of a single qualifier between different span labels.
  • Optimized edsnlp.data batching, especially for large batch sizes (removed a quadratic loop)
  • ⚠️ Breaking change. By default, the name of components added to a pipeline is now the default name defined in their class __init__ signature. For most components of EDS-NLP, this will change the name from "eds.xxx" to "xxx".

Fixed

  • Flatten list outputs (such as "ents" converter) when iterating: nlp.map(data).to_iterable("ents") is now a list of entities, and not a list of lists of entities
  • Allow span pooler to choose between multiple base embedding spans (as likely produced by eds.transformer) by sorting them by Dice overlap score.
  • EDS-NLP does not raise an error anymore when saving a model to an already existing, but empty directory

Full Changelog: v0.10.7...v0.11.0

v0.10.7

12 Mar 21:33

Changelog

Added

  • Support an empty converter (now the default) in edsnlp.data writers, i.e., documents are no longer converted by default
  • Add support for polars data import / export (see the sketch after this list)
  • Allow kwargs in eds.transformer to pass to the transformer model
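
A minimal sketch of the polars support; the from_polars / to_polars names follow the library's from_xxx / to_xxx pattern, and the "omop" converter and column names are assumptions here:

import edsnlp
import polars as pl

nlp = ...  # an existing pipeline

# hypothetical input: one note per row
df = pl.DataFrame({"note_id": [1], "note_text": ["Patient admis pour une angine."]})

data = edsnlp.data.from_polars(df, converter="omop")
data = data.map_pipeline(nlp)
out = data.to_polars(converter="omop")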

Changed

  • Saving pipelines no longer saves the disabled status of the pipes (i.e., all pipes are considered "enabled" when saved). This feature was not used and was causing issues when saving a model wrapped in an nlp.select_pipes context.

Fixed

  • Allow missing meta.json, tokenizer and vocab paths when loading saved models
  • Save torch buffers when dumping machine learning models to disk (previous versions only saved the model parameters)
  • Fix automatic batch_size estimation in eds.transformer when max_tokens_per_device is set to auto and multiple GPUs are used
  • Fix JSONL file parsing

Full Changelog: v0.10.6...v0.10.7

v0.10.6

24 Feb 23:34

What's Changed

Added

  • Added batch_by, split_into_batches_after, sort_chunks, chunk_size and disable_implicit_parallelism parameters to the processing (simple and multiprocessing)
    backends to improve performance and memory usage (see the sketch after this list). Sorting chunks can yield up to a 2x speedup in some cases.
  • The deep learning cache mechanism now supports multitask models with weight sharing in multiprocessing mode.
  • Added max_tokens_per_device="auto" parameter to eds.transformer to estimate memory usage and automatically split the input into chunks that fit into the GPU.
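
A minimal sketch of the new performance parameters, assuming they are passed through the lazy collection's set_processing method like the other backend options; the values shown are only illustrative:

import edsnlp

nlp = ...  # a pipeline containing deep-learning components
data = edsnlp.data.from_xxx(...)
data = data.map_pipeline(nlp)
data = data.set_processing(
    backend="multiprocessing",
    sort_chunks=True,   # sort documents inside each chunk to reduce padding
    chunk_size=1024,    # number of documents per chunk
    batch_by="words",   # assumption: batch by word count rather than document count
)
docs = list(data)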

Changed

  • Improved speed and memory usage of the eds.text_cnn pipe by running the CNN on a non-padded version of its input: expect a speedup of up to 1.3x in real-world use cases.
  • Deprecate the converters' bool_attributes parameter (especially for BRAT/Standoff data)
    in favor of a general default_attributes mapping. This new mapping describes how to
    set attributes on spans for which no attribute value was found in the input format.
    This is especially useful for negation or other frequent attribute values (e.g., "negated"
    is often False, "temporal" is often "present") that annotators may not want to
    annotate every time (see the sketch after this list).
  • The default eds.ner_crf window is now set to 40 and the stride to 20, as this doesn't
    affect throughput (compared to the previous window of 20) and improves accuracy.
  • New default overlap_policy='merge' option and parameter renaming in
    eds.span_context_getter (which replaces eds.span_sentence_getter)
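
A minimal sketch of the default_attributes mapping, assuming it can be passed directly to the standoff reader; the path is a placeholder and the attribute values mirror the examples above:

import edsnlp

data = edsnlp.data.read_standoff(
    "path/to/brat/dataset",  # placeholder path
    # spans with no annotated value for these attributes get the defaults below
    default_attributes={"negated": False, "temporal": "present"},
)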

Fixed

  • Improved error handling in multiprocessing backend (e.g., no more deadlock)
  • Various improvements to the data processing related documentation pages
  • Begin-of-sentence / end-of-sentence transitions of the eds.ner_crf component are now
    disabled when windows are used (i.e., neither window=1, equivalent to a softmax, nor
    window=0, equivalent to the default full-sequence Viterbi decoding)
  • The eds tokenizer now inherits from spacy.Tokenizer to avoid typing errors
  • Only match the 'ne' negation pattern when it is not part of another word, to avoid false positive cases like u[ne] cure de 10 jours
  • Disabled pipes are now correctly ignored in the Pipeline.preprocess method
  • Add "eventuel*" patterns to eds.hyphothesis

Full Changelog: v0.10.5...v0.10.6