Releases: explosion/spaCy
v3.1.3: Bug fixes and UX updates
✨ New features and improvements
- The `v3` of `WandbLogger` now supports optional `run_name` and `entity` parameters.
- Improved UX when providing invalid `pos` values for a `Doc` or `Token`.
🔴 Bug fixes
- Fix issue #9001: Pass alignments to `Matcher` callbacks.
- Fix issue #9009: Include component factories in third-party dependencies resolver.
- Fix issue #9012: Correct type of `config` in `create_pipe`.
- Fix issue #9014: Allow `typer` 0.4 to provide support for both Click 7 and Click 8.
- Fix issue #9033: Fix verbs list for French tokenizer exceptions.
- Fix issue #9059: Pass overrides to subcommands in `spacy project` workflows.
- Fix issue #9074: Improve UX around `repo` and `path` arguments in `spacy project`.
- Fix issue #9084: Fix inference of `epoch_resume` in `spacy pretrain`.
- Fix issue #9163: Handle `spacy-legacy` in `spacy package` dependency detection.
- Fix issue #9211: Include only runtime-relevant dependencies in `spacy package`.
📖 Documentation and examples
- Various updates to the documentation.
- A few additions and updates to the spaCy universe.
- Extended the developer documentation with information about the listener pattern, the `StringStore` and the `Vocab`.
👥 Contributors
@adrianeboyd, @davidefiocco, @davidstrouk, @filipematos95, @honnibal, @ines, @j-frei, @Joozty, @kwhumphreys, @mjhajharia, @mylibrar, @polm, @rspeer, @shigapov, @svlandeg, @thomashacker
v3.1.2: Improved spancat component and various bugfixes
✨ New features and improvements
- NEW: Provide scores for the `SpanCategorizer` predictions.
- NEW: Broader compatibility with type checkers thanks to `.pyi` stub files.
- NEW: Auto-detect package dependencies in `spacy package`.
- New `INTERSECTS` operator for the Matcher.
- More debugging info for `spacy project` `push` and `pull` commands.
- Allow passing in a precomputed array for speeding up multiple `Span.as_doc` calls.
- The default `da` transformer is now the same as the one from the trained pipelines (`Maltehb/danish-bert-botxo`).
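The set-style Matcher operators named in these notes (`INTERSECTS`, `ISSUBSET`, `ISSUPERSET`) can be read as ordinary set comparisons. The sketch below is plain Python, not spaCy's implementation: it assumes each operator compares a token's list-valued attribute against the list given in the pattern, and the helper names and sample feature strings are hypothetical.

```python
from typing import Iterable


def is_subset(token_values: Iterable[str], pattern_values: Iterable[str]) -> bool:
    # True when every token value also appears in the pattern list.
    return set(token_values) <= set(pattern_values)


def is_superset(token_values: Iterable[str], pattern_values: Iterable[str]) -> bool:
    # True when the token values cover every value in the pattern list.
    return set(token_values) >= set(pattern_values)


def intersects(token_values: Iterable[str], pattern_values: Iterable[str]) -> bool:
    # True when the token values and the pattern list share at least one value.
    return bool(set(token_values) & set(pattern_values))


# Hypothetical morphological features for a single token.
token_features = ["Number=Sing", "Case=Nom"]

shares_a_feature = intersects(token_features, ["Number=Sing", "Number=Plur"])
covers_pattern = is_superset(token_features, ["Case=Nom"])
within_pattern = is_subset(token_features, ["Case=Nom"])
```

Under this reading, an intersection test is the loosest of the three: one shared value is enough, whereas the subset and superset tests constrain every value on one side.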
🔴 Bug fixes
- Fix issue #8767: Fix offsets of empty and out-of-bounds spans.
- Fix issue #8774: Ensure `debug data` runs correctly with a custom tokenizer.
- Fix issue #8784: Fix incorrect `ISSUBSET` and `ISSUPERSET` in schema and docs.
- Fix issue #8796: Respect the `no_skip` value for `spacy project run`.
- Fix issue #8810: Make `ConsoleLogger` flush after each logging line.
- Fix issue #8819: Pass `exclude` when serializing the vocab.
- Fix issue #8830: Avoid adding sourced vectors hashes if not necessary.
- Fix issue #8970: Fix `allow_overlap` default for span categorizer scoring.
- Fix issue #8982: Add glossary entry for `_SP`.
- Fix issue #9007: Fix span categorizer training on nested entities.
📖 Documentation and examples
- New developer documentation covering spaCy's internals and code conventions.
- Added a documentation section on preparing training data in spaCy's binary format.
- Updated some error/log messages to be more informative.
- Various updates to the documentation.
- A few new additions to the spaCy universe.
👥 Contributors
@adrianeboyd, @bbieniek, @DuyguA, @ezorita, @HLasse, @honnibal, @ines, @kabirkhan, @kevinlu1248, @ldorigo, @Ledenel, @nsorros, @polm, @svlandeg, @swfarnsworth, @themrmax, @thomashacker
v3.0.7: Bug fixes and base support for Azerbaijani
✨ New features and improvements
- Alpha tokenization support for Azerbaijani.
- Updates for French stop words.
🔴 Bug fixes
- Fix issue #7629: Fix scoring normalization.
- Fix issue #7886: Fix unknown tokens percentage in `debug data`.
- Fix issue #7907: Update `load_lookups` return type and docstring.
- Fix issue #7930: Make `EntityLinker` robust for `nO=None`.
- Fix issue #7925: Skip vector ngram backoff if `minn` is not set.
- Fix issue #7973: Fix `debug model` for transformers.
- Fix issue #7988: Preserve existing `ENT_KB_ID` in `ner` annotation.
- Fix issue #7992: Fix span offsets for `Matcher(as_spans)` on spans.
- Fix issue #8004: Handle errors while multiprocessing.
- Fix issue #8009: Fix `Doc.from_docs()` for all empty docs.
- Fix issue #8012: Fix ensemble `textcat` with listener.
- Fix issue #8054: Add `ENT_ID` and `NORM` to `DocBin` strings.
- Fix issue #8055: Handle partial entities in `Span.as_doc`.
- Fix issue #8062: Make all `Span` attrs writable.
- Fix issue #8066: Update `debug data` for `textcat`.
- Fix issue #8069: Custom warning if `DocBin` is too large.
- Fix issue #8113: Support `to/from_bytes` for `KnowledgeBase` and `EntityLinker`.
- Fix issue #8116: Fix offsets in `Span.get_lca_matrix`.
- Fix issue #8132: Remove unsupported attrs from `attrs.IDS`.
- Fix issue #8158: Ensure tolerance is passed on in `spacy.batch_by_words.v1`.
- Fix issue #8169: Fix bug from `EntityRuler`: `ent_ids` returns `None` for phrases.
- Fix issue #8208: Address missing config overrides post load of models.
- Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
- Fix issue #8216: Don't add duplicate patterns in `EntityRuler`.
- Fix issue #8244: Use context manager when reading model file.
- Fix issue #8245: Fix other open calls without context managers.
- Fix issue #8265: Address mypy errors.
- Fix issue #8299: Restrict `pymorphy2` requirement to `pymorphy2` mode in Russian and Ukrainian lemmatizers.
- Fix issue #8335: Raise error if deps not provided with heads in `Doc`.
- Fix issue #8368: Preserve whitespace in `Span.lemma_`.
- Fix issue #8396: Make `JsonlReader` path optional.
- Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
- Fix issue #8423: Update validate CLI to fix compat and ignore warnings.
- Fix issue #8426: Fix setting empty entities in `Example.from_dict`.
- Fix issue #8487: Fix span offsets and keys in `Doc.from_docs`.
- Fix issue #8584: Raise an error for `textcat` with <2 labels.
- Fix issue #8551: Fix duplicate spacy package CLI opts.
👥 Contributors
@adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @fhopp, @frascuchon, @graue70, @ines, @jenojp, @jhroy, @jklaise, @juliensalinas, @meghanabhange, @michael-k, @narayanacharya6, @polm, @sevdimali, @svlandeg, @ZeeD
v3.1.1: Support for Ancient Greek and various bug fixes
✨ New features and improvements
- Alpha tokenization support for Ancient Greek.
- Implementation of a `noun_chunk` iterator for Dutch.
- Support for `black` & `flake8` as pre-commit hooks.
- New `spacy.ngram_range_suggester.v1` for suggesting a range of n-gram sizes for the `spancat` component.
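To make "suggesting a range of n-gram sizes" concrete, here is a minimal plain-Python sketch of the idea: enumerate every candidate span whose length falls between a minimum and a maximum number of tokens. This is an illustration only and not the code behind `spacy.ngram_range_suggester.v1`; the function name, its parameters and the sample tokens are assumptions made for the example.

```python
from typing import List, Tuple


def ngram_range_spans(n_tokens: int, min_size: int, max_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets for every n-gram with min_size <= n <= max_size."""
    spans = []
    for size in range(min_size, max_size + 1):
        # Slide a window of `size` tokens over the document; the range is
        # empty when the window is longer than the document.
        for start in range(0, n_tokens - size + 1):
            spans.append((start, start + size))
    return spans


tokens = ["New", "York", "City", "subway"]
spans = ngram_range_spans(len(tokens), min_size=1, max_size=2)
candidates = [" ".join(tokens[start:end]) for start, end in spans]
```

A span categorizer would then score each candidate span, so the size range directly controls how many candidates are considered per document.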
🔴 Bug fixes
- Fix issue #8638: Fix Azerbaijani initialization.
- Fix issue #8639: Use 0-vector for OOV lexemes.
- Fix issue #8640: Update lexeme ranks for loaded vectors.
- Fix issue #8651: Fix `ru` and `uk` multiprocessing (with `spawn`).
- Fix issue #8663: Preserve existing `meta` information with `spacy package`.
- Fix issue #8718: Ensure that `replace_pipe` takes disabled components into account.
👥 Contributors
@adrianeboyd, @honnibal, @ines, @jmyerston, @julien-talkair, @KennethEnevoldsen, @mariosasko, @mylibrar, @polm, @rynoV, @svlandeg, @thomashacker, @yohasebe
v3.1.0: New pipelines for Catalan & Danish, SpanCategorizer for arbitrary overlapping spans, use predicted annotations during training, bug fixes & more
✨ New features and improvements
- NEW: Trained pipelines for Catalan and a new transformer-based pipeline for Danish.
- NEW: Experimental `SpanCategorizer` component for labeling arbitrary and potentially overlapping spans of text.
- NEW: Use predicted annotations during training via the `[training.annotating_components]` config setting.
- Alpha tokenization support for Azerbaijani.
- Part-of-speech tag-based lemmatizers for Catalan and Italian.
- The TextCatCNN and TextCatBOW architectures are now resizable.
- Support updating the `EntityRecognizer` with known incorrect span annotations.
- Auto-generate a pretty `README.md` based on the meta in `spacy package`.
For more details, see the New in v3.1 usage guide.
📦 New trained pipelines
Package | Language | UPOS | Parser LAS | NER F |
---|---|---|---|---|
`ca_core_news_sm` | Catalan | 98.2 | 87.4 | 79.8 |
`ca_core_news_md` | Catalan | 98.3 | 88.2 | 84.0 |
`ca_core_news_lg` | Catalan | 98.5 | 88.4 | 84.2 |
`ca_core_news_trf` | Catalan | 98.9 | 93.0 | 91.2 |
`da_core_news_trf` | Danish | 98.0 | 85.0 | 82.9 |
⚠️ Upgrading from v3.0
- Due to the use of configs with extensive versioning, v3.0 pipelines should be compatible with v3.1; however, you may see slight differences in performance. Test your v3.0 pipeline with v3.1 against your test suite and, if the performance is identical, extend the `spacy_version` in your model package meta to `">=3.0.0,<3.2.0"`. If you run into degraded performance, retrain your pipeline with v3.1.
- Use `spacy init fill-config` to update a v3.0 config for v3.1.
- When sourcing a pipeline component that requires static vectors, it is now required to include the source model's vectors in `[initialize.vectors]`.
- Logger warnings have been converted to Python warnings. Use `warnings.filterwarnings` or the new helper method `spacy.errors.filter_warning(action, error_msg='')` to manage warnings.
For more information, see Notes on upgrading from v3.0.
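Because these are now ordinary Python warnings, the standard library's `warnings` module is enough to silence a specific one. The sketch below uses only the standard library: `emit_pipeline_warning` is a hypothetical stand-in for a library call, and the `[W000]` and `[W999]` codes and message texts are invented for illustration.

```python
import warnings


def emit_pipeline_warning() -> None:
    # Hypothetical stand-in for a library call that emits a Python warning.
    warnings.warn("[W000] example pipeline warning", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    # Record every warning by default ...
    warnings.simplefilter("always")
    # ... except those whose message starts with the (invented) code W000.
    # The pattern is a regular expression matched at the start of the message.
    warnings.filterwarnings("ignore", message=r"\[W000\]")

    emit_pipeline_warning()
    warnings.warn("[W999] unrelated warning", UserWarning)

remaining = [str(w.message) for w in caught]
```

Filters added later take precedence over earlier ones, so the `ignore` rule wins for matching messages while everything else is still recorded.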
🔴 Bug fixes
- Fix issue #7036: Use a context manager when reading model.
- Fix issue #7629: Fix scoring normalization.
- Fix issue #7799: Ensure `spacy ray` command works.
- Fix issue #7807: Show warning if entity ruler runs without patterns.
- Fix issue #7886: Fix unknown tokens percentage in `debug data`.
- Fix issue #7930: Make `EntityLinker` robust for nO=None.
- Fix issue #7925: Skip vector ngram backoff if `minn` is not set.
- Fix issue #7973: Fix `debug model` for transformers.
- Fix issue #7988: Preserve existing `ENT_KB_ID` in `ner` annotation.
- Fix issue #8004: Handle errors while multiprocessing.
- Fix issue #8009: Fix `Doc.from_docs()` for all empty docs.
- Fix issue #8012: Fix ensemble `textcat` with listener.
- Fix issue #8054: Add `ENT_ID` and `NORM` to `DocBin` strings.
- Fix issue #8055: Handle partial entities in `Span.as_doc`.
- Fix issue #8062: Make all `Span` attrs writable.
- Fix issue #8066: Update `debug data` for `textcat`.
- Fix issue #8069: Custom warning if `DocBin` is too large.
- Fix issue #8099: Update Vietnamese tokenizer.
- Fix issue #8113: Support `to/from_bytes` for `KnowledgeBase` and `EntityLinker`.
- Fix issue #8116: Fix offsets in `Span.get_lca_matrix`.
- Fix issue #8132: Remove unsupported attrs from `attrs.IDS`.
- Fix issue #8158: Ensure tolerance is passed on in `spacy.batch_by_words.v1`.
- Fix issue #8169: Fix bug from `EntityRuler`: `ent_ids` returns None for phrases.
- Fix issue #8208: Address missing config overrides post load of models.
- Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
- Fix issue #8216: Don't add duplicate patterns in `EntityRuler`.
- Fix issue #8265: Address mypy errors.
- Fix issue #8335: Raise error if deps not provided with heads in `Doc`.
- Fix issue #8368: Preserve whitespace in `Span.lemma_`.
- Fix issue #8388: Don't clobber vectors when loading components from source models.
- Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
- Fix issue #8426: Fix setting empty entities in `Example.from_dict`.
- Fix issue #8441: Add correct types for `Language.pipe` return values.
- Fix issue #8487: Fix span offsets and keys in `Doc.from_docs`.
- Fix issue #8559: Fix vectors check for sourced components.
- Fix issue #8584: Raise an error for `textcat` with <2 labels.
👥 Contributors
@aajanki, @adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @explosion-bot, @fhopp, @frascuchon, @graue70, @gtoffoli, @honnibal, @ines, @jacopofar, @jenojp, @jhroy, @jklaise, @juliensalinas, @kevinlu1248, @ldorigo, @mathcass, @meghanabhange, @michael-k, @narayanacharya6, @NirantK, @nsorros, @polm, @sevdimali, @svlandeg, @themrmax, @xadrianzetx, @yohasebe, @ZeeD
v2.3.7: Bug fix for download CLI
🔴 Bug fixes
- Fix issue #8286: Fix `spacy download`.
v2.3.6: Bug fixes and base support for Amharic
✨ New features and improvements
- Add base support for Amharic.
- Add noun chunk iterator for Danish.
- Updates to French, Portuguese and Romanian stop words.
🔴 Bug fixes
- Fix issue #6705: Fix deserialization of null `token_match` and `url_match` for the tokenizer.
- Fix issue #6712: Prevent overlapping noun chunks for Spanish.
- Fix issue #6745: Fix minibatch iterator when size iterator is finished.
- Fix issue #6759: Skip 0-length matches in the `Matcher`.
- Fix issue #6771: Support `IS_SENT_START` in the `PhraseMatcher`.
- Fix issue #6772: Fix `Span.text` for empty spans.
- Fix issue #6820: Improve `Doc.char_span` `alignment_mode` handling.
- Fix issue #6857: Remove `--no-cache-dir` when downloading models.
- Fix issue #8115: Fix offsets in `Span.get_lca_matrix`.
👥 Contributors
Thanks to @alexcombessie, @AMArostegui, @bryant1410, @Cristianasp, @garethsparks, @jenojp, @jganseman, @jumasheff, @lorenanda, @ophelielacroix, @thomasbird, @timgates42, @tupui and @yosiasz for the pull requests and contributions.
v3.0.6: assemble CLI, Matcher alignments, training from streamed corpora and many bug fixes
✨ New features and improvements
- New `assemble` CLI command for assembling a pipeline from a config without training.
- Add support for match alignments in the `Matcher` to align matched tokens with matcher patterns.
- Add support for training from streamed corpora.
- Add support for W&B data and model checkpoint logging and versioning in `spacy.WandbLogger.v2`.
- Extend `Scorer.score_spans` to support overlapping and unlabeled spans.
- Update `debug data` for new v3 components.
- Improve language data for Italian.
- Various improvements to error handling and UX.
🔴 Bug fixes
- Fix issue #7408: Add `vocab` kwarg to `spacy.load`.
- Fix issue #7419: Exclude user hooks in displacy conversion.
- Fix issue #7421: Update `--code` usage in CLI commands.
- Fix issue #7424: Preserve sent starts on retokenization without parse.
- Fix issue #7440: Fix pymorphy2 lookup lemmatizer.
- Fix issue #7471: Improve warnings related to listening components.
- Fix issue #7488: Fix `upstream` check in pretraining.
- Fix issue #7489: Support `callbacks` entry points.
- Fix issue #7497: Merge `doc.spans` in `Doc.from_docs()`.
- Fix issue #7528: Preserve user data for `DependencyMatcher` on spans.
- Fix issue #7557: Fix `__add__` method for `PRFScore`.
- Fix issue #7574: Fix conversion of custom extension data in `Span.as_doc` and `Doc.from_docs`.
- Fix issue #7620: Fix `replace_listeners` in configs.
- Fix issue #7626: Fix vectors data on GPU.
- Fix issue #7630: Update NEL for entities crossing sentence boundaries.
- Fix issue #7631: Fix parser sourcing in NER converter.
- Fix issue #7642: Fix handling of hyphen string value in config files.
- Fix issue #7655: Fix sent starts when converting from v2 JSON training format.
- Fix issue #7674: Fix handling of unknown tokens in `StaticVectors`.
- Fix issue #7690: Fix pickling of `Lemmatizer`.
- Fix issue #7749: Update `Tokenizer.explain` for special cases in v3.
- Fix issue #7755: Fix config parsing of ints/strings.
- Fix issue #7836: Fix tokenizer cache flushing.
- Fix issue #7847: Fix handling of boolean values in `Example.from_dict` for sent starts.
📖 Documentation and examples
- Add documentation for legacy functions and architectures.
- Add documentation for pretrained pipeline design.
- Add more details about `pipe` and multiprocessing.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @alvaroabascar, @armsp, @AyushExel, @BramVanroy, @broaddeep, @bryant1410, @bsweileh, @dpalmasan, @Findus23, @graue70, @jaidevd, @koaning, @langdonholmes, @m0canu1, @meghanabhange, @paoloq, @plison, @richardpaulhudson, @SamEdwardes, @Stannislav for the pull requests and contributions!
v3.0.5: Bug fix for thinc requirement
🔴 Bug fixes
- Fix related to issue #7075: Update `thinc` requirement for Jupyter notebook GPU warning.
v3.0.4: Fix tok2vec pretraining, source disabled components, better UX and bug fixes
✨ New features and improvements
- Allow sourcing disabled components in config.
- Support `Doc.spans` in `Example.from_dict`.
- Improve transformer recommendations in quickstart widget and `init config`.
- Improve language data for Bulgarian.
- Various improvements to error handling and UX.
🔴 Bug fixes
- Fix issue #6952, #7285, #7289: Make `tok2vec` pretraining and `pretrain` command work as expected again.
- Fix issue #7062: Only evaluate named entities for NEL if there is a corresponding gold span.
- Fix issue #7065: Correctly handle sentence boundaries in `Span.sent`.
- Fix issue #7071: Fix `conll` converter option.
- Fix issue #7100: Re-add `n_sents` to entity linker and fix config handling and I/O.
- Fix issue #7122: Fix displaCy output in `evaluate` CLI.
- Fix issue #7127: Fix initialization of `UkrainianLemmatizer`.
- Fix issue #7176: Re-refactor `Sentencizer` to use `Pipe` API.
- Fix issue #7182: Allow `SpanGroup` import from `spacy.tokens`.
- Fix issue #7204: Adjust Cython compilation for setups with custom include paths.
- Fix issue #7222: Correct YAML formatting in quickstart recommendations for `bg` and `bn`.
- Fix issue #7225: Fix `spans` weakref in `Doc.copy`.
- Fix issue #7237: Fix `is_cython_func` for additional imported code.
- Fix issue #7250: Fix patience for identical scores.
- Fix issue #7329: Make `spacy.orth_variants.v1` and `spacy.lower_case.v1` augmenters work as expected.
- Fix issue #7352: Sort `EntityRuler.labels` alphabetically.
📖 Documentation and examples
- Add documentation for `textcat_multilabel` component.
- Extend documentation for `Vocab.get_noun_chunks`.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @MartinoMensio, @SergeyShk, @R1j1t, @palandlom, @dardoria, @tocic, @clippered, @graue70, @koaning and @jankrepl for the pull requests and contributions!