========
1.7.3
========
---------------
Overview
---------------
This hotfix release focuses on fixing word-embeddings cluster problems on frameworks such as Databricks, while keeping the 1.7.x performance benefits. Various YARN-based clusters have been tested with this hotfix, Databricks cloud among them.
Aside from that, multiple improvements have been committed towards better PySpark-NLP support, fixing diverse technical issues in the API that improve consistency in annotator super classes.
Finally, pip installation has been made easier with a SparkNLP class that creates the SparkSession automatically, for those who are learning Python Spark on their local computers.
Thanks to all the community for reporting issues.
---------------
Bugfixes
---------------
* Fixed 'RocksDB not serializable' when running LightPipeline scenarios or using the functions._ implicits
* Fixed dependency with apache.commons.codec causing Apache Zeppelin 0.8.0 not to work in %pyspark
* Fixed Python pretrained() downloader not correctly setting Params and incorrectly creating new Model UIDs
* Fixed error 'JavaPackage not callable' when using AnnotatorModel.load() API without instantiating the class first
* Fixed Spark addFiles missing the local file, causing Word Embeddings not to work properly in some cluster-based frameworks
* Fixed broadcast NoSuchElementException `Failed to get broadcast_6_piece0 of broadcast_6` causing pretrained models not to work in cluster frameworks (thanks @EnricoMi)
---------------
Developer API
---------------
* EmbeddingsHelper.setRef() has been removed. The reference is now set implicitly through EmbeddingsHelper.load(). Embeddings no longer need to be loaded before deserializing models.
* Fixed and properly renamed the chunk2doc and doc2chunk transformers; they should now work as expected
* Renamed setCompositeTokens to setCompositeTokensPatterns to remind users that regular expressions are used in this Param
* Fixed PySpark automatic getter and setter Param generation when using pretrained() or load() models
* Simplified cluster path resolution for word embeddings
---------------
Other
---------------
* sparknlp.base now contains a SparkNLP() class which automatically creates a SparkSession using appropriate jar settings. Helps newcomers get started with PySpark NLP. A sketch of the equivalent manual setup follows.
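For reference, a minimal Scala sketch of what such a session setup amounts to; the Maven coordinate and master value are assumptions for this release, so adjust them to your environment:

    import org.apache.spark.sql.SparkSession

    // The SparkNLP() helper in sparknlp.base performs an equivalent setup
    // automatically; the package coordinate below is an assumption.
    val spark = SparkSession.builder()
      .appName("spark-nlp-starter")
      .master("local[*]")
      .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:1.7.3")
      .getOrCreate()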
========
1.7.2
========
---------------
Overview
---------------
Quick release with another hotfix for a newly found bug when deserializing word embeddings on a distributed filesystem. Also introduces changes to the application.conf reader in order
to allow run-time changes, and renames functions in the EmbeddingsHelper API.
---------------
Bugfixes
---------------
* Fixed embeddings deserialization from distributed filesystems (caused by the Windows path fix)
* Fixed application.conf not reading changes at runtime
* Added missing remote_locs argument in python pretrained() functions
* Fixed wrong build version introduced in 1.7.1 to detect proper pretrained models version
---------------
Developer API
---------------
* Renamed EmbeddingsHelper functions for more convenience
========
1.7.1
========
---------------
Overview
---------------
Thanks to our Slack community (Bryan Wilkinson, @maziyarpanahi, @apiltamang), a few bugs were pointed out very quickly after the 1.7.0 release. This hotfix fixes an embeddings deserialization issue when cache_pretrained is located on a distributed filesystem.
It also fixes some path resolution on Windows. Thanks to Maziyar, .gitattributes has been added in order to identify the proper languages in GitHub.
Finally, 1.7.1 adds Chunk2Doc, an annotator missing from 1.7.0, which converts CHUNK types into DOCUMENT types for further retokenization or other annotations.
---------------
Enhancements
---------------
* Chunk2Doc annotator converts annotatorType from CHUNK to DOCUMENT
---------------
Bugfixes
---------------
* Fixed embedding-based annotators deserialization error when cache_pretrained is on distributed fs (Thanks Bryan Wilkinson for pointing out issue and testing fix)
* Fixed windows path reading when deserializing embeddings (Thanks @apiltamang)
---------------
Other
---------------
* .gitattributes added in order to properly discard Jupyter as the main language for the GitHub repo (thanks @maziyarpanahi)
========
1.7.0
========
---------------
Overview
---------------
Having multiple annotators that use the same word embeddings set may result in huge pipelines and heavy driver memory and storage consumption.
From now on, embeddings may be shared and reused across annotators, making the process much more efficient.
Also, thanks to @apiltamang, we now better support path resolution for Windows implementations.
---------------
Enhancements
---------------
* Memory and storage savings: annotators with embeddings now expose the params 'includeEmbeddings' and 'embeddingsRef', which let them set whether embeddings should be included when saved, or referenced by id from other annotators (see the sketch after this list)
* New EmbeddingsHelper class allows embeddings management
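A hypothetical sketch of how the two params combine; the setter names are assumed from the param names above, and the annotator choice is only illustrative:

    import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach

    // Setter names assumed from the 'includeEmbeddings' and 'embeddingsRef'
    // params described above.
    val ner = new NerDLApproach()
      .setEmbeddingsRef("shared_glove")  // reference embeddings loaded under this id
      .setIncludeEmbeddings(false)       // exclude embeddings from the saved model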
---------------
Bug fixes
---------------
* Thanks to @apiltamang for improving URI path support for Windows servers
---------------
Developer API
---------------
* Embeddings interfaces and method names completely refactored; hopefully simplified and easier to understand
========
1.6.3
========
---------------
Overview
---------------
This release includes a new annotator for de-identification of sensitive information. It uses CHUNK annotations, meaning its accuracy will depend on the previous annotators in the pipeline.
OCR capabilities have also been improved in the OCR module.
In terms of broken stuff, we've fixed a few annoying bugs in the SymmetricDelete and SentenceDetector explode features.
Finally, the Python package is now part of the official pip repositories, meaning you can install it just like any other module. It also includes the jars, and we've added a SparkNLP class which creates the SparkSession easily for you.
Thanks again for all the community contributions in issues, feedback and comments, on GitHub and in Slack.
---------------
New features
---------------
* DeIdentification annotator: takes DOCUMENT and TOKEN from the original sentence, plus a CHUNK annotation, to anonymize the target chunk in the sentence. The CHUNK annotation might come from NerConverter, TextMatcher or other chunk annotators. A sketch follows.
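A hypothetical usage sketch, assuming the conventional annotator setters and the input types listed above; the import path, Param names and column names are assumptions for this release:

    import com.johnsnowlabs.nlp.annotators.DeIdentification

    // DOCUMENT + TOKEN + CHUNK in, anonymized DOCUMENT out (column names are ours)
    val deid = new DeIdentification()
      .setInputCols(Array("sentence", "token", "ner_chunk"))
      .setOutputCol("deidentified")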
---------------
Enhancements
---------------
* OCR: kernel zoom and region erosion improve overall detection quality. Fixed some stability bugs and improved parallelism
---------------
Bug fixes
---------------
* SentenceDetector's explode-sentences-into-rows feature now works properly
* Fixed the dictionary-based sentiment detector not working on PySpark
* Added missing NerConverter to annotator._ imports
* Fixed SymmetricDelete spell checker deleting tokens in some scenarios
* Fixed SymmetricDelete spell checker's unwanted lower-casing
---------------
Other
---------------
* The PySpark package is now part of the official pip repos
* Pip installation now includes the corresponding spark-nlp jar; the base module includes the SparkNLP SparkSession creator
========
1.6.2
========
---------------
Overview
---------------
In this release, we focused on reviewing our streaming performance by measuring the number of sentences processed per second through a LightPipeline.
We sped up the Norvig Spell Checker by more than 300% by disabling DoubleVariants and improving algorithm ordering. It is now reported capable of 42K sentences per second.
The Symmetric Delete spell checker is also more performant, reported to process 2K sentences per second.
NerCRF has been reported to process 300 sentences per second, while NerDL can do twice as fast (about 700 sentences per second).
Vivekn Sentiment Analysis was improved and is now capable of processing 100K sentences per second (previously below 500).
Finally, SentenceDetector performance was improved by about 40%, from ~30K rows processed per second to ~40K. However, abbreviation processing is now enabled by default, which reduces the final speed to 22K rows per second: a net slowdown, but better accuracy.
Again, thanks to the community for helping with feedback. We welcome everyone to ask questions or give feedback in our Slack channel, or to report issues on GitHub.
---------------
Enhancements
---------------
* OCR now features kernel segmentation. Significantly improves image based PDF processing
* Vivekn Sentiment Analysis prediction performance improved by better data structures
* Both Norvig and Symmetric Delete spell checkers now have improved performance
* SentenceDetector improved accuracy by better handling abbreviations. useAbbreviations is now also turned on by default
* SentenceDetector improved performance significantly by improved preloading of rules
---------------
Bug fixes
---------------
* Fixed NerDL not training correctly (broken since 1.6.0). Pretrained models not affected
* Fixed NerConverter not properly considering multiple sentences per row (after using SentenceDetector), causing an unhandled exception to occur in some scenarios.
* Tensorflow sessions now all support allow_soft_placement, supporting GPU based graphs to work with and without GPU
* Fixed a missing step in the Norvig Spell Checker algorithm that checks for additional variants. May improve accuracy
* Norvig Spell Checker now disables DoubleVariants by default; it was not improving accuracy significantly and was hurting performance badly
---------------
Developer API
---------------
* New FeatureSet allows HashSet params
---------------
Models
---------------
* The Vivekn Sentiment pipeline no longer includes a Spell Checker
* Fixed the Vivekn Sentiment pretrained model, improving accuracy
========
1.6.1
========
---------------
Overview
---------------
Hi! We're glad to announce the new hotfix 1.6.1. Although the changes seem modest or very specific, there is a lot going on under the hood. First of all, we've worked hard with the community to understand S3-based clusters,
which don't have a common fs.defaultFS configuration, which is what we use to determine where the cluster's temp folder is located in order to distribute word embeddings. We fixed two things here:
first, a bug pointing to the wrong filesystem; second, we added a custom override setting in application.conf that allows manually setting where to put temp folders in the cluster. This should help S3 users.
Please share your feedback in this regard.
On the other hand, we created a new annotator type internally. The CHUNK type allows better modularity in the communication between different annotators. Its impact will be noticed implicitly and over time.
---------------
New features
---------------
* New Scala-only functions that make it easier to work with Annotations in DataFrames. They may be imported through com.johnsnowlabs.nlp.functions._ and allow mapping and filtering within and outside Annotations.
filterByAnnotations, mapAnnotations and explodeAnnotations work by providing a column and a function. Check out the documentation; these may come to Python later. A sketch follows.
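A rough sketch of how these implicits might be used; the exact signatures are assumptions inferred from the description above (a column plus a function), so consult the Scala docs for the precise forms:

    import com.johnsnowlabs.nlp.functions._

    // Hypothetical: keep only rows whose "token" annotations contain a long token.
    // df: a DataFrame holding a "token" annotation column; an Annotation carries
    // its matched text in the 'result' field.
    val filtered = df.filterByAnnotations("token", annotations =>
      annotations.exists(_.result.length > 5))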
---------------
Bug fixes
---------------
* Fixed incorrect filesystem readings in some S3 environments for word embeddings
* Fixed NerCRF not correctly training from CONLL, labeling everything as -O- (Thanks @arnound from Slack Channel)
---------------
Enhancements
---------------
* Added overridable config sparknlp.settings.cluster_tmp_dir, which allows setting the cluster location for the temporary embeddings file. May help S3-based clusters with no fs.defaultFS set to a proper distributed storage (see the example after this list).
* New annotator type: CHUNK. Represents a SUBSTRING of DOCUMENT and is used as output from NerConverter, TextMatcher, RegexMatcher and other annotators that retrieve a substring from the original document.
This will make for better modularity and integration within various annotators, such as between NER and AssertionStatus.
* New annotation transformer: ChunkAssembler. Takes a string or array(string) column from a dataset and creates a CHUNK type annotation. The content must also belong to the current DOCUMENT annotation's content.
* SentenceDetector's new param explodeSentences allows exploding sentences within a single row into different rows, to increase parallelism and performance in some scenarios. Particularly useful for OCR-based ones.
* AssertionDLApproach may now be used within LightPipelines
* AssertionDLApproach and AssertionLogRegApproach now work from the CHUNK type instead of start/end bounds. They may still be trained with start/end, though. This means the assertion target may now be any CHUNK output annotator (e.g. RegexMatcher)
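An example of the override mentioned in the first item above, placed in application.conf (HOCON syntax; the value shown is only a placeholder for your own distributed storage path):

    sparknlp.settings.cluster_tmp_dir = "hdfs:///tmp/spark-nlp/embeddings"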
---------------
Other
---------------
* PerceptronApproachLegacy moved back to default PerceptronApproach. Distributed PerceptronApproach moved to PerceptronApproachDistributed due to not meeting accuracy expectations yet.
* Some configuration parameters in application.conf have been appropriately moved to proper annotator Params (NorvigSweeting Spell Checker, Vivekn Approach and Sentiment Detector affected)
* Renamed configuration values in application.conf for better consistency
---------------
Developer API
---------------
* Added beforeAnnotate() and afterAnnotate() to manipulate dataframes after or before calling annotate() UDF
* Added extraValidate() and extraValidateMsg() to all annotators to let developers add additional schema checks in the transformSchema() stage
* Removed the validation() stage from fit(). Allows for more flexible training when some of the columns are not really required yet.
* WrapColumnMetadata() will wrap an Annotation column with its appropriate Metadata. Makes it easier not to forget about Metadata in Schema.
* RawAnnotator trait has now all the basics needed to start a new Annotator without annotate() function. It is a complete previous stage before AnnotatorModel, which inherits from RawAnnotator.
========
1.6.0
========
---------------
Overview
---------------
We're late! But it was worth it. We're glad to release 1.6.0, which brings new features, lots of enhancements and many bugfixes. First of all, we are thankful to the community for participating on Slack and GitHub by reporting feedback and issues.
This release has a new annotator, the Chunker, which allows grabbing pieces of text that follow a particular Part-of-Speech pattern.
On the other hand, we have a brand new OCR-to-Spark-Dataframe utility, which ships as an optional component of Spark-NLP. It requires Tesseract 4.x+ to be installed on your system, and may be downloaded from our website or readme pages.
Aside from that, we improved many areas, from the DocumentAssembler working better with OCR output, down to our deep learning models gaining consistency and accuracy. Word-embedding-based annotators also receive improvements when working in cluster environments.
Finally, we are glad a user contributed a fix to the AWS dependency issue, which particularly happened in Cloudera environments. We're still waiting for feedback, and gladly accept it.
We'll be working on the documentation as this release follows. Thank you.
---------------
New Features
---------------
* New annotator: Chunker. This annotator takes regular expressions over Part-of-Speech tags and returns appropriate chunks of text following such patterns (see the sketch after this list)
* OCR to Spark-NLP: as an optional jar module, users may use the OcrHelper class in order to convert PDF files into Spark Datasets, ready to be used by Spark-NLP's DocumentAssembler. May be used without Spark-NLP. Requires Tesseract 4.x on your system.
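A hedged sketch of the Chunker, assuming the usual annotator setters and a regex-over-POS-tags Param; the Param name and input columns are assumptions for this release:

    import com.johnsnowlabs.nlp.annotators.Chunker

    // Grab noun phrases: optional determiner, any adjectives, one or more nouns
    val chunker = new Chunker()
      .setInputCols(Array("sentence", "pos"))
      .setOutputCol("chunk")
      .setRegexParsers(Array("<DT>?<JJ>*<NN>+"))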
---------------
Enhancements
---------------
* TextMatcher now has a caseSensitive (setCaseSensitive) Param which controls whether matching is case sensitive (ignored if the Normalizer already handled it). The returned word is still the original.
* LightPipelines in Python should now be faster, thanks to an optimization that prefetches results into Python memory instead of going through the py4j bridge
* LightPipelines can now handle embedded Pipelines
* PerceptronApproach now trains utilizing a fully Spark-distributed algorithm. Still experimental. PerceptronApproachLegacy may still be used, which might be better for local non-cluster setups.
* Tokenizer now has a param 'includeDefaults' which may be set to False to disable all preset rules.
* WordEmbedding-based annotators may now decide to normalize tokens before matching embedding vectors through the 'useNormalizedTokensForEmbeddings' Param. Generally improves consistency and reduces overfitting.
* DocumentAssembler may now better deal with large amounts of text by using 'trimAndClearNewLines', to work better with OCR outputs and be ready for further sentence detection
* Improved SentenceDetector handling of enumerations and lists
* Slightly improved SentenceDetector performance through non-tail-recursive optimizations
* Finisher no longer has default delimiters when outputting to String (not Array) (thanks @S_L)
---------------
Bug fixes
---------------
* AWS library dependency conflict now resolved (thanks to @apiltamang for proposing the solution, and to the community for the follow-up). The solution is experimental; we are waiting for feedback.
* Fixed wrong order of further added Tokenizer's infixPatterns in Python (Thanks @sethah)
* Training annotators that use word embeddings in a distributed cluster no longer throws file-not-found exceptions sporadically
* Fixed NerDLModel returning non-deterministic results during prediction
* Deep-learning-based models and graphs may now run on CPU if trained on GPU and no GPU is available on the client
* WordEmbeddings temporary location no longer in HOME dir, moved to tmp.dir
* Fixed SentenceDetector incorrectly bounding sentences with non-English characters (Thanks @lorenz-nlp)
* Python Spark-NLP annotator models should now have all appropriate setter and getter functions for Params
* Fixed the wrong format of the column when showing metadata through the Finisher's Array output
* Added missing python Finisher's include metadata function (thanks @PinusSilvestris for reporting the bug)
* Fixed Symmetric Delete Spell Checker throwing wrong error when training with an empty dataset (Thanks @ankush)
---------------
Developer API
---------------
* Deep Learning models may now be read through SavedModelBundle API into Tensorflow for Java in TensorflowWrapper
* WordEmbeddings now allow checking if word exists with contains()
* Included a tool that converts text into CoNLL format for further labeling when training NER models
========
1.5.4
========
---------------
Overview
---------------
This release improves various annotators: the Normalizer, SymmetricDelete, TextMatcher, DocumentAssembler and Finisher,
allowing them to cover more use cases that were mentioned in our Slack channel. We also fixed two important bugs.
Finally, this is our first release with pip support for the Python sparknlp package, for those who are entirely Python-based.
---------------
Enhancements
---------------
* Normalizer now allows multiple to-delete regex patterns.
* Normalizer's slangDictionary param allows converting tokens into something else (e.g. 'lol' into 'laughing out loud') from a dictionary file (see the sketch after this list)
* The SymmetricDelete spell checker may now be trained from the dataset passed to fit() if an external corpus is not provided
* The SymmetricDelete spell checker improved training and prediction performance
* Finisher param includeMetadata now outputs annotation metadata content in both Array and String formats
* DocumentAssembler may now read from an Array[String] column if provided. This improves compatibility with some SparkML transformers
* TextMatcher now includes the identifier name in metadata
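A hypothetical sketch of the slang dictionary feature; the setter name is assumed from the slangDictionary param above, and the two-argument (path, delimiter) signature is an assumption as well:

    import com.johnsnowlabs.nlp.annotators.Normalizer

    // slang.txt holds lines like: lol,laughing out loud
    val normalizer = new Normalizer()
      .setInputCols(Array("token"))
      .setOutputCol("normalized")
      .setSlangDictionary("slang.txt", ",")  // assumed signature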
---------------
Bug fixes
---------------
* Fixed a bug introduced in 1.5.3 that made spark-nlp not work in Python 2 (thanks @surendralalwani)
* Fixed SymmetricDeleteApproach's wrong annotator type
---------------
Other
---------------
* setup.py for pip support (instructions will be added to the readme and website). Still needs the spark-nlp jar in the SparkSession classpath.
========
1.5.3
========
---------------
Overview
---------------
This quick release is a hotfix for issues found in 1.5.2 after its release. Thanks to the users who quickly tested this out.
It fixes the Symmetric spell checker not being capable of reading the pretrained model, a missing SentenceDetector default value, and adds retroactive version matching to the downloader.
---------------
Bug fixes
---------------
* Fixed a bug causing the library to fail when trying to save or read an annotator with an unset Feature without default
* Added missing default Param value to SentenceDetector. Thanks @superman24-7
* Symmetric spell checker now utilizes List instead of ListBuffer on its prediction layer
* Fixed Vivekn Sentiment Analysis failing when training with a sentiment column
---------------
Models
---------------
* Symmetric Spell Checker pretrained model now works well and may be downloaded
* Vivekn Sentiment pretrained model now defaults to "token" input column instead of "spell"
---------------
Other
---------------
* Downloader now works retroactively when a newer version finds a model of a previous release
* Renamed the downloader's folder argument to remote_loc for the remote location, as the old name caused confusion. Thanks @AtulSehgal
* Added a new Scala example in the examples folder, also available on the website
========
1.5.2
========
---------------
Overview
---------------
This release focuses on improving model downloader stability, fixing word embedding reading issues and properly joining
the Spark ecosystem's filesystem configuration, utilizing Spark's defined default filesystem, in order to work
properly with clusters and multi-node environments. This includes Databricks cloud clusters and Amazon EMR YARN HDFS nodes.
Aside from that, we come with exciting new features: a brand new spell checker with higher accuracy, inspired by the
symmetric delete algorithm.
Finally, Assertion Status can now be trained and predicted on top of NER output, whereas before
this only worked by providing assertion status Start and End boundaries for the target to assert.
---------------
New Features
---------------
* Assertion status annotators can now be trained and predict against NER output instead of start and end boundaries. Entities can now be directly asserted
* Brand new Symmetric Delete annotator (SymmetricDeleteApproach), with accuracy closer to the state of the art (around 80%)
---------------
Enhancements
---------------
* Model downloader now uses the proper Spark filesystem. Works seamlessly with distributed storage, Databricks cloud clusters and Amazon EMR
* Fixed several race conditions when loading word embeddings from disk or downloading resources; the library is more stable
* Improved several assertion status validations and error messages
---------------
Bug fixes
---------------
* Standalone Annotator models are now properly read from disk in Python
---------------
Models
---------------
* New Symmetric Delete Spell checker pretrained model
* Vivekn Sentiment annotator may now be downloaded standalone with pretrained()
========
1.5.1
========
---------------
Overview
---------------
This release is an enhancement release to 1.5.0 which includes improved downloader properties and better annotator defaults.
Also, assertion status models have been included as pretrained; these are models trained on top of Stanford's GloVe word embeddings.
---------------
Enhancements
---------------
* SentenceDetector now has a useCustomOnly param which enforces using only the custom bounds provided (thanks @atomobianco)
* Normalizer now defaults to not lowercasing words, leading to better implicit accuracy in pipelines (thanks @marek.modry)
* SpellChecker now defaults to case sensitive, leading to better accuracy
* DateMatcher improved speed performance
* com.johnsnowlabs.annotator._ in Scala now also includes RecursivePipelines and LightPipelines for easier imports
* ModelDownloader has been improved with better directory management
---------------
Models
---------------
* New Assertion Status (LogisticRegression and DeepLearning) pretrained models now available
* Vivekn, Basic and Advanced pretrained Pipelines improved accuracy (thanks @marek.modry)
---------------
Other
---------------
* S3 library dependencies updated
========
1.5.0
========
---------------
Overview
---------------
We are proud to announce what may be the biggest release yet in terms of content in Spark-NLP!
This release makes the library miles easier to use for newcomers, with easier-to-import
annotators and the extended use of the model downloader throughout pretrained models and pipelines.
It also includes two new annotators that use deep learning algorithms with graphs from TensorFlow, which
is a first for us.
Apart from this, we include new Light Pipelines that are over 10x faster when working with datasets smaller than about
50,000 rows.
Finally, we included several bugfixes across the library, from the algorithms to the developer API.
We'll gladly welcome any feedback! The website has been extensively updated.
---------------
New features
---------------
* Light Pipelines are Annotator Pipelines created from SparkML pipelines that run more than 10x faster on small datasets (see the sketch after this list)
* Deep Learning NER based on Bi-LSTM and Convolutional Neural Networks, trained from word embeddings datasets
* Deep Learning Assertion Status model based on LSTM to compute status identification from word embeddings
* Easier to use Spark-NLP:
1. Imports have been made easy in the Scala API (com.johnsnowlabs.annotator._) to bring all annotators
2. BasicPipeline and AdvancedPipeline downloadable pipelines created for quick annotation of text
3. Light Pipelines are easy to use and accept simple strings to annotate with a Spark ML Pipeline, without Spark datasets
* New downloadable models: CRF NER, Lemmatizer, POS and Spell checker
* New downloadable pipelines: Vivekn Sentiment analysis, BasicPipeline and AdvancedPipeline
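A minimal sketch of a Light Pipeline, assuming a fitted Spark ML PipelineModel named pipelineModel; the shape of the annotate() result is our assumption for this release:

    import com.johnsnowlabs.nlp.LightPipeline

    // Wrap a fitted pipeline and annotate plain strings without a DataFrame
    val light = new LightPipeline(pipelineModel)
    val annotations = light.annotate("Light pipelines are fast on small inputs")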
---------------
Enhancements
---------------
* Model downloader significantly improved in terms of usability
---------------
Documentation
---------------
* Website widely improved
* Added invite to our first slack chat channel
---------------
Bugfixes
---------------
* Fixed positional index wrong value when creating Annotations from constructor
* Fixed hamming distance calculation in spell checker
* Fixed Downloadable NER model failing sporadically due to missing temporary files
* Fixed the SearchTrie algorithm used in TextMatcher (formerly EntityExtractor); thanks @avenka11 for reporting and proposing a solution
* Fixed some model deserialization issues happening on Windows
---------------
Other
---------------
* Thanks to @showy we have TravisCI automatic integration testing
* Finisher now outputs to array by default
* Training example resources removed in favor of using the model downloader more
========
1.4.2
========
---------------
Bugfixes
---------------
* Filesystem protocols now properly read across the library, fixed use case for S3:// protocol (thanks @avenka11)
* Library now works properly in Windows environments
* PySpark annotator param getters now work properly when retrieving default values
* Fixed stemmer serialization due to misspelled param name
* Fixed Tokenizer infixPattern param name to infixPatterns, leading to broken pyspark serialization of such param
* Added missing addInfixPattern() function to PySpark, to allow adding patterns to current value
* Model Downloader clearCache now properly removes both .zip files and extracted content
* Model Downloader is now capable of reading all types of models properly
* Added missing clearCache function into PySpark
---------------
Developer API
---------------
* Function names in the model downloader code have been refactored consistently
---------------
Other
---------------
* RocksDB rolled back to previous version to support Windows
* NerCRF unittest modified to reduce time to test
* Removed training scripts from repository
* Updated the build's Spark and Scala versions
========
1.4.1
========
---------------
New features
---------------
* Model and Pipeline Downloader
We are glad to announce our first experimental model downloader, working in both Python and Scala.
It allows downloading pre-trained models from our public storage. This release does not include any pre-trained models yet,
just the logic to be able to download them.
---------------
Enhancements
---------------
* Improved the ExternalResource API (introduced in 1.4.0) to make it easier to provide external corpus and resource information
to annotators, such as readAs (which sets how you would like SparkNLP to read your source), delimiters and parse settings, among
other options that might be passed to the Spark Reader directly. Annotators using external sources now all share this functionality.
WordEmbeddings are not yet supported in this format.
* All python annotators now properly have getter functions to retrieve param values
--------------
Bugfixes
--------------
* Fixed some annotators in python not de-serializable on their own outside a Pipeline
* Fixed CRF NER not working when not using word embeddings (thanks @crisliu for reporting)
* Fixed Tokenizer not properly recognizing some stop words (thanks @easimadi)
* Fixed Tokenizer not properly recognizing composite tokens when changing target pattern param (thanks @easimadi)
* The ReadAs parameter is now properly read from string in all ExternalResource setters
---------------
Developer API
---------------
* PySpark API further improvements within AnnotatorApproach, AnnotatorModel and now private internal _AnnotatorModel for fit() result representation
* Automated getters have been written so that getter functions no longer need to be written manually in every annotator
-----------
Other
-----------
* RocksDB dependency rolled back to 5.2.1 for better universal compatibility particularly to support databricks platform
---------------
Documentation
---------------
* Updated website components page to match 1.4.x
* Replaced the notebooks site with a placeholder linking to the current Python notebooks, for lower maintenance
========
1.4.0
========
---------------
New features
---------------
* All annotator external sources have been unified through an ExternalResource component.
This is used to represent external data information and deals with content in HDFS or locally, just as Spark deals with data.
It also improves performance globally and allows customization
of how these sources are read (e.g. as RDD or line-by-line sequences)
* NorvigSweeting SpellChecker, ViveknSentiment and POS Perceptron can now train from the dataset passed to fit().
For Spell Checker, this will be applied if the user did not supply a corpus, forcing fit() to learn from words in the data column.
For ViveknSentiment and POS Perceptron, this strategy will be applied if sentimentCol and posCol params have been set respectively.
---------------
Enhancements
---------------
* ResourceHelper now has an improved SourceStream class which allows for more consistent HDFS/Filesystem reading by using
more of the Hadoop APIs.
* application.conf is now a global setting and can be overridden in run-time through ConfigLoader.setConfigPath(). It may also be accessed from PySpark
* PySpark API improved by creating AnnotatorApproach and AnnotatorModel classes
* EntityMatcher now uses recursive Pipelines
* Part-of-Speech tagging performance has been improved throughout the prediction algorithm
* EntityMatcher may now use RecursivePipeline in order to tokenize external data with the same pipeline provided by the user
---------------
Developer API
---------------
* PySpark API has been greatly improved to make it easier to extend JVM classes
* PySpark API improved for extending annotator approaches and models appropriately
----------------
Bugfixes
----------------
* Reverted a change that caused NER not to read datasets properly from HDFS
* Fixed EntityMatcher wrongly normalizing external content (thanks @sofianeh)
----------------
Documentation
----------------
* Fixed EntityMatcher documentation obsolete params (Thanks @sofianeh)
* Fixed NER CRF documentation in website
========
1.3.0
========
IMPORTANT: Pipelines from 1.2.6 or older cannot be loaded from 1.3.0
---------------
New features
---------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/94
Tokenizer annotator has been revamped. It now follows standard NLP rules, matching above 90% of StanfordNLP tokens.
The annotator now has more complex rules, allowing custom composite words to be set as exceptions (e.g. to not break 'New York')
along with custom prefix, infix, suffix and breaking rules. It uses regular expression groups in order to match various tokens per target word.
Defaults have been updated to also be language agnostic and to support foreign characters from the Unicode charset. (See the sketch after this list.)
* https://github.com/JohnSnowLabs/spark-nlp/pull/93
Assertion Status. This annotator identifies negated sequences within a target scope. Assertion status is a machine learning
annotator that works with a set of word embeddings; one such set is provided as part of our Python notebook examples.
* https://github.com/JohnSnowLabs/spark-nlp/pull/90
Recursive Pipelines. We have created our own Pipeline class which takes better advantage of Spark-NLP annotators.
Although this Pipeline is completely optional and works well with default Apache Spark estimators and transformers, it allows
training our annotators more efficiently by letting annotator approaches access the previous state of the Pipeline,
allowing them to use it to tokenize or transform their own external content. It is recommended to use such Pipelines.
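As referenced in the Tokenizer item above, a hedged sketch of setting composite-word exceptions; the setter name and signature are assumptions for this release (the Param was renamed in later versions):

    import com.johnsnowlabs.nlp.annotators.Tokenizer

    // Keep multi-word entities such as 'New York' as single tokens
    val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")
      .setCompositeTokens(Array("New York"))  // assumed setter name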
----------------
Enhancements
----------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/83
Part-of-Speech training has been improved in both performance and quality, and now makes better use of the input corpus provided.
New params corpusFormat and corpusLimit have been added for more control over training: they set
whether to read training data as a Dataset or as raw text files, and the limit on the number of files if a folder is provided
* https://github.com/JohnSnowLabs/spark-nlp/pull/84
Thanks to @lambdaofgod, the Normalizer can now optionally lowercase tokens
* Thanks to Lorenz Bernauer, the Normalizer's default pattern is now language agnostic, no longer breaking Unicode characters such as Spanish or German letters
* Features now have appropriate default values which are lazy by nature and executed only once upon request. As a side effect, this improves the Lemmatizer's performance.
* RuleFactory (a regex rule factory) performance has been improved by using a factory pattern and not re-checking its strategy on every transformation at run-time.
This may have positive side effects in SentenceDetector, DateMatcher and RegexMatcher, which use this class extensively.
----------------
Class Renames
----------------
RegexTokenizer -> Tokenizer (it is not just regex anymore)
SentenceDetectorModel -> SentenceDetector (it is not a model, it is a rule-based algorithm)
SentimentDetectorModel -> SentimentDetector (it is not a model, it is a rule-based algorithm)
----------------
User Utilities
----------------
* ResourceHelper has a createDatasetFromText function which allows the user to more
easily read one or multiple text files from a path into a dataset, with various options,
including filename-by-row or by-file aggregation. This class should be more widely
used, since it helps with parsing local files. It shall be better documented.
* com.johnsnowlabs.util now contains a Benchmark class which allows measuring the time of
any function easily, by using it as Benchmark.time("Description of measured") {someFunction()}
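Spelled out as a compilable snippet, with an arbitrary self-contained body to time:

    import com.johnsnowlabs.util.Benchmark

    // Prints the description together with the elapsed time of the block
    Benchmark.time("Summing a range") {
      (1 to 1000000).sum
    }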
----------------
Developer API
----------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/89/files
Word embedding traits have been generalized. Now any annotator who might want to use them can easily access their properties
* Recursive pipelines now allow injecting PipelineModel object into train() stage. It is an optional parameter. If the user
utilizes RecursivePipeline, the annotator might use this pipeline for transforming secondary data inputs.
* Annotator abstract class has been divided into a previous RawAnnotator class which contains all annotator properties
and validations, but does not make use of the annotate() function. This allows annotators that need to work directly with
the transform() call, but also participate between other annotators in the pipeline
----------------
Bugfixes
----------------
* Fixed a bug in annotators with word embeddings not correctly serializing into disk
* Fixed a bug creating temporary folders in home folder
* Fixed a broken geospatial pattern in sentence detection
========
1.2.6
========
---------------
Enhancements
---------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/82
Vivekn Sentiment Analysis improved memory consumption and training performance.
The pruneCorpus parameter is now adjustable and defaults to 1. Higher values lead to better performance,
but are meant for larger corpora. The tokenPattern params allow different tokenization regexes
within the corpora provided to the Vivekn and Norvig models.
* https://github.com/JohnSnowLabs/spark-nlp/pull/81
Serialization improvements. The new default format is RDD objects (Parquet lasted little); it proved to be lighter on
heap memory. Also added lazier default values for Feature containers. New application.conf performance tuning
settings allow customizing whether Features are broadcast or not, and whether to use Parquet or objects in serialization.
========
1.2.5
========
IMPORTANT: Pipelines from 1.2.4 or older cannot be loaded from 1.2.5
---------------
New features
---------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/70
Word embeddings parameter for CRF NER annotator
* https://github.com/JohnSnowLabs/spark-nlp/pull/78
Annotator Features replace Params and are now serialized using Kryo and partitioned files, increasing performance and lowering
memory consumption in the Driver when saving and loading pipelines with large corpora. Such Features are now also broadcast
for better performance in distributed environments. This enhancement is a breaking change and does not allow loading older pipelines.
----------------
Bug fixes
----------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/cb9aa4366f3e2c9863482df39e07b7bacff13049
Stemmer was not capable of being deserialized (now implements DefaultParamsReadable)
* https://github.com/JohnSnowLabs/spark-nlp/pull/75
Sentence Boundary detector was not properly setting bounds
----------------
Documentation (thanks @maziyarpanahi)
----------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/79
Typo in code
* https://github.com/JohnSnowLabs/spark-nlp/pull/74
Bad description
========
1.2.4
========
---------------
New features
---------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/c17ddac7a5a9e775cddc18d672e80e60f0040e38
ResourceHelper now allows input files to be read as Spark Datasets, implicitly enabling HDFS paths and allowing larger annotator input files. Requires setting 'TXTDS' as the input format Param to let annotators read this way. Allowed in: Lemmatizer, EntityExtractor, RegexMatcher, Sentiment Analysis models, Spell Checker and Dependency Parser.
---------------
Enhancements and progress
---------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/4920e5ce394b25937969cc4cab1d81172be722a3
CRF NER Benchmarking progress
* https://github.com/JohnSnowLabs/spark-nlp/pull/64
EntityExtractor refactored. This annotator uses an input file containing a list of entities to look for inside the target text. It has been refactored for better usability and, specifically, for speed, by using a Trie search algorithm. Proper examples are included in the Python notebooks.
---------------
Bug fixes
---------------
* Issue https://github.com/JohnSnowLabs/spark-nlp/issues/41 <> https://github.com/JohnSnowLabs/spark-nlp/commit/d3b9086e834233f3281621d7c82e32195479fc82
Fixed default resources not being loaded properly when using the library through --spark-packages. Improved input reading from resources and folder resources, falling back to disk, with better error handling.
* https://github.com/JohnSnowLabs/spark-nlp/commit/08405858c6186e6c3e8b668233e30df12fa50374
Corrected param names in DocumentAssembler
* Issue https://github.com/JohnSnowLabs/spark-nlp/issues/58 <> https://github.com/JohnSnowLabs/spark-nlp/commit/5a533952cdacf67970c5a8042340c8a4c9416b13
Deleted a left-over deprecated function which was misleading.
* https://github.com/JohnSnowLabs/spark-nlp/commit/c02591bd683db3f615150d7b1d121ffe5d9e4535
Added filtering to ensure no empty sentences arrive at unnormalized Vivekn Sentiment Analysis
---------------
Documentation and examples
---------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/b81e95ce37ed3c4bd7b05e9f9c7b63b31d57e660
Added additional resources into FAQ page.
* https://github.com/JohnSnowLabs/spark-nlp/commit/0c3f43c0d3e210f3940f7266fe84426900a6294e
Added Spark Summit example notebook with full Pipeline use case
* Issue https://github.com/JohnSnowLabs/spark-nlp/issues/53 <> https://github.com/JohnSnowLabs/spark-nlp/commit/20efe4a3a5ffbceedac7bf775466b7a8cde5044f
Fixed scala python documentation mistakes
* https://github.com/JohnSnowLabs/spark-nlp/commit/782eb8dce171b69a615887b3defaf8b729b735f2
Typos fix
---------------
Other
---------------
* https://github.com/JohnSnowLabs/spark-nlp/commit/91d8acb1f0f4840dad86db3319d0b062bd63b8c6
Removed Regex NER due to slowness and little use. CRF NER to replace NER.
========
1.2.3
========
---------------
Bugfixes
---------------
* Sentence detection not properly bounding punctuation marks
* Sentence detection abbreviations feature disabled by default, since it is a slow feature and not necessarily beneficial
* Sentence detection punctuation bounds improved algorithm performance by removing redundant formatting
* Fixed a bug causing sentiment analysis not to work when using normalized tokens, which may cause tokens to be deleted from the sentence
* Fixed ResourceHelper text reading missing the first line from a text stream. More improvements to come later.
---------------
Other
---------------
CRF NER progress, word embeddings support. Not yet officially released.
========
1.2.2
========
---------------
New features
---------------
* Finisher parameter to export output as Array
========
1.2.1
========
---------------
Bugfixes
---------------
* Finisher not properly deleting annotation columns even when set to true explicitly
========
1.2.0
========
--------------
New features
--------------
* New annotator: CRF NER
* New transformer: Token Assembler
* Added custom bounds parameter in sentence detector
* Added custom pattern parameter in Normalizer
* SpellChecker able to train from text files as a dataset
* Typesafe configuration may now be read from disk at runtime
-------------------------
Performance improvements
-------------------------
* Enabled and suggested KRYO Serializer for better pipeline write-read
--------------------
Code improvements
--------------------
* Pack/Unpack in annotators leads to better code reuse
* ResourceHelper has reduced IO responsibility
* Annotations centralize the main result; Finisher improvements
---------------
Bugfixes
---------------
* RegexMatcher fixed to read from input files properly
* DocumentAssembler fixed relative positioning
------------------
Release framework
------------------
* Better package release readiness for different repositories
* Organization package name change due to central's standards