Spark NLP 5.5.2 Release Candidate #14473

Merged · 13 commits · Dec 18, 2024

Changes from all commits
34 changes: 34 additions & 0 deletions CHANGELOG
@@ -1,3 +1,37 @@
========
5.5.2
========
----------------
New Features & Enhancements
----------------
* OpenVINO Support for Transformers (PR #14408):
Added OpenVINO inference support to a broad range of transformer-based annotators, including DeBertaForQuestionAnswering, DeBertaForSequenceClassification, RoBertaForTokenClassification, XlmRobertaForZeroShotClassification, BartTransformer, GPT2Transformer, and many others.
* BLIPForQuestionAnswering Transformer (PR #14422):
Introduced a new transformer BLIPForQuestionAnswering for image-based question answering tasks. The transformer processes images alongside associated questions to provide relevant answers.
* AutoGGUFEmbeddings Annotator (PR #14433):
Added AutoGGUFEmbeddings to support embeddings from AutoGGUFModels, providing rich sentence embeddings. Includes an end-to-end example notebook for usage.
* HTML Parsing into DataFrame (PR #14449):
Introduced sparknlp.read().html() to parse local or remote HTML files and convert them into structured Spark DataFrames for easier analysis.
* Email Parsing into DataFrame (PR #14455):
Added the sparknlp.read().email() method to parse email files into structured DataFrames, enabling scalable analysis of email content. (Note: depends on #14449; a usage sketch for both readers follows this list.)
* Microsoft Word Document Parsing into DataFrame (PR #14476):
Added a new feature to parse .docx and .doc files into a Spark DataFrame, streamlining the integration of Word documents into NLP pipelines.
* Microsoft Fabric Support (PR #14467):
Introduced support for leveraging Microsoft Fabric for word embeddings storage and retrieval, enhancing scalability and efficiency.
* cuDNN Upgrade Instructions on Databricks (PR #14451):
Added instructions on upgrading cuDNN for GPU inference and cleaned up redundant Databricks installation instructions.
* ChunkEmbeddings Metadata Preservation (PR #14462):
Modified ChunkEmbeddings to preserve the original chunk’s metadata in the resulting embeddings, ensuring richer contextual information is retained.
* Default Names and Languages for Annotators (PR #14469):
Updated default names and language configurations for newly created seq2seq annotators to improve consistency and clarity.
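
A minimal usage sketch for the two new readers described above (the local directory and the URL are illustrative placeholders, and the printed schema depends on each reader's output format):

```python
import sparknlp

# Parse a remote HTML page into a structured Spark DataFrame (PR #14449)
html_df = sparknlp.read().html("https://www.wikipedia.org")
html_df.printSchema()

# Parse local email files into a structured Spark DataFrame (PR #14455)
email_df = sparknlp.read().email("./email-files")
email_df.printSchema()
```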

----------------
Bug Fixes
----------------
* Spark Version Errors (PR #14467):
Resolved issues caused by long Spark version strings when integrating Microsoft Fabric support.


========
5.5.1
========
20 changes: 10 additions & 10 deletions README.md
@@ -55,15 +55,15 @@ documentation and examples

## Quick Start

This is a quick example of how to use a Spark NLP pre-trained pipeline in Python and PySpark:

```sh
$ java -version
# should be Java 8 or 11 (Oracle or OpenJDK)
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
$ pip install spark-nlp==5.5.2 pyspark==3.3.1
```

In Python console or Jupyter `Python3` kernel:
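
A minimal sketch of the quick-start pattern this section demonstrates (the pipeline name `explain_document_dl` and the sample sentence are illustrative assumptions):

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Start a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Download a pretrained pipeline and annotate a sample sentence
pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("Spark NLP is an open-source text processing library.")
print(result["entities"])
```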
@@ -129,7 +129,7 @@ For a quick example of using pipelines and models take a look at our official [d

### Apache Spark Support

Spark NLP *5.5.2* has been built on top of Apache Spark 3.4 while fully supporting Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x, and 3.5.x

| Spark NLP | Apache Spark 3.5.x | Apache Spark 3.4.x | Apache Spark 3.3.x | Apache Spark 3.2.x | Apache Spark 3.1.x | Apache Spark 3.0.x | Apache Spark 2.4.x | Apache Spark 2.3.x |
|-----------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
@@ -157,7 +157,7 @@ Find out more about 4.x `SparkNLP` versions in our official [documentation](http

### Databricks Support

Spark NLP 5.5.2 has been tested and is compatible with the following runtimes:

| **CPU** | **GPU** |
|--------------------|--------------------|
@@ -174,7 +174,7 @@ We are compatible with older runtimes. For a full list check Databricks support

### EMR Support

Spark NLP 5.5.2 has been tested and is compatible with the following EMR releases:

| **EMR Release** |
|--------------------|
@@ -205,7 +205,7 @@ deployed to Maven central. To add any of our packages as a dependency in your ap
from our official documentation.

If you are interested, there is a simple SBT project for Spark NLP to guide you on how to use it in your
projects [Spark NLP SBT Starter](https://github.com/maziyarpanahi/spark-nlp-starter)

### Python

@@ -214,7 +214,7 @@ Check all available installations for Python in our official [documentation](htt

### Compiled JARs

To compile the jars from source follow [these instructions](https://sparknlp.org/docs/en/compiled#jars) from our official documentation

## Platform-Specific Instructions

@@ -234,7 +234,7 @@ For detailed instructions on how to use Spark NLP on supported platforms, please

The Spark NLP library and all pre-trained models/pipelines can be used entirely offline, with no access to the Internet.
Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation
to use Spark NLP offline.

## Advanced Settings

@@ -250,7 +250,7 @@ In Spark NLP we can define S3 locations to:

Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation.

## Documentation

### Examples

@@ -283,7 +283,7 @@ the Spark NLP library:
```bibtex
keywords = {Spark, Natural language processing, Deep learning, Tensorflow, Cluster},
abstract = {Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in more than 192+ languages. It supports nearly all the NLP tasks and modules that can be used seamlessly in a cluster. Downloaded more than 2.7 million times and experiencing 9x growth since January 2020, Spark NLP is used by 54% of healthcare organizations as the world’s most widely used NLP library in the enterprise.}
}
}
```

## Community support
12 changes: 10 additions & 2 deletions build.sbt
@@ -6,7 +6,7 @@ name := getPackageName(is_silicon, is_gpu, is_aarch64)

organization := "com.johnsnowlabs.nlp"

version := "5.5.2"

(ThisBuild / scalaVersion) := scalaVer

@@ -157,7 +157,14 @@ lazy val utilDependencies = Seq(
greex,
azureIdentity,
azureStorage,
jsoup,
jakartaMail,
angusMail,
poiDocx
exclude ("org.apache.logging.log4j", "log4j-api"),
scratchpad
exclude ("org.apache.logging.log4j", "log4j-api")
)

lazy val typedDependencyParserDependencies = Seq(junit)

@@ -230,6 +237,7 @@ lazy val root = (project in file("."))

(assembly / assemblyMergeStrategy) := {
case PathList("META-INF", "versions", "9", "module-info.class") => MergeStrategy.discard
case PathList("module-info.class") => MergeStrategy.discard // Discard any module-info.class globally
case PathList("apache.commons.lang3", _ @_*) => MergeStrategy.discard
case PathList("org.apache.hadoop", _ @_*) => MergeStrategy.first
case PathList("com.amazonaws", _ @_*) => MergeStrategy.last
2 changes: 1 addition & 1 deletion docs/_layouts/landing.html
@@ -201,7 +201,7 @@ <h3 class="grey h3_title">{{ _section.title }}</h3>
<div class="highlight-box">
{% highlight bash %}
# Using PyPI
$ pip install spark-nlp==5.5.2

# Using Anaconda/Conda
$ conda install -c johnsnowlabs spark-nlp
16 changes: 13 additions & 3 deletions docs/en/advanced_settings.md
@@ -52,7 +52,7 @@
```python
spark = SparkSession.builder
.config("spark.kryoserializer.buffer.max", "2000m")
.config("spark.jsl.settings.pretrained.cache_folder", "sample_data/pretrained")
.config("spark.jsl.settings.storage.cluster_tmp_dir", "sample_data/storage")
.config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.2")
.getOrCreate()
```

@@ -66,7 +66,7 @@
```sh
spark-shell \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.2
```

**pyspark:**
@@ -79,7 +79,7 @@
```sh
pyspark \
--conf spark.kryoserializer.buffer.max=2000M \
--conf spark.jsl.settings.pretrained.cache_folder="sample_data/pretrained" \
--conf spark.jsl.settings.storage.cluster_tmp_dir="sample_data/storage" \
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.2
```

**Databricks:**
@@ -96,6 +96,16 @@ spark.jsl.settings.annotator.log_folder dbfs:/PATH_TO_LOGS

NOTE: If this is an existing cluster, you need to restart it after adding new configs or changing existing properties.

#### Additional Configuration for Databricks
When running the email reader feature `sparknlp.read().email("./email-files")` on Databricks, you must include the following Spark configurations to avoid dependency conflicts:

```bash
spark.driver.userClassPathFirst true
spark.executor.userClassPathFirst true
```
These configurations are required because the Databricks runtime environment includes a bundled version of the `com.sun.mail:jakarta.mail` library, which conflicts with `jakarta.activation`.
Setting these properties ensures that user-provided libraries take precedence over those bundled in the Databricks environment, resolving the dependency conflict.
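
With those properties set on the cluster, the reader can be used as usual; a minimal sketch (the `./email-files` path is an illustrative placeholder):

```python
import sparknlp

# Works only if spark.driver.userClassPathFirst and
# spark.executor.userClassPathFirst are set to true on the cluster
email_df = sparknlp.read().email("./email-files")

# Inspect the parsed structure before wiring it into an NLP pipeline
email_df.printSchema()
```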

</div><div class="h3-box" markdown="1">

### S3 Integration
Expand Down
123 changes: 123 additions & 0 deletions docs/en/annotator_entries/AutoGGUFEmbeddings.md
@@ -0,0 +1,123 @@
{%- capture title -%}
AutoGGUFEmbeddings
{%- endcapture -%}

{%- capture description -%}
Annotator that uses the llama.cpp library to generate text embeddings with large language
models.

The type of embedding pooling can be set with the `setPoolingType` method. The default is
`"MEAN"`. The available options are `"NONE"`, `"MEAN"`, `"CLS"`, and `"LAST"`.

If the parameters are not set, the annotator defaults to the parameters provided by
the model.

Pretrained models can be loaded with `pretrained` of the companion object:

```scala
val autoGGUFEmbeddings = AutoGGUFEmbeddings.pretrained()
.setInputCols("document")
.setOutputCol("embeddings")
```

The default model is `"nomic-embed-text-v1.5.Q8_0.gguf"`, if no name is provided.

For available pretrained models please see the [Models Hub](https://sparknlp.org/models).

For extended examples of usage, see the
[AutoGGUFEmbeddingsTest](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/test/scala/com/johnsnowlabs/nlp/annotators/seq2seq/AutoGGUFEmbeddingsTest.scala)
and the
[example notebook](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/llama.cpp/llama.cpp_in_Spark_NLP_AutoGGUFEmbeddings.ipynb).

**Note**: To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set
the number of GPU layers with the `setNGpuLayers` method.

When using larger models, we recommend adjusting GPU usage with `setNCtx` and `setNGpuLayers`
according to your hardware to avoid out-of-memory errors.
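
For example, GPU usage can be tuned like this (the values shown are illustrative and should be adjusted to your hardware):

```python
embeddings = AutoGGUFEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings") \
    .setNCtx(4096) \
    .setNGpuLayers(99)
```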
{%- endcapture -%}

{%- capture input_anno -%}
DOCUMENT
{%- endcapture -%}

{%- capture output_anno -%}
SENTENCE_EMBEDDINGS
{%- endcapture -%}

{%- capture python_example -%}
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> document = DocumentAssembler() \
... .setInputCol("text") \
... .setOutputCol("document")
>>> autoGGUFEmbeddings = AutoGGUFEmbeddings.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("embeddings") \
...     .setBatchSize(4) \
...     .setNGpuLayers(99) \
...     .setPoolingType("MEAN")
>>> pipeline = Pipeline().setStages([document, autoGGUFEmbeddings])
>>> data = spark.createDataFrame([["The moons of Jupiter are 77 in total, with 79 confirmed natural satellites and 2 man-made ones."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("embeddings.embeddings").show(truncate=80)
+--------------------------------------------------------------------------------+
| embeddings|
+--------------------------------------------------------------------------------+
|[[-0.034486726, 0.07770534, -0.15982522, -0.017873349, 0.013914132, 0.0365736...|
+--------------------------------------------------------------------------------+
{%- endcapture -%}

{%- capture scala_example -%}
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val autoGGUFEmbeddings = AutoGGUFEmbeddings
.pretrained()
.setInputCols("document")
.setOutputCol("embeddings")
.setBatchSize(4)
.setPoolingType("MEAN")

val pipeline = new Pipeline().setStages(Array(document, autoGGUFEmbeddings))

val data = Seq(
"The moons of Jupiter are 77 in total, with 79 confirmed natural satellites and 2 man-made ones.")
.toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("embeddings.embeddings").show(1, truncate=80)
+--------------------------------------------------------------------------------+
| embeddings|
+--------------------------------------------------------------------------------+
|[[-0.034486726, 0.07770534, -0.15982522, -0.017873349, 0.013914132, 0.0365736...|
+--------------------------------------------------------------------------------+
{%- endcapture -%}

{%- capture api_link -%}
[AutoGGUFEmbeddings](/api/com/johnsnowlabs/nlp/embeddings/AutoGGUFEmbeddings)
{%- endcapture -%}

{%- capture python_api_link -%}
[AutoGGUFEmbeddings](/api/python/reference/autosummary/sparknlp/annotator/embeddings/auto_gguf_embeddings/index.html)
{%- endcapture -%}

{%- capture source_link -%}
[AutoGGUFEmbeddings](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/embeddings/AutoGGUFEmbeddings.scala)
{%- endcapture -%}

{% include templates/anno_template.md
title=title
description=description
input_anno=input_anno
output_anno=output_anno
python_example=python_example
scala_example=scala_example
api_link=api_link
python_api_link=python_api_link
source_link=source_link
%}
1 change: 1 addition & 0 deletions docs/en/annotators.md
@@ -45,6 +45,7 @@ There are two types of Annotators:
{:.table-model-big}
|Annotator|Description|Version |
|---|---|---|
{% include templates/anno_table_entry.md path="" name="AutoGGUFEmbeddings" summary="Annotator that uses the llama.cpp library to generate text embeddings with large language models."%}
{% include templates/anno_table_entry.md path="" name="AutoGGUFModel" summary="Annotator that uses the llama.cpp library to generate text completions with large language models."%}
{% include templates/anno_table_entry.md path="" name="BGEEmbeddings" summary="Sentence embeddings using BGE."%}
{% include templates/anno_table_entry.md path="" name="BigTextMatcher" summary="Annotator to match exact phrases (by token) provided in a file against a Document."%}
2 changes: 1 addition & 1 deletion docs/en/concepts.md
@@ -66,7 +66,7 @@
```sh
$ java -version
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
# spark-nlp by default is based on pyspark 3.x
$ pip install spark-nlp==5.5.2 pyspark==3.3.1 jupyter
$ jupyter notebook
```

4 changes: 2 additions & 2 deletions docs/en/examples.md
@@ -18,7 +18,7 @@
```sh
$ java -version
# should be Java 8 (Oracle or OpenJDK)
$ conda create -n sparknlp python=3.7 -y
$ conda activate sparknlp
$ pip install spark-nlp==5.5.2 pyspark==3.3.1
```

</div><div class="h3-box" markdown="1">
@@ -40,7 +40,7 @@ This script comes with the two options to define `pyspark` and `spark-nlp` versions
```bash
# -p is for pyspark
# -s is for spark-nlp
# by default they are set to the latest
!bash colab.sh -p 3.2.3 -s 5.5.2
```

[Spark NLP quick start on Google Colab](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/quick_start_google_colab.ipynb) is a live demo on Google Colab that performs named entity recognitions and sentiment analysis by using Spark NLP pretrained pipelines.
2 changes: 1 addition & 1 deletion docs/en/hardware_acceleration.md
@@ -50,7 +50,7 @@ Since the new Transformer models such as BERT for Word and Sentence embeddings a
| DeBERTa Large | +477%(5.8x) |
| Longformer Base | +52%(1.5x) |

Spark NLP 5.5.2 is built with TensorFlow 2.7.1, and the following NVIDIA® software is required only for GPU support:

- NVIDIA® GPU drivers version 450.80.02 or higher
- CUDA® Toolkit 11.2
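
On the Python side, GPU inference can be enabled by starting Spark NLP with its GPU package; a minimal sketch, assuming the NVIDIA® stack above is installed:

```python
import sparknlp

# Launches a Spark session backed by the spark-nlp-gpu package
spark = sparknlp.start(gpu=True)
print(sparknlp.version())
```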