Spark NLP 5.5.2 Release Candidate #14473

Merged · 13 commits · Dec 18, 2024
6 changes: 3 additions & 3 deletions README.md
@@ -55,7 +55,7 @@ documentation and examples

## Quick Start

-This is a quick example of how to use Spark NLP pre-trained pipeline in Python and PySpark:
+This is a quick example of how to use a Spark NLP pre-trained pipeline in Python and PySpark:

```sh
$ java -version
```

@@ -214,7 +214,7 @@ Check all available installations for Python in our official [documentation](htt

### Compiled JARs

-To compile the jars from source follow [these instructions](https://sparknlp.org/docs/en/compiled#jars) from our official documenation
+To compile the jars from source follow [these instructions](https://sparknlp.org/docs/en/compiled#jars) from our official documentation

## Platform-Specific Instructions

@@ -234,7 +234,7 @@ For detailed instructions on how to use Spark NLP on supported platforms, please

Spark NLP library and all the pre-trained models/pipelines can be used entirely offline with no access to the Internet.
Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation
-to use Spark NLP offline
+to use Spark NLP offline.

## Advanced Settings

10 changes: 9 additions & 1 deletion build.sbt
@@ -157,7 +157,14 @@ lazy val utilDependencies = Seq(
greex,
azureIdentity,
azureStorage,
-  jsoup)
+  jsoup,
+  jakartaMail,
+  angusMail,
+  poiDocx
+    exclude ("org.apache.logging.log4j", "log4j-api"),
+  scratchpad
+    exclude ("org.apache.logging.log4j", "log4j-api")
+)

lazy val typedDependencyParserDependencies = Seq(junit)

@@ -230,6 +237,7 @@ lazy val root = (project in file("."))

(assembly / assemblyMergeStrategy) := {
case PathList("META-INF", "versions", "9", "module-info.class") => MergeStrategy.discard
case PathList("module-info.class") => MergeStrategy.discard // Discard any module-info.class globally
case PathList("apache.commons.lang3", _ @_*) => MergeStrategy.discard
case PathList("org.apache.hadoop", _ @_*) => MergeStrategy.first
case PathList("com.amazonaws", _ @_*) => MergeStrategy.last
10 changes: 10 additions & 0 deletions docs/en/advanced_settings.md
@@ -96,6 +96,16 @@ spark.jsl.settings.annotator.log_folder dbfs:/PATH_TO_LOGS

NOTE: If this is an existing cluster, after adding new configs or changing existing properties you need to restart it.

#### Additional Configuration for Databricks
When running the Email Reader feature `sparknlp.read().email("./email-files")` on Databricks, you need to include the following Spark configurations to avoid dependency conflicts:

```bash
spark.driver.userClassPathFirst true
spark.executor.userClassPathFirst true
```
These configurations are required because the Databricks runtime includes a bundled version of the `com.sun.mail:jakarta.mail` library, which conflicts with `jakarta.activation`.
Setting these properties ensures that the user-provided libraries take precedence over those bundled in the Databricks environment, resolving the dependency conflict.
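
For reference, a minimal sketch of using the Email Reader once these configurations are in place (the directory path is illustrative, and the exact output schema may vary by Spark NLP version):

```python
import sparknlp

# On Databricks a Spark session already exists; sparknlp.start() is shown for completeness
spark = sparknlp.start()

# Parse a directory of email files into a Spark DataFrame
email_df = sparknlp.read().email("./email-files")

# Inspect the parsed result
email_df.printSchema()
email_df.show(5, truncate=80)
```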

</div><div class="h3-box" markdown="1">

### S3 Integration
123 changes: 123 additions & 0 deletions docs/en/annotator_entries/AutoGGUFEmbeddings.md
@@ -0,0 +1,123 @@
{%- capture title -%}
AutoGGUFEmbeddings
{%- endcapture -%}

{%- capture description -%}
Annotator that uses the llama.cpp library to generate text embeddings with large language
models.

The type of embedding pooling can be set with the `setPoolingType` method. The default is
`"MEAN"`. The available options are `"NONE"`, `"MEAN"`, `"CLS"`, and `"LAST"`.

If the parameters are not set, the annotator will default to the parameters provided by
the model.

Pretrained models can be loaded with `pretrained` of the companion object:

```scala
val autoGGUFEmbeddings = AutoGGUFEmbeddings.pretrained()
.setInputCols("document")
.setOutputCol("embeddings")
```

The default model is `"nomic-embed-text-v1.5.Q8_0.gguf"`, if no name is provided.

For available pretrained models please see the [Models Hub](https://sparknlp.org/models).

For extended examples of usage, see the
[AutoGGUFEmbeddingsTest](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/test/scala/com/johnsnowlabs/nlp/annotators/seq2seq/AutoGGUFEmbeddingsTest.scala)
and the
[example notebook](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/llama.cpp/llama.cpp_in_Spark_NLP_AutoGGUFEmbeddings.ipynb).

**Note**: To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set
the number of GPU layers with the `setNGpuLayers` method.

When using larger models, we recommend adjusting GPU usage with `setNCtx` and `setNGpuLayers`
according to your hardware to avoid out-of-memory errors.
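
For example, a configuration for a larger model might look like the following sketch (the values are illustrative and should be tuned to your hardware):

```scala
val embeddingsForLargeModel = AutoGGUFEmbeddings
  .pretrained()
  .setInputCols("document")
  .setOutputCol("embeddings")
  .setNCtx(4096) // context size
  .setNGpuLayers(99) // offload as many layers to the GPU as memory allows
```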
{%- endcapture -%}

{%- capture input_anno -%}
DOCUMENT
{%- endcapture -%}

{%- capture output_anno -%}
SENTENCE_EMBEDDINGS
{%- endcapture -%}

{%- capture python_example -%}
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> document = DocumentAssembler() \
... .setInputCol("text") \
... .setOutputCol("document")
>>> autoGGUFEmbeddings = AutoGGUFEmbeddings.pretrained() \
... .setInputCols(["document"]) \
...     .setOutputCol("embeddings") \
... .setBatchSize(4) \
... .setNGpuLayers(99) \
... .setPoolingType("MEAN")
>>> pipeline = Pipeline().setStages([document, autoGGUFEmbeddings])
>>> data = spark.createDataFrame([["The moons of Jupiter are 77 in total, with 79 confirmed natural satellites and 2 man-made ones."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("embeddings.embeddings").show(truncate=80)
+--------------------------------------------------------------------------------+
| embeddings|
+--------------------------------------------------------------------------------+
|[[-0.034486726, 0.07770534, -0.15982522, -0.017873349, 0.013914132, 0.0365736...|
+--------------------------------------------------------------------------------+
{%- endcapture -%}

{%- capture scala_example -%}
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val autoGGUFEmbeddings = AutoGGUFEmbeddings
.pretrained()
.setInputCols("document")
.setOutputCol("embeddings")
.setBatchSize(4)
.setPoolingType("MEAN")

val pipeline = new Pipeline().setStages(Array(document, autoGGUFEmbeddings))

val data = Seq(
"The moons of Jupiter are 77 in total, with 79 confirmed natural satellites and 2 man-made ones.")
.toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("embeddings.embeddings").show(1, truncate=80)
+--------------------------------------------------------------------------------+
| embeddings|
+--------------------------------------------------------------------------------+
|[[-0.034486726, 0.07770534, -0.15982522, -0.017873349, 0.013914132, 0.0365736...|
+--------------------------------------------------------------------------------+
{%- endcapture -%}

{%- capture api_link -%}
[AutoGGUFEmbeddings](/api/com/johnsnowlabs/nlp/embeddings/AutoGGUFEmbeddings)
{%- endcapture -%}

{%- capture python_api_link -%}
[AutoGGUFEmbeddings](/api/python/reference/autosummary/sparknlp/annotator/embeddings/auto_gguf_embeddings/index.html)
{%- endcapture -%}

{%- capture source_link -%}
[AutoGGUFEmbeddings](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/embeddings/AutoGGUFEmbeddings.scala)
{%- endcapture -%}

{% include templates/anno_template.md
title=title
description=description
input_anno=input_anno
output_anno=output_anno
python_example=python_example
scala_example=scala_example
api_link=api_link
python_api_link=python_api_link
source_link=source_link
%}
1 change: 1 addition & 0 deletions docs/en/annotators.md
@@ -45,6 +45,7 @@ There are two types of Annotators:
{:.table-model-big}
|Annotator|Description|Version |
|---|---|---|
{% include templates/anno_table_entry.md path="" name="AutoGGUFEmbeddings" summary="Annotator that uses the llama.cpp library to generate text embeddings with large language models."%}
{% include templates/anno_table_entry.md path="" name="AutoGGUFModel" summary="Annotator that uses the llama.cpp library to generate text completions with large language models."%}
{% include templates/anno_table_entry.md path="" name="BGEEmbeddings" summary="Sentence embeddings using BGE."%}
{% include templates/anno_table_entry.md path="" name="BigTextMatcher" summary="Annotator to match exact phrases (by token) provided in a file against a Document."%}
70 changes: 32 additions & 38 deletions docs/en/install.md
@@ -620,6 +620,8 @@ pointed [here](#python-without-explicit-pyspark-installation)

## Databricks Cluster

### Install Spark NLP on Databricks

1. Create a cluster if you don't have one already

2. On a new cluster or existing one you need to add the following to the `Advanced Options -> Spark` tab:
@@ -631,15 +633,37 @@ pointed [here](#python-without-explicit-pyspark-installation)

3. In `Libraries` tab inside your cluster you need to follow these steps:

3.1. Install New -> PyPI -> `spark-nlp==5.5.1` -> Install

3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.1` -> Install

4. Now you can attach your notebook to the cluster and use Spark NLP!

NOTE: Databricks' runtimes support different Apache Spark major releases. Please make sure you choose the correct Spark NLP Maven package name (Maven Coordinate) for your runtime from our [Packages Cheatsheet](https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet)
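
For example (illustrative coordinates; check the cheatsheet for the exact name matching your runtime):

```bash
# CPU package
com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.1
# GPU (CUDA) package
com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.5.1
```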

#### ONNX GPU Inference on Databricks

To run inference with ONNX models on GPU on Databricks clusters, we need to perform some additional setup steps, as ONNX requires CUDA 12 and cuDNN 9 to be installed.

Therefore, we need to use Databricks runtimes starting from version 15, as these come with CUDA 12. However, they come with cuDNN 8, which we need to upgrade manually.
To do so, we have to add the following script as an [init script](https://docs.databricks.com/en/init-scripts/index.html):

```bash
#!/bin/bash
sudo apt-get update && sudo apt-get -y install cudnn9-cuda-12
```

Save this script to a shell script file (e.g. `upgrade-cudnn9.sh`) in your workspace, then specify it on your compute resource under the *Advanced options* section. cuDNN will be upgraded to version 9 on all nodes before Spark starts.
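
To verify that the upgrade took effect, you can list the installed cuDNN packages from a notebook cell (a quick sanity check, assuming a Debian-based runtime as in current Databricks images):

```bash
%sh
# Expect a cudnn9-cuda-12 entry after the init script has run
dpkg -l | grep -i cudnn
```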

</div><div class="h3-box" markdown="1">

### Databricks Notebooks

You can view all the Databricks notebooks from this address:

[https://johnsnowlabs.github.io/spark-nlp-workshop/databricks/index.html](https://johnsnowlabs.github.io/spark-nlp-workshop/databricks/index.html)

Note: You can import these notebooks by using their URLs.

</div><div class="h3-box" markdown="1">

@@ -849,6 +873,8 @@ Spark NLP 5.5.1 has been tested and is compatible with the following runtimes:
- 14.0 ML
- 14.1
- 14.1 ML
- 15.x
- 15.x ML

**GPU:**

@@ -871,39 +897,7 @@ Spark NLP 5.5.1 has been tested and is compatible with the following runtimes:
- 13.3 ML & GPU
- 14.0 ML & GPU
- 14.1 ML & GPU

-</div><div class="h3-box" markdown="1">
-
-#### Install Spark NLP on Databricks
-
-1. Create a cluster if you don't have one already
-
-2. On a new cluster or existing one you need to add the following to the `Advanced Options -> Spark` tab:
-
-```bash
-spark.kryoserializer.buffer.max 2000M
-spark.serializer org.apache.spark.serializer.KryoSerializer
-```
-
-3. In `Libraries` tab inside your cluster you need to follow these steps:
-
-3.1. Install New -> PyPI -> `spark-nlp` -> Install5.5.1
-
-3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.1` -> Install
-
-4. Now you can attach your notebook to the cluster and use Spark NLP!
-
-NOTE: Databrick's runtimes support different Apache Spark major releases. Please make sure you choose the correct Spark NLP Maven pacakge name (Maven Coordinate) for your runtime from our [Packages Cheatsheet](https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet)
-
-</div><div class="h3-box" markdown="1">
-
-#### Databricks Notebooks
-
-You can view all the Databricks notebooks from this address:
-
-[https://johnsnowlabs.github.io/spark-nlp-workshop/databricks/index.html](https://johnsnowlabs.github.io/spark-nlp-workshop/databricks/index.html)
-
-Note: You can import these notebooks by using their URLs.
+- 15.x ML & GPU

</div><div class="h3-box" markdown="1">

@@ -251,7 +251,7 @@
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"display_name": "sparknlp_dev",
"language": "python",
"name": "python3"
},
@@ -264,7 +264,8 @@
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,