Spark NLP 5.5.2 Release Candidate #14473

Merged · 13 commits · Dec 18, 2024
6 changes: 3 additions & 3 deletions README.md
@@ -55,7 +55,7 @@ documentation and examples

## Quick Start

-This is a quick example of how to use Spark NLP pre-trained pipeline in Python and PySpark:
+This is a quick example of how to use a Spark NLP pre-trained pipeline in Python and PySpark:

```sh
$ java -version
```

@@ -214,7 +214,7 @@ Check all available installations for Python in our official [documentation](htt

### Compiled JARs

-To compile the jars from source follow [these instructions](https://sparknlp.org/docs/en/compiled#jars) from our official documenation
+To compile the jars from source follow [these instructions](https://sparknlp.org/docs/en/compiled#jars) from our official documentation

## Platform-Specific Instructions

@@ -234,7 +234,7 @@ For detailed instructions on how to use Spark NLP on supported platforms, please

Spark NLP library and all the pre-trained models/pipelines can be used entirely offline with no access to the Internet.
Please check [these instructions](https://sparknlp.org/docs/en/install#s3-integration) from our official documentation
-to use Spark NLP offline
+to use Spark NLP offline.

## Advanced Settings

10 changes: 9 additions & 1 deletion build.sbt
@@ -157,7 +157,14 @@ lazy val utilDependencies = Seq(
greex,
azureIdentity,
azureStorage,
-  jsoup)
+  jsoup,
+  jakartaMail,
+  angusMail,
+  poiDocx
+    exclude ("org.apache.logging.log4j", "log4j-api"),
+  scratchpad
+    exclude ("org.apache.logging.log4j", "log4j-api")
+)

lazy val typedDependencyParserDependencies = Seq(junit)

@@ -230,6 +237,7 @@ lazy val root = (project in file("."))

(assembly / assemblyMergeStrategy) := {
case PathList("META-INF", "versions", "9", "module-info.class") => MergeStrategy.discard
case PathList("module-info.class") => MergeStrategy.discard // Discard any module-info.class globally
case PathList("apache.commons.lang3", _ @_*) => MergeStrategy.discard
case PathList("org.apache.hadoop", _ @_*) => MergeStrategy.first
case PathList("com.amazonaws", _ @_*) => MergeStrategy.last
10 changes: 10 additions & 0 deletions docs/en/advanced_settings.md
@@ -96,6 +96,16 @@ spark.jsl.settings.annotator.log_folder dbfs:/PATH_TO_LOGS

NOTE: If this is an existing cluster, after adding new configs or changing existing properties you need to restart it.

#### Additional Configuration for Databricks
When running the Email Reader feature `sparknlp.read().email("./email-files")` on Databricks, you need to include the following Spark configurations to avoid dependency conflicts:

```bash
spark.driver.userClassPathFirst true
spark.executor.userClassPathFirst true
```
These configurations are required because the Databricks runtime includes a bundled version of the `com.sun.mail:jakarta.mail` library, which conflicts with `jakarta.activation`.
Setting these properties ensures that the user-provided libraries take precedence over those bundled in the Databricks environment, resolving the dependency conflict.
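
For reference, a minimal sketch of using the Email Reader once these configurations are in place (the directory path is illustrative, and the exact output schema may vary by Spark NLP version):

```python
import sparknlp

# On Databricks a Spark session already exists; sparknlp.start() is shown for completeness
spark = sparknlp.start()

# Parse a directory of email files into a Spark DataFrame
email_df = sparknlp.read().email("./email-files")

# Inspect the parsed result
email_df.printSchema()
email_df.show(5, truncate=80)
```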

</div><div class="h3-box" markdown="1">

### S3 Integration
123 changes: 123 additions & 0 deletions docs/en/annotator_entries/AutoGGUFEmbeddings.md
@@ -0,0 +1,123 @@
{%- capture title -%}
AutoGGUFEmbeddings
{%- endcapture -%}

{%- capture description -%}
Annotator that uses the llama.cpp library to generate text embeddings with large language
models.

The type of embedding pooling can be set with the `setPoolingType` method. The default is
`"MEAN"`. The available options are `"NONE"`, `"MEAN"`, `"CLS"`, and `"LAST"`.

If the parameters are not set, the annotator will default to the parameters provided by
the model.

Pretrained models can be loaded with `pretrained` of the companion object:

```scala
val autoGGUFEmbeddings = AutoGGUFEmbeddings.pretrained()
.setInputCols("document")
.setOutputCol("embeddings")
```

The default model is `"nomic-embed-text-v1.5.Q8_0.gguf"`, if no name is provided.

For available pretrained models please see the [Models Hub](https://sparknlp.org/models).

For extended examples of usage, see the
[AutoGGUFEmbeddingsTest](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/test/scala/com/johnsnowlabs/nlp/annotators/seq2seq/AutoGGUFEmbeddingsTest.scala)
and the
[example notebook](https://github.com/JohnSnowLabs/spark-nlp/tree/master/examples/python/llama.cpp/llama.cpp_in_Spark_NLP_AutoGGUFEmbeddings.ipynb).

**Note**: To use GPU inference with this annotator, make sure to use the Spark NLP GPU package and set
the number of GPU layers with the `setNGpuLayers` method.

When using larger models, we recommend adjusting GPU usage with `setNCtx` and `setNGpuLayers`
according to your hardware to avoid out-of-memory errors.
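
For example, a configuration for a larger model might look like the following sketch (the values are illustrative and should be tuned to your hardware):

```scala
val embeddingsForLargeModel = AutoGGUFEmbeddings
  .pretrained()
  .setInputCols("document")
  .setOutputCol("embeddings")
  .setNCtx(4096) // context size
  .setNGpuLayers(99) // offload as many layers to the GPU as memory allows
```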
{%- endcapture -%}

{%- capture input_anno -%}
DOCUMENT
{%- endcapture -%}

{%- capture output_anno -%}
SENTENCE_EMBEDDINGS
{%- endcapture -%}

{%- capture python_example -%}
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> document = DocumentAssembler() \
... .setInputCol("text") \
... .setOutputCol("document")
>>> autoGGUFEmbeddings = AutoGGUFEmbeddings.pretrained() \
... .setInputCols(["document"]) \
...     .setOutputCol("embeddings") \
... .setBatchSize(4) \
... .setNGpuLayers(99) \
... .setPoolingType("MEAN")
>>> pipeline = Pipeline().setStages([document, autoGGUFEmbeddings])
>>> data = spark.createDataFrame([["The moons of Jupiter are 77 in total, with 79 confirmed natural satellites and 2 man-made ones."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("embeddings.embeddings").show(truncate=80)
+--------------------------------------------------------------------------------+
| embeddings|
+--------------------------------------------------------------------------------+
|[[-0.034486726, 0.07770534, -0.15982522, -0.017873349, 0.013914132, 0.0365736...|
+--------------------------------------------------------------------------------+
{%- endcapture -%}

{%- capture scala_example -%}
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val autoGGUFEmbeddings = AutoGGUFEmbeddings
.pretrained()
.setInputCols("document")
.setOutputCol("embeddings")
.setBatchSize(4)
.setPoolingType("MEAN")

val pipeline = new Pipeline().setStages(Array(document, autoGGUFEmbeddings))

val data = Seq(
"The moons of Jupiter are 77 in total, with 79 confirmed natural satellites and 2 man-made ones.")
.toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("embeddings.embeddings").show(1, truncate=80)
+--------------------------------------------------------------------------------+
| embeddings|
+--------------------------------------------------------------------------------+
|[[-0.034486726, 0.07770534, -0.15982522, -0.017873349, 0.013914132, 0.0365736...|
+--------------------------------------------------------------------------------+
{%- endcapture -%}

{%- capture api_link -%}
[AutoGGUFEmbeddings](/api/com/johnsnowlabs/nlp/embeddings/AutoGGUFEmbeddings)
{%- endcapture -%}

{%- capture python_api_link -%}
[AutoGGUFEmbeddings](/api/python/reference/autosummary/sparknlp/annotator/embeddings/auto_gguf_embeddings/index.html)
{%- endcapture -%}

{%- capture source_link -%}
[AutoGGUFEmbeddings](https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/nlp/embeddings/AutoGGUFEmbeddings.scala)
{%- endcapture -%}

{% include templates/anno_template.md
title=title
description=description
input_anno=input_anno
output_anno=output_anno
python_example=python_example
scala_example=scala_example
api_link=api_link
python_api_link=python_api_link
source_link=source_link
%}
1 change: 1 addition & 0 deletions docs/en/annotators.md
@@ -45,6 +45,7 @@ There are two types of Annotators:
{:.table-model-big}
|Annotator|Description|Version |
|---|---|---|
{% include templates/anno_table_entry.md path="" name="AutoGGUFEmbeddings" summary="Annotator that uses the llama.cpp library to generate text embeddings with large language models."%}
{% include templates/anno_table_entry.md path="" name="AutoGGUFModel" summary="Annotator that uses the llama.cpp library to generate text completions with large language models."%}
{% include templates/anno_table_entry.md path="" name="BGEEmbeddings" summary="Sentence embeddings using BGE."%}
{% include templates/anno_table_entry.md path="" name="BigTextMatcher" summary="Annotator to match exact phrases (by token) provided in a file against a Document."%}
70 changes: 32 additions & 38 deletions docs/en/install.md
@@ -620,6 +620,8 @@ pointed [here](#python-without-explicit-pyspark-installation)

## Databricks Cluster

### Install Spark NLP on Databricks

1. Create a cluster if you don't have one already

2. On a new cluster or existing one you need to add the following to the `Advanced Options -> Spark` tab:
@@ -631,15 +633,37 @@ pointed [here](#python-without-explicit-pyspark-installation)

3. In `Libraries` tab inside your cluster you need to follow these steps:

3.1. Install New -> PyPI -> `spark-nlp==5.5.1` -> Install

3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.1` -> Install

4. Now you can attach your notebook to the cluster and use Spark NLP!

NOTE: Databricks' runtimes support different Apache Spark major releases. Please make sure you choose the correct Spark NLP Maven package name (Maven Coordinate) for your runtime from our [Packages Cheatsheet](https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet)
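
For example (illustrative coordinates; check the cheatsheet for the exact name matching your runtime):

```bash
# CPU package
com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.1
# GPU (CUDA) package
com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:5.5.1
```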

#### ONNX GPU Inference on Databricks

To run inference with ONNX models on GPU on Databricks clusters, we need to perform some additional setup steps, as ONNX requires CUDA 12 and cuDNN 9 to be installed.

Therefore, we need to use Databricks runtimes starting from version 15, as these come with CUDA 12. However, they come with cuDNN 8, which we need to upgrade manually.
To do so, we have to add the following script as an [init script](https://docs.databricks.com/en/init-scripts/index.html):

```bash
#!/bin/bash
sudo apt-get update && sudo apt-get -y install cudnn9-cuda-12
```

Save this script to a shell script file (e.g. `upgrade-cudnn9.sh`) in your workspace, then specify it on your compute resource under the *Advanced options* section. cuDNN will be upgraded to version 9 on all nodes before Spark starts.
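
To verify that the upgrade took effect, you can list the installed cuDNN packages from a notebook cell (a quick sanity check, assuming a Debian-based runtime as in current Databricks images):

```bash
%sh
# Expect a cudnn9-cuda-12 entry after the init script has run
dpkg -l | grep -i cudnn
```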

</div><div class="h3-box" markdown="1">

### Databricks Notebooks

You can view all the Databricks notebooks from this address:

[https://johnsnowlabs.github.io/spark-nlp-workshop/databricks/index.html](https://johnsnowlabs.github.io/spark-nlp-workshop/databricks/index.html)

Note: You can import these notebooks by using their URLs.

</div><div class="h3-box" markdown="1">

@@ -849,6 +873,8 @@ Spark NLP 5.5.1 has been tested and is compatible with the following runtimes:
- 14.0 ML
- 14.1
- 14.1 ML
- 15.x
- 15.x ML

**GPU:**

@@ -871,39 +897,7 @@ Spark NLP 5.5.1 has been tested and is compatible with the following runtimes:
- 13.3 ML & GPU
- 14.0 ML & GPU
- 14.1 ML & GPU

-</div><div class="h3-box" markdown="1">
-
-#### Install Spark NLP on Databricks
-
-1. Create a cluster if you don't have one already
-
-2. On a new cluster or existing one you need to add the following to the `Advanced Options -> Spark` tab:
-
-```bash
-spark.kryoserializer.buffer.max 2000M
-spark.serializer org.apache.spark.serializer.KryoSerializer
-```
-
-3. In `Libraries` tab inside your cluster you need to follow these steps:
-
-3.1. Install New -> PyPI -> `spark-nlp` -> Install5.5.1
-
-3.2. Install New -> Maven -> Coordinates -> `com.johnsnowlabs.nlp:spark-nlp_2.12:5.5.1` -> Install
-
-4. Now you can attach your notebook to the cluster and use Spark NLP!
-
-NOTE: Databrick's runtimes support different Apache Spark major releases. Please make sure you choose the correct Spark NLP Maven pacakge name (Maven Coordinate) for your runtime from our [Packages Cheatsheet](https://github.com/JohnSnowLabs/spark-nlp#packages-cheatsheet)
-
-</div><div class="h3-box" markdown="1">
-
-#### Databricks Notebooks
-
-You can view all the Databricks notebooks from this address:
-
-[https://johnsnowlabs.github.io/spark-nlp-workshop/databricks/index.html](https://johnsnowlabs.github.io/spark-nlp-workshop/databricks/index.html)
-
-Note: You can import these notebooks by using their URLs.
+- 15.x ML & GPU

</div><div class="h3-box" markdown="1">

@@ -251,7 +251,7 @@
"provenance": []
},
"kernelspec": {
"display_name": "Python 3",
"display_name": "sparknlp_dev",
"language": "python",
"name": "python3"
},
@@ -264,7 +264,8 @@
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,