JohnSnowLabs · danilojsl · Mar 6, 2025 · Mar 6, 2025 · Dec 17, 2024 · Dec 19, 2024
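
A note on the new `storeContent` cell in the Word reader notebook: the `content` column in its output starts with `50 4B 03 04`, which is the ZIP local-file-header signature, as expected for `.docx` files. Below is a minimal sketch of how a reader of the notebook might sanity-check those bytes. The helper names `looks_like_docx` and `make_fake_docx` are hypothetical and not part of Spark NLP, and the check assumes the conventional `word/document.xml` part name.

```python
import io
import zipfile

# Hex 50 4B 03 04: signature of the first local file header in a non-empty ZIP archive.
ZIP_MAGIC = b"PK\x03\x04"


def looks_like_docx(content) -> bool:
    """Return True if the bytes form a ZIP archive containing word/document.xml.

    Hypothetical helper for illustration; accepts bytes or bytearray.
    """
    data = bytes(content)
    if not data.startswith(ZIP_MAGIC):
        return False
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # Assumption: the main document part uses the conventional name.
            return "word/document.xml" in archive.namelist()
    except zipfile.BadZipFile:
        return False


def make_fake_docx() -> bytes:
    """Build a tiny in-memory ZIP that mimics the layout of a .docx (for testing only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<w:document/>")
    return buffer.getvalue()


if __name__ == "__main__":
    sample = make_fake_docx()
    print(sample[:4].hex(" ").upper())
    print(looks_like_docx(sample))
    print(looks_like_docx(b"plain text, not a zip"))
```

In the notebook, the same check could presumably be applied to each collected row, for example `looks_like_docx(row["content"])`, assuming the binary column arrives in Python as `bytes` or `bytearray`.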
diff --git a/examples/python/reader/SparkNLP_Email_Reader_Demo.ipynb b/examples/python/reader/SparkNLP_Email_Reader_Demo.ipynb
diff --git a/examples/python/reader/SparkNLP_Excel_Reader_Demo.ipynb b/examples/python/reader/SparkNLP_Excel_Reader_Demo.ipynb
diff --git a/examples/python/reader/SparkNLP_HTML_Reader_Demo.ipynb b/examples/python/reader/SparkNLP_HTML_Reader_Demo.ipynb
diff --git a/examples/python/reader/SparkNLP_Word_Reader_Demo.ipynb b/examples/python/reader/SparkNLP_Word_Reader_Demo.ipynb
@@ -2,7 +2,9 @@
  "cells": [
   {
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "id": "fVCTDXvj23JY"
+   },
    "source": [
     "![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)\n",
     "\n",
@@ -33,55 +35,63 @@
   },
   {
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "id": "_lK9PBpd23Je"
+   },
    "source": [
     "- Let's install and setup Spark NLP in Google Colab\n",
     "- This part is pretty easy via our simple script"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
+   "execution_count": 1,
+   "metadata": {
+    "id": "diBT0PwL23Je"
+   },
    "outputs": [],
    "source": [
     "! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash"
    ]
   },
   {
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "id": "HWjx-reJ23Jf"
+   },
    "source": [
     "For local files example we will download a couple of Word files from Spark NLP Github repo:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 7,
    "metadata": {
     "colab": {
      "base_uri": "https://localhost:8080/"
     },
     "id": "ya8qZe00dalC",
-    "outputId": "f6800bce-c101-47e3-8030-cf1a0b758183"
+    "outputId": "d4ac0a0d-edd7-4126-cf01-9ad5ed0500a3"
    },
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "--2024-12-11 02:43:35--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1094-Adding-support-to-read-Word-files-v2/src/test/resources/reader/doc/contains-pictures.docx\n",
-      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...\n",
-      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n",
+      "--2025-03-06 00:33:05--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/doc/contains-pictures.docx\n",
+      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...\n",
+      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
       "HTTP request sent, awaiting response... 200 OK\n",
       "Length: 95087 (93K) [application/octet-stream]\n",
       "Saving to: ‘word-files/contains-pictures.docx’\n",
       "\n",
-      "contains-pictures.d 100%[===================>]  92.86K  --.-KB/s    in 0.04s   \n",
+      "\r",
+      "contains-pictures.d   0%[                    ]       0  --.-KB/s               \r",
+      "contains-pictures.d 100%[===================>]  92.86K  --.-KB/s    in 0.02s   \n",
       "\n",
-      "2024-12-11 02:43:35 (2.47 MB/s) - ‘word-files/contains-pictures.docx’ saved [95087/95087]\n",
+      "2025-03-06 00:33:06 (3.86 MB/s) - ‘word-files/contains-pictures.docx’ saved [95087/95087]\n",
       "\n",
-      "--2024-12-11 02:43:36--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1094-Adding-support-to-read-Word-files-v2/src/test/resources/reader/doc/fake_table.docx\n",
+      "--2025-03-06 00:33:06--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/doc/fake_table.docx\n",
       "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...\n",
       "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.\n",
       "HTTP request sent, awaiting response... 200 OK\n",
@@ -90,7 +100,7 @@
       "\n",
       "fake_table.docx     100%[===================>]  12.10K  --.-KB/s    in 0s      \n",
       "\n",
-      "2024-12-11 02:43:36 (24.7 MB/s) - ‘word-files/fake_table.docx’ saved [12392/12392]\n",
+      "2025-03-06 00:33:06 (99.2 MB/s) - ‘word-files/fake_table.docx’ saved [12392/12392]\n",
       "\n"
      ]
     }
@@ -103,22 +113,22 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 9,
+   "execution_count": 8,
    "metadata": {
     "colab": {
      "base_uri": "https://localhost:8080/"
     },
     "id": "oZLpFt7qcWoC",
-    "outputId": "6e5ce0b8-383a-481c-9b7b-d4250d385f25"
+    "outputId": "4a0b4ef5-40e8-4020-e5f4-3a2002a0fc61"
    },
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
       "total 112K\n",
-      "-rw-r--r-- 1 root root 93K Dec 11 02:43 contains-pictures.docx\n",
-      "-rw-r--r-- 1 root root 13K Dec 11 02:43 fake_table.docx\n"
+      "-rw-r--r-- 1 root root 93K Mar  6 00:33 contains-pictures.docx\n",
+      "-rw-r--r-- 1 root root 13K Mar  6 00:33 fake_table.docx\n"
      ]
     }
    ],
@@ -138,13 +148,13 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 15,
    "metadata": {
     "colab": {
      "base_uri": "https://localhost:8080/"
     },
     "id": "_3GKYbmScehR",
-    "outputId": "24941880-c772-4b4e-dd0d-349fe8ea31c9"
+    "outputId": "8a0cba04-4db8-4705-ccb4-4c7b8f74fc99"
    },
    "outputs": [
     {
@@ -163,31 +173,31 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": 16,
    "metadata": {
     "colab": {
      "base_uri": "https://localhost:8080/"
     },
     "id": "eKOYqIigmlmh",
-    "outputId": "1a3ec3b7-b49d-420b-cdaf-e4682b4f66e1"
+    "outputId": "f437fcf7-247e-4fda-d8cf-855c7fd6e6c3"
    },
    "outputs": [
     {
      "name": "stdout",
      "output_type": "stream",
      "text": [
-      "+--------------------+\n",
-      "|                 doc|\n",
-      "+--------------------+\n",
-      "|[{Table, Header C...|\n",
-      "|[{Header, An inli...|\n",
-      "+--------------------+\n",
+      "+--------------------+--------------------+\n",
+      "|                path|                 doc|\n",
+      "+--------------------+--------------------+\n",
+      "|file:/content/wor...|[{Header, An inli...|\n",
+      "|file:/content/wor...|[{Table, Header C...|\n",
+      "+--------------------+--------------------+\n",
       "\n"
      ]
     }
    ],
    "source": [
-    "doc_df.select(\"doc\").show()"
+    "doc_df.show()"
    ]
   },
   {
@@ -198,7 +208,7 @@
      "base_uri": "https://localhost:8080/"
     },
     "id": "IoC1eqPPcmqN",
-    "outputId": "b994396c-b670-49af-8bb9-b5e6ff44e8fe"
+    "outputId": "73acbe65-0844-446a-f59a-6549dddfdd47"
    },
    "outputs": [
     {
@@ -207,7 +217,6 @@
      "text": [
       "root\n",
       " |-- path: string (nullable = true)\n",
-      " |-- content: binary (nullable = true)\n",
       " |-- doc: array (nullable = true)\n",
       " |    |-- element: struct (containsNull = true)\n",
       " |    |    |-- elementType: string (nullable = true)\n",
@@ -234,6 +243,56 @@
     "- HDFS: `hdfs://`\n",
     "- Microsoft Fabric OneLake: `abfss://`"
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "1DHIwRe13Ko7"
+   },
+   "source": [
+    "### Configuration Parameters"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "FFnRYtys3Tv6"
+   },
+   "source": [
+    "- `storeContent`: Defaults to `false`. When set to `true`, the output DataFrame includes a `content` column holding the raw bytes of each file."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {
+    "colab": {
+     "base_uri": "https://localhost:8080/"
+    },
+    "id": "EY9qzmZu3NC8",
+    "outputId": "0d0916b1-b0ca-4c58-b723-dcad794cd3e3"
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Warning::Spark Session already created, some configs may not take.\n",
+      "+--------------------+--------------------+--------------------+\n",
+      "|                path|                 doc|             content|\n",
+      "+--------------------+--------------------+--------------------+\n",
+      "|file:/content/wor...|[{Header, An inli...|[50 4B 03 04 14 0...|\n",
+      "|file:/content/wor...|[{Table, Header C...|[50 4B 03 04 14 0...|\n",
+      "+--------------------+--------------------+--------------------+\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "params = {\"storeContent\": \"true\"}\n",
+    "doc_df = sparknlp.read(params).doc(\"./word-files\")\n",
+    "doc_df.show()"
+   ]
   }
  ],
  "metadata": {