[SPARKNLP-1102] Adding support to read Excel files #14489

Open. Wants to merge 6 commits into base: master
320 changes: 277 additions & 43 deletions examples/python/reader/SparkNLP_Email_Reader_Demo.ipynb

Large diffs are not rendered by default.

329 changes: 329 additions & 0 deletions examples/python/reader/SparkNLP_Excel_Reader_Demo.ipynb

Large diffs are not rendered by default.

285 changes: 238 additions & 47 deletions examples/python/reader/SparkNLP_HTML_Reader_Demo.ipynb

Large diffs are not rendered by default.

121 changes: 90 additions & 31 deletions examples/python/reader/SparkNLP_Word_Reader_Demo.ipynb
@@ -2,7 +2,9 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"id": "fVCTDXvj23JY"
},
"source": [
"![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)\n",
"\n",
@@ -33,55 +35,63 @@
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"id": "_lK9PBpd23Je"
},
"source": [
"- Let's install and setup Spark NLP in Google Colab\n",
"- This part is pretty easy via our simple script"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"execution_count": 1,
"metadata": {
"id": "diBT0PwL23Je"
},
"outputs": [],
"source": [
"! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash"
]
},
{
"cell_type": "markdown",
"metadata": {},
"metadata": {
"id": "HWjx-reJ23Jf"
},
"source": [
"For local files example we will download a couple of Word files from Spark NLP Github repo:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 7,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ya8qZe00dalC",
"outputId": "f6800bce-c101-47e3-8030-cf1a0b758183"
"outputId": "d4ac0a0d-edd7-4126-cf01-9ad5ed0500a3"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2024-12-11 02:43:35-- https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1094-Adding-support-to-read-Word-files-v2/src/test/resources/reader/doc/contains-pictures.docx\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.\n",
"--2025-03-06 00:33:05-- https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/doc/contains-pictures.docx\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 95087 (93K) [application/octet-stream]\n",
"Saving to: ‘word-files/contains-pictures.docx’\n",
"\n",
"contains-pictures.d 100%[===================>] 92.86K --.-KB/s in 0.04s \n",
"\r",
"contains-pictures.d 0%[ ] 0 --.-KB/s \r",
"contains-pictures.d 100%[===================>] 92.86K --.-KB/s in 0.02s \n",
"\n",
"2024-12-11 02:43:35 (2.47 MB/s) - ‘word-files/contains-pictures.docx’ saved [95087/95087]\n",
"2025-03-06 00:33:06 (3.86 MB/s) - ‘word-files/contains-pictures.docx’ saved [95087/95087]\n",
"\n",
"--2024-12-11 02:43:36-- https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1094-Adding-support-to-read-Word-files-v2/src/test/resources/reader/doc/fake_table.docx\n",
"--2025-03-06 00:33:06-- https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/doc/fake_table.docx\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.111.133, 185.199.110.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
@@ -90,7 +100,7 @@
"\n",
"fake_table.docx 100%[===================>] 12.10K --.-KB/s in 0s \n",
"\n",
"2024-12-11 02:43:36 (24.7 MB/s) - ‘word-files/fake_table.docx’ saved [12392/12392]\n",
"2025-03-06 00:33:06 (99.2 MB/s) - ‘word-files/fake_table.docx’ saved [12392/12392]\n",
"\n"
]
}
@@ -103,22 +113,22 @@
},
{
"cell_type": "code",
"execution_count": 9,
"execution_count": 8,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "oZLpFt7qcWoC",
"outputId": "6e5ce0b8-383a-481c-9b7b-d4250d385f25"
"outputId": "4a0b4ef5-40e8-4020-e5f4-3a2002a0fc61"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"total 112K\n",
"-rw-r--r-- 1 root root 93K Dec 11 02:43 contains-pictures.docx\n",
"-rw-r--r-- 1 root root 13K Dec 11 02:43 fake_table.docx\n"
"-rw-r--r-- 1 root root 93K Mar 6 00:33 contains-pictures.docx\n",
"-rw-r--r-- 1 root root 13K Mar 6 00:33 fake_table.docx\n"
]
}
],
@@ -138,13 +148,13 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 15,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "_3GKYbmScehR",
"outputId": "24941880-c772-4b4e-dd0d-349fe8ea31c9"
"outputId": "8a0cba04-4db8-4705-ccb4-4c7b8f74fc99"
},
"outputs": [
{
@@ -163,31 +173,31 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 16,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "eKOYqIigmlmh",
"outputId": "1a3ec3b7-b49d-420b-cdaf-e4682b4f66e1"
"outputId": "f437fcf7-247e-4fda-d8cf-855c7fd6e6c3"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+\n",
"| doc|\n",
"+--------------------+\n",
"|[{Table, Header C...|\n",
"|[{Header, An inli...|\n",
"+--------------------+\n",
"+--------------------+--------------------+\n",
"| path| doc|\n",
"+--------------------+--------------------+\n",
"|file:/content/wor...|[{Header, An inli...|\n",
"|file:/content/wor...|[{Table, Header C...|\n",
"+--------------------+--------------------+\n",
"\n"
]
}
],
"source": [
"doc_df.select(\"doc\").show()"
"doc_df.show()"
]
},
{
@@ -198,7 +208,7 @@
"base_uri": "https://localhost:8080/"
},
"id": "IoC1eqPPcmqN",
"outputId": "b994396c-b670-49af-8bb9-b5e6ff44e8fe"
"outputId": "73acbe65-0844-446a-f59a-6549dddfdd47"
},
"outputs": [
{
@@ -207,7 +217,6 @@
"text": [
"root\n",
" |-- path: string (nullable = true)\n",
" |-- content: binary (nullable = true)\n",
" |-- doc: array (nullable = true)\n",
" | |-- element: struct (containsNull = true)\n",
" | | |-- elementType: string (nullable = true)\n",
@@ -234,6 +243,56 @@
"- HDFS: `hdfs://`\n",
"- Microsoft Fabric OneLake: `abfss://`"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "1DHIwRe13Ko7"
},
"source": [
"### Configuration Parameters"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FFnRYtys3Tv6"
},
"source": [
"- `storeContent`: By default, this is set to `false`. When enabled, the output will include the byte content of the file."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "EY9qzmZu3NC8",
"outputId": "0d0916b1-b0ca-4c58-b723-dcad794cd3e3"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Warning::Spark Session already created, some configs may not take.\n",
"+--------------------+--------------------+--------------------+\n",
"| path| doc| content|\n",
"+--------------------+--------------------+--------------------+\n",
"|file:/content/wor...|[{Header, An inli...|[50 4B 03 04 14 0...|\n",
"|file:/content/wor...|[{Table, Header C...|[50 4B 03 04 14 0...|\n",
"+--------------------+--------------------+--------------------+\n",
"\n"
]
}
],
"source": [
"params = {\"storeContent\": \"true\"}\n",
"doc_df = sparknlp.read(params).doc(\"./word-files\")\n",
"doc_df.show()"
]
}
],
"metadata": {
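A note on the `content` column produced by `storeContent` in the diff above: both rows begin with the bytes `50 4B 03 04`, which is the ZIP local file header signature. That is expected, since `.docx` files are ZIP containers. The following is a minimal sketch, not part of this PR or of Spark NLP, of how the stored bytes could be sanity-checked with the Python standard library. The helper names `looks_like_docx` and `make_minimal_docx_like_archive` are hypothetical, and the check assumes the conventional `word/document.xml` entry that Word-generated files contain.

```python
import io
import zipfile

# ZIP local file header signature: the bytes 50 4B 03 04 ("PK\x03\x04").
ZIP_MAGIC = b"PK\x03\x04"


def looks_like_docx(content: bytes) -> bool:
    """Return True if the bytes are a ZIP archive containing word/document.xml.

    Hypothetical helper for illustration; it checks the conventional main
    document entry and does not validate the full OOXML package structure.
    """
    if not content.startswith(ZIP_MAGIC):
        return False
    try:
        with zipfile.ZipFile(io.BytesIO(content)) as archive:
            return "word/document.xml" in archive.namelist()
    except zipfile.BadZipFile:
        # The signature matched but the rest is not a readable archive.
        return False


def make_minimal_docx_like_archive() -> bytes:
    """Build a tiny in-memory ZIP with a word/document.xml entry (stand-in data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<w:document/>")
    return buffer.getvalue()


sample = make_minimal_docx_like_archive()
print(looks_like_docx(sample))
print(looks_like_docx(b"not a zip"))
```

With the DataFrame from the notebook, the same check could presumably be applied to a collected row, for example `looks_like_docx(bytes(doc_df.first()["content"]))`, assuming the binary column arrives in Python as a `bytearray`.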