[SPARKNLP-1093] Adding support to read Email files (#14455)
* [SPARKNLP-1089] Adding support to read HTML files
* [SPARKNLP-1089] Adding documentation and support for set of URLs in python
* [SPARKNLP-1089] Adding input validation in python
* [SPARKNLP-1093] Adding support to read email files
* [SPARKNLP-1093] Adding reading email notebook example
Showing 23 changed files with 33,559 additions and 5 deletions.
examples/python/reader/SparkNLP_Email_Reader_Demo.ipynb (334 additions, 0 deletions)
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "tzcU5p2gdak9"
},
"source": [
"# Introducing Email reader in SparkNLP\n",
"This notebook showcases the newly added `sparknlp.read().email()` method in Spark NLP that parses email content from both local file system and distributed file systems into a Spark DataFrame." | ||
]
},
{
"cell_type": "code",
"source": [
"from google.colab import drive\n",
"drive.mount('/content/drive')"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "xrvHhiTAdfGd",
"outputId": "07fb7294-33b3-4af0-f4ac-d87e43fd21b6"
},
"execution_count": 1,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Mounted at /content/drive\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!cp drive/MyDrive/JSL/sparknlp/sparknlp.jar .\n",
"!cp drive/MyDrive/JSL/sparknlp/spark_nlp-5.5.1-py2.py3-none-any.whl ."
],
"metadata": {
"id": "mjV3NcQ8eA52"
},
"execution_count": 8,
"outputs": []
},
{
"cell_type": "code",
"source": [
"!pip install pyspark"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "pEmutNjReCgc",
"outputId": "32610063-174f-432b-be4a-6ab2ae9dd709"
},
"execution_count": 3,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Requirement already satisfied: pyspark in /usr/local/lib/python3.10/dist-packages (3.5.3)\n",
"Requirement already satisfied: py4j==0.10.9.7 in /usr/local/lib/python3.10/dist-packages (from pyspark) (0.10.9.7)\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"!pip install spark_nlp-5.5.1-py2.py3-none-any.whl"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3qjPeDjvfCpA",
"outputId": "620c793f-5cb1-4a82-f687-53f3be348d9c"
},
"execution_count": 9,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Processing ./spark_nlp-5.5.1-py2.py3-none-any.whl\n",
"Installing collected packages: spark-nlp\n",
"Successfully installed spark-nlp-5.5.1\n"
]
}
]
},
{
"cell_type": "code",
"source": [
"# import sparknlp\n",
"# # let's start Spark with Spark NLP\n",
"# spark = sparknlp.start()\n",
"\n",
"from pyspark.sql import SparkSession\n",
"\n",
"spark = SparkSession.builder \\\n",
"    .appName(\"SparkNLP\") \\\n",
"    .master(\"local[*]\") \\\n",
"    .config(\"spark.driver.memory\", \"12G\") \\\n",
"    .config(\"spark.serializer\", \"org.apache.spark.serializer.KryoSerializer\") \\\n",
"    .config(\"spark.kryoserializer.buffer.max\", \"2000M\") \\\n",
"    .config(\"spark.driver.maxResultSize\", \"0\") \\\n",
"    .config(\"spark.jars\", \"./sparknlp.jar\") \\\n",
"    .getOrCreate()\n",
"\n",
"\n",
"print(\"Apache Spark version: {}\".format(spark.version))"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "DczWop6QeE8F",
"outputId": "714b032f-e076-4aa3-8cf2-10eea6993c4d"
},
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Apache Spark version: 3.5.3\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "RFOFhaEedalB"
},
"source": [
"## Setup and Initialization\n",
"Let's keep in mind a few things before we start 😊\n",
"\n",
"Support for reading email files was introduced in Spark NLP 5.5.2. Please make sure you have upgraded to the latest Spark NLP release.\n",
"\n",
"For local files example we will download a couple of email files from Spark NLP Github repo:" | ||
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ya8qZe00dalC",
"outputId": "a9916407-f76d-4c59-fdad-ea17ca0a4326"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"mkdir: cannot create directory ‘email-files’: File exists\n",
"--2024-11-13 21:01:15--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1093-Adding-support-to-read-Email-files/src/test/resources/reader/email/email-text-attachments.eml\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 3175 (3.1K) [text/plain]\n",
"Saving to: ‘email-files/email-text-attachments.eml’\n",
"\n",
"email-text-attachme 100%[===================>]   3.10K  --.-KB/s    in 0s      \n",
"\n",
"2024-11-13 21:01:15 (29.9 MB/s) - ‘email-files/email-text-attachments.eml’ saved [3175/3175]\n",
"\n",
"--2024-11-13 21:01:15--  https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1093-Adding-support-to-read-Email-files/src/test/resources/reader/email/test-several-attachments.eml\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 1324361 (1.3M) [text/plain]\n",
"Saving to: ‘email-files/test-several-attachments.eml’\n",
"\n",
"test-several-attach 100%[===================>]   1.26M  --.-KB/s    in 0.05s   \n",
"\n",
"2024-11-13 21:01:16 (26.7 MB/s) - ‘email-files/test-several-attachments.eml’ saved [1324361/1324361]\n",
"\n"
]
}
],
"source": [
"!mkdir email-files\n",
"!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1093-Adding-support-to-read-Email-files/src/test/resources/reader/email/email-text-attachments.eml -P email-files\n",
"!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1093-Adding-support-to-read-Email-files/src/test/resources/reader/email/test-several-attachments.eml -P email-files"
]
},
{
"cell_type": "code",
"source": [
"!ls -lh ./email-files"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "3xgGItNbU2DZ",
"outputId": "12f8a7be-f9b4-49ce-a9ab-222142f28293"
},
"execution_count": 18,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"total 1.3M\n",
"-rw-r--r-- 1 root root 3.2K Nov 13 21:01 email-text-attachments.eml\n",
"-rw-r--r-- 1 root root 1.3M Nov 13 21:01 test-several-attachments.eml\n"
]
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EoFI66NAdalE"
},
"source": [
"## Parsing Email from Local Files\n",
"Use the `email()` method to parse email content from local directories."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "bAkMjJ1vdalE",
"outputId": "4b360b6c-5049-4f10-bb52-60e0e0e52e52"
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"Warning::Spark Session already created, some configs may not take.\n",
"+--------------------+\n",
"|               email|\n",
"+--------------------+\n",
"|[{Title, Email Te...|\n",
"|[{Title, Test Sev...|\n",
"+--------------------+\n",
"\n"
]
}
],
"source": [
"import sparknlp\n",
"email_df = sparknlp.read().email(\"./email-files\")\n",
"\n",
"email_df.select(\"email\").show()"
]
},
{
"cell_type": "code",
"source": [
"email_df.printSchema()"
],
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "7CMPPubFTeHj",
"outputId": "48ee68cf-0f7f-408a-a855-2fd2eb2e8bd1"
},
"execution_count": 21,
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
"root\n",
" |-- path: string (nullable = true)\n",
" |-- content: binary (nullable = true)\n",
" |-- email: array (nullable = true)\n",
" |    |-- element: struct (containsNull = true)\n",
" |    |    |-- elementType: string (nullable = true)\n",
" |    |    |-- content: string (nullable = true)\n",
" |    |    |-- metadata: map (nullable = true)\n",
" |    |    |    |-- key: string\n",
" |    |    |    |-- value: string (valueContainsNull = true)\n",
"\n"
]
}
]
},
{
"cell_type": "markdown",
"source": [
"You can also use DFS like Databricks `dbfs://` or HDFS directories `hdfs://`" | ||
],
"metadata": {
"id": "Qooecm9VTeus"
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"colab": {
"provenance": []
}
},
"nbformat": 4,
"nbformat_minor": 0
}
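The notebook's schema shows that each email row becomes an array of `{elementType, content, metadata}` structs. As a rough, self-contained illustration of the kind of MIME parsing such a reader performs, the sketch below uses only Python's standard-library `email` module. This is not Spark NLP's implementation, and the element labels other than `Title` (which appears in the notebook output) are hypothetical.

```python
# Hedged sketch: turn a raw .eml string into (element_type, content) pairs,
# loosely mirroring the {elementType, content} structs in the notebook schema.
# NOT Spark NLP's implementation; "NarrativeText"/"Attachment" labels are assumed.
from email import message_from_string
from email.message import Message

RAW_EML = """\
From: sender@example.com
To: receiver@example.com
Subject: Email Text Attachments
Content-Type: text/plain; charset="utf-8"

Hello, this is the plain-text body of the message.
"""

def parse_email_elements(raw: str) -> list[tuple[str, str]]:
    msg: Message = message_from_string(raw)
    # Subject maps naturally to the "Title" element seen in the notebook output.
    elements = [("Title", msg.get("Subject", ""))]
    # walk() visits every MIME part; a non-multipart message yields itself.
    for part in msg.walk():
        if part.get_content_maintype() == "multipart":
            continue
        filename = part.get_filename()
        if filename:
            # Parts with a filename are attachments; record just the name here.
            elements.append(("Attachment", filename))
        elif part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            elements.append(("NarrativeText",
                             payload.decode("utf-8", "replace").strip()))
    return elements

print(parse_email_elements(RAW_EML))
```

A real reader additionally extracts HTML bodies, nested multipart containers, and per-element metadata (sender, recipients, MIME type), which is where the `metadata` map in the schema comes from.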