How to run the notebook locally:
- Python 3
- PySpark
- Apache Hadoop
- Apache Sedona
- Download and install Python. We will be using Python 3.11.9 for compatibility. Make sure to add Python to Path.
- Open a command prompt and install PySpark with pip:
pip install pyspark
- Download and extract the Spark distribution from the Apache Spark website (we will be using spark-3.5.4 built for Hadoop 3).
- After extracting the archive, create a SPARK_HOME system environment variable set to path\to\spark-3.5.4-bin-hadoop3.
- Add path\to\spark-3.5.4-bin-hadoop3\bin to Path.
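On Windows, the two Spark settings above can also be applied from a Command Prompt instead of the System Properties dialog. This is a sketch; the install path is an example, so substitute the folder you actually extracted Spark to (note that setx truncates values longer than 1024 characters, so the GUI is safer if your Path is already long):

```
setx SPARK_HOME "C:\path\to\spark-3.5.4-bin-hadoop3"
setx PATH "%PATH%;C:\path\to\spark-3.5.4-bin-hadoop3\bin"
```

Open a new Command Prompt afterwards, since setx only affects future sessions.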
- Download Apache Hadoop and extract its files to your preferred path (we use version 3.3.6). In the System Environment Variables, make the following additions:
  - Add path\to\hadoop\bin to Path
  - Create a variable HADOOP_HOME with value path\to\hadoop
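With both the Spark and Hadoop variables in place, you can sanity-check the setup with a short script. The helper below is a minimal sketch (the path fragments are just the examples used in this guide); it reports which expected environment entries are missing:

```python
import os

def missing_env(environ, required_vars, required_path_dirs):
    """Return a list of problems with the Spark/Hadoop environment setup.

    environ            -- a mapping like os.environ
    required_vars      -- variable names that must be set (e.g. SPARK_HOME)
    required_path_dirs -- directory fragments that must appear in Path
    """
    problems = []
    for var in required_vars:
        if not environ.get(var):
            problems.append(f"{var} is not set")
    # Windows exposes the variable as Path, other systems as PATH.
    path = environ.get("PATH", environ.get("Path", ""))
    for d in required_path_dirs:
        if d.lower() not in path.lower():
            problems.append(f"{d} is not on Path")
    return problems

# Example check against the variables created in the steps above.
issues = missing_env(
    os.environ,
    required_vars=["SPARK_HOME", "HADOOP_HOME"],
    required_path_dirs=["spark-3.5.4-bin-hadoop3\\bin", "hadoop\\bin"],
)
print(issues or "environment looks OK")
```

Remember to restart your terminal (or IDE) after editing the System Environment Variables, or the script will still see the old values.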
- Install Apache Sedona using pip:
pip install apache-sedona
- Sedona needs two extra JAR files to work correctly: per the Sedona setup guide, these are the sedona-spark-shaded and geotools-wrapper JARs matching your Spark, Scala, and Sedona versions. Add both files to path\to\spark-3.5.4-bin-hadoop3\jars.
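To confirm the JARs landed in the right folder, a small script can scan the Spark jars directory. This is a sketch under the assumption that the required files are the sedona-spark-shaded and geotools-wrapper JARs named in the Sedona setup guide; adjust the prefixes if your versions differ:

```python
from pathlib import Path

def missing_jars(jars_dir, required_prefixes=("sedona-spark-shaded", "geotools-wrapper")):
    """Return the required JAR name prefixes with no matching file in jars_dir."""
    names = [p.name for p in Path(jars_dir).glob("*.jar")]
    return [
        prefix for prefix in required_prefixes
        if not any(name.startswith(prefix) for name in names)
    ]

# Example: point this at your Spark installation's jars folder.
# print(missing_jars(r"path\to\spark-3.5.4-bin-hadoop3\jars"))
```

An empty list means both JARs were found and the notebook should be able to start Sedona.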