MovieLens Dataset analysis using Hadoop and Pyspark
-
Install Jupyter notebook
pip install jupyter
Then run
jupyter notebook
in your command prompt; a notebook will open in your browser on localhost
-
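To confirm the install before moving on, a small check like the one below can be run with the same Python that pip installed into. This is only a sketch: the helper name `is_importable` is made up for illustration, and it assumes that `pip install jupyter` makes the `notebook` package importable.

```python
import importlib.util


def is_importable(module_name):
    """Return True if the current interpreter can locate `module_name`."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the jupyter metapackage pulls in the `notebook` package.
print("notebook importable:", is_importable("notebook"))
```

If this prints False, pip most likely installed Jupyter into a different Python than the one running the check.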
Install Java
Download Java and install it on your computer
Add Java to the path
Go to Program files > Java > jdk > bin
Copy the path
Go to environment variables and add this path to "Path" under User variables
-
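One way to confirm that the Java bin folder really ended up on the Path is to compare it against the entries of the PATH variable. The sketch below does that with the standard library only; the function name `is_on_path` and the example folder are placeholders, so substitute the path you actually copied.

```python
import os


def is_on_path(directory, path_value=None):
    """Return True if `directory` is one of the entries of a PATH-style string."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")

    def norm(p):
        # Ignore surrounding quotes, trailing separators and (on Windows) case.
        return os.path.normcase(os.path.normpath(p.strip().strip('"')))

    target = norm(directory)
    entries = [e for e in path_value.split(os.pathsep) if e.strip()]
    return any(norm(e) == target for e in entries)


# Placeholder path: replace with the bin folder of your own JDK install.
jdk_bin = r"C:\Program Files\Java\jdk\bin"
print("JDK bin on PATH:", is_on_path(jdk_bin))
```

Environment variable changes are only picked up by newly opened command prompts, so run the check from a fresh one.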
Setup Java
Add a "JAVA_HOME" variable to environment variables
Go to Program files > Java > jdk
Copy the path and paste it as the value of the JAVA_HOME variable
-
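A frequent mistake with JAVA_HOME is pointing it at the bin folder instead of the jdk folder itself. The following sketch, with the hypothetical helper `java_home_problem`, checks that the variable is set and that a java executable sits under its bin subfolder. It accepts any dict-like environment so it can be tried without touching the real one.

```python
import os


def java_home_problem(env=None):
    """Return a description of what is wrong with JAVA_HOME, or None if it looks fine."""
    env = os.environ if env is None else env
    home = env.get("JAVA_HOME")
    if not home:
        return "JAVA_HOME is not set"
    if not os.path.isdir(home):
        return "JAVA_HOME is not a directory: " + home
    # java.exe on Windows, java elsewhere.
    candidates = [os.path.join(home, "bin", exe) for exe in ("java.exe", "java")]
    if not any(os.path.isfile(c) for c in candidates):
        return "no java executable under " + os.path.join(home, "bin")
    return None


print(java_home_problem() or "JAVA_HOME looks fine")
```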
Setup Hadoop
Add "HADOOP_HOME" to environment variables
In the git repo there is a hadoop folder
Copy the path to that folder
Set it as the value of the HADOOP_HOME variable
-
Setup Spark
Add "SPARK_HOME" to environment variables
In the git repo there is a spark zip
Unzip it
Copy the path to the unzipped folder
Set it as the value of the SPARK_HOME variable
-
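Both HADOOP_HOME and SPARK_HOME need to point at folders that contain a bin subfolder, since those bin folders are what gets added to the Path later. As a rough sanity check, the sketch below reports the most likely slips, such as leaving the variable unset or pointing SPARK_HOME at the zip instead of the unzipped folder. The function name `home_problems` is hypothetical.

```python
import os


def home_problems(env=None):
    """Map each variable name to a problem description; an empty dict means OK."""
    env = os.environ if env is None else env
    problems = {}
    for name in ("HADOOP_HOME", "SPARK_HOME"):
        home = env.get(name, "")
        if not home:
            problems[name] = "not set"
        elif home.lower().endswith(".zip"):
            problems[name] = "points at a zip file, not the unzipped folder"
        elif not os.path.isdir(os.path.join(home, "bin")):
            problems[name] = "no bin folder inside " + home
    return problems


for name, problem in home_problems().items():
    print(name + ": " + problem)
```

No output means both variables passed these checks in the current command prompt.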
Setup Pyspark
Setting up Pyspark variables
Go to environment variables and add these two
PYSPARK_DRIVER_PYTHON with value jupyter
PYSPARK_DRIVER_PYTHON_OPTS with value notebook
-
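It may help to see why these two values make a notebook open. Roughly speaking, the pyspark launcher starts the program named in PYSPARK_DRIVER_PYTHON and passes it the options from PYSPARK_DRIVER_PYTHON_OPTS. The sketch below is a simplified model of that behaviour, not Spark's actual code; `driver_command` and its `default_python` fallback are hypothetical names.

```python
import shlex


def driver_command(env, default_python="python"):
    """Model the driver command line implied by the two PySpark variables."""
    # Assumption: with no driver set, a plain Python interpreter is used.
    driver = env.get("PYSPARK_DRIVER_PYTHON") or default_python
    opts = shlex.split(env.get("PYSPARK_DRIVER_PYTHON_OPTS", ""))
    return [driver] + opts


settings = {
    "PYSPARK_DRIVER_PYTHON": "jupyter",
    "PYSPARK_DRIVER_PYTHON_OPTS": "notebook",
}
print(" ".join(driver_command(settings)))
```

With the values from this section the modelled command is the same `jupyter notebook` used earlier, which is why launching pyspark opens a notebook instead of a plain shell.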
Final Path setup
Go to Path in environment variables and add
%SPARK_HOME%\bin
%HADOOP_HOME%\bin
Open a new command prompt and run pyspark; Pyspark is now set up and a new Jupyter notebook will open with it
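Once the notebook is open, a quick smoke test can confirm that Spark itself starts. The cell below is a minimal sketch, assuming the notebook was launched through pyspark so that the `pyspark` package is importable; `get_session` and `smoke_test` are illustrative names, and the app name is arbitrary.

```python
def get_session():
    """Create (or reuse) a local Spark session; needs pyspark to be importable."""
    from pyspark.sql import SparkSession  # imported lazily, inside the notebook

    return (
        SparkSession.builder
        .master("local[*]")
        .appName("setup-check")
        .getOrCreate()
    )


def smoke_test(spark, n=1000):
    """Return True if Spark can build and count a tiny DataFrame of n rows."""
    return spark.range(n).count() == n
```

Running `smoke_test(get_session())` in a notebook cell should return True when Java, Hadoop and Spark are wired up correctly. An error at this point usually traces back to one of the environment variables above.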
Some references to help you set up