conda create --name data_analysis_spark
conda activate data_analysis_spark
conda install -c conda-forge pyspark
conda install -c conda-forge notebook
pip install -r requirements.txt
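To verify the environment (optional), you can check that PySpark is importable from the new environment:
python -c "import pyspark; print(pyspark.__version__)"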
Install the Google Cloud SDK:
- Download and install the Google Cloud SDK from the Google Cloud SDK page. It includes the gcloud CLI.
- Follow the installation instructions for your specific operating system.

Initialize the gcloud CLI:
- After installation, open a terminal or command prompt.
- Run the initialization command:
gcloud init
- Follow the on-screen instructions to authenticate your Google account and set up the default configuration, including the project and compute zone.
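Optionally, you can confirm the active account and the default project and zone with the standard gcloud inspection commands:
gcloud auth list
gcloud config list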
We need to set up Google Cloud Dataproc (a managed Spark and Hadoop service) to execute Spark jobs on it.
We can do this either through the Google Cloud web console or with the gcloud CLI.
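For example, with the gcloud CLI, a minimal single-node cluster can be created along these lines (the cluster name and region below are placeholders; adjust them and the machine sizing to your project):
gcloud dataproc clusters create spark-analysis-cluster --region=us-central1 --single-node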
The dataset can be found on the Kaggle NYC Taxi Trip Duration competition page: https://www.kaggle.com/c/nyc-taxi-trip-duration/data
The underlying NYC TLC trip record data is available at: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
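To make the data readable from Dataproc jobs, it is usually staged in a Cloud Storage bucket first. For example (the bucket name and file name below are placeholders for whatever you downloaded):
gsutil cp train.csv gs://your-bucket-name/nyc-taxi/train.csv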
Give the script permission to run:
chmod +x run_workflow.sh
Then run it:
./run_workflow.sh
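The exact contents of run_workflow.sh are project-specific, but as an illustration only, a workflow script of this kind typically submits the PySpark job to the Dataproc cluster with something like the following (the cluster name, region, script name, and bucket are placeholders):
gcloud dataproc jobs submit pyspark analysis_job.py --cluster=spark-analysis-cluster --region=us-central1 -- gs://your-bucket-name/nyc-taxi/train.csv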