NYC Data Analysis using Spark and GCP

Environment Setup

Create a conda env

conda create --name data_analysis_spark

Activate the environment

conda activate data_analysis_spark

Install PySpark

conda install -c conda-forge pyspark

Install Jupyter Notebook

conda install -c conda-forge notebook

Install necessary packages

pip install -r requirements.txt
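
To verify the environment, you can check that PySpark imports correctly and reports its version (a quick sanity check, not part of the original setup steps):

python -c "import pyspark; print(pyspark.__version__)"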

Install GCP CLI (Optional)

  1. Install the Google Cloud SDK:

    • Download and install the Google Cloud SDK from the Google Cloud SDK page; it includes the gcloud CLI.
    • Follow the installation instructions for your operating system.

  2. Initialize the gcloud CLI:

    • After installation, open a terminal or command prompt.
    • Run the initialization command:

      gcloud init

    • Follow the on-screen instructions to authenticate your Google account and set up the default configuration, including the project and compute zone (these defaults can also be set directly, as shown below).
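
If you prefer to set the defaults non-interactively, a minimal sketch using gcloud config (the project ID, region, and zone below are placeholders; substitute your own values):

    gcloud config set project YOUR_PROJECT_ID
    gcloud config set compute/region us-central1
    gcloud config set compute/zone us-central1-a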

Google Cloud Dataproc

We need to set up Google Cloud Dataproc (a managed Spark and Hadoop service) to execute the Spark jobs on it.

This can be done either through the Cloud Console Web UI or with the gcloud CLI.
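
For reference, a minimal sketch of the gcloud workflow is shown below; the cluster name, region, and job script path are placeholders (not actual file names from this repository) and should be adapted to your project.

Create a cluster:

    gcloud dataproc clusters create nyc-da-cluster --region=us-central1 --single-node

Submit a PySpark job:

    gcloud dataproc jobs submit pyspark your_job.py --cluster=nyc-da-cluster --region=us-central1

Delete the cluster when finished to avoid unnecessary charges:

    gcloud dataproc clusters delete nyc-da-cluster --region=us-central1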

The dataset

The Kaggle competition dataset can be downloaded here: https://www.kaggle.com/c/nyc-taxi-trip-duration/data

The original trip records are published by the NYC Taxi & Limousine Commission: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
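
To make the data readable from Dataproc jobs, one option is to stage it in a Cloud Storage bucket; the bucket name and file name below are placeholders:

    gsutil mb gs://your-bucket-name

    gsutil cp train.csv gs://your-bucket-name/data/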

Run all the jobs

Make the script executable:

chmod +x run_workflow.sh

then run it:

./run_workflow.sh

About

A scalable Spark framework for versatile data analysis
