NYC Data Analysis using Spark and GCP

Environment Setup

Create a conda env

conda create --name data_analysis_spark

Activate the environment

conda activate data_analysis_spark

Install PySpark

conda install -c conda-forge pyspark

Install Jupyter Notebook

conda install -c conda-forge notebook

Install necessary packages

pip install -r requirements.txt
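
To verify the environment, you can check that PySpark imports correctly and reports its version (a quick sanity check, not part of the original setup steps):

python -c "import pyspark; print(pyspark.__version__)"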

Install GCP CLI (Optional)

  1. Install the Google Cloud SDK:

    • Download and install the Google Cloud SDK from the Google Cloud SDK page; it includes the gcloud CLI.
    • Follow the installation instructions for your operating system.

  2. Initialize the gcloud CLI:

    • After installation, open a terminal or command prompt.
    • Run the initialization command:

      gcloud init

    • Follow the on-screen instructions to authenticate your Google account and set up the default configuration, including the project and compute zone (these defaults can also be set directly, as shown below).
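
If you prefer to set the defaults non-interactively, a minimal sketch using gcloud config (the project ID, region, and zone below are placeholders; substitute your own values):

    gcloud config set project YOUR_PROJECT_ID
    gcloud config set compute/region us-central1
    gcloud config set compute/zone us-central1-a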

Google Cloud Dataproc

We need to set up Google Cloud Dataproc (a managed Spark and Hadoop service) to execute the Spark jobs on it.

This can be done either through the Cloud Console Web UI or with the gcloud CLI.
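
For reference, a minimal sketch of the gcloud workflow is shown below; the cluster name, region, and job script path are placeholders (not actual file names from this repository) and should be adapted to your project.

Create a cluster:

    gcloud dataproc clusters create nyc-da-cluster --region=us-central1 --single-node

Submit a PySpark job:

    gcloud dataproc jobs submit pyspark your_job.py --cluster=nyc-da-cluster --region=us-central1

Delete the cluster when finished to avoid unnecessary charges:

    gcloud dataproc clusters delete nyc-da-cluster --region=us-central1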

The dataset

The Kaggle competition dataset can be downloaded here: https://www.kaggle.com/c/nyc-taxi-trip-duration/data

The original trip records are published by the NYC Taxi & Limousine Commission: https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
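
To make the data readable from Dataproc jobs, one option is to stage it in a Cloud Storage bucket; the bucket name and file name below are placeholders:

    gsutil mb gs://your-bucket-name

    gsutil cp train.csv gs://your-bucket-name/data/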

Run all the jobs

Make the script executable:

chmod +x run_workflow.sh

then run it:

./run_workflow.sh

About

A scalable Spark framework for versatile data analysis
