Car Crash Analysis {BCG GAMMA Case Study}



A modular application that uses spark-submit to produce the results for the given tasks.

🧐 About

Use the 6 CSV files in the raw folder and develop your approach to perform the analytics below.

Requirements

The application should perform the analyses below and store the results of each in the destination folder.
  1. Analysis 1: Find the number of crashes (accidents) in which the persons killed are male.
  2. Analysis 2: How many two-wheelers are booked for crashes?
  3. Analysis 3: Which state has the highest number of accidents in which females are involved?
  4. Analysis 4: Which are the top 5th to 15th VEH_MAKE_IDs that contribute to the largest number of injuries, including deaths?
  5. Analysis 5: For all the body styles involved in crashes, report the top ethnic user group of each unique body style.
  6. Analysis 6: Among the crashed cars, what are the top 5 zip codes with the highest number of crashes with alcohol as the contributing factor (use the driver's zip code)?
  7. Analysis 7: Count the distinct crash IDs where no damaged property was observed, the damage level (VEH_DMAG_SCL~) is above 4, and the car carries insurance.
  8. Analysis 8: Determine the top 5 vehicle makes where drivers are charged with speeding-related offences, have licensed drivers, use the top 10 used vehicle colours, and have cars licensed in the top 25 states with the highest number of offences (to be deduced from the data).
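To illustrate the kind of logic these analyses involve, here is a minimal plain-Python sketch of Analysis 1 on in-memory rows. This is not the actual PySpark job; the column names (CRASH_ID, PRSN_GNDR_ID, DEATH_CNT) are assumptions about the Primary_Person_use.csv schema.

```python
# Sketch of the Analysis 1 logic: count distinct crashes in which
# a male person was killed. Column names are assumed, not confirmed.

def count_male_death_crashes(rows):
    """Count distinct crash IDs where a male person was killed."""
    crash_ids = {
        r["CRASH_ID"]
        for r in rows
        if r["PRSN_GNDR_ID"] == "MALE" and int(r["DEATH_CNT"]) > 0
    }
    return len(crash_ids)

# Tiny hand-made sample standing in for Primary_Person_use.csv rows.
sample = [
    {"CRASH_ID": 1, "PRSN_GNDR_ID": "MALE", "DEATH_CNT": "1"},
    {"CRASH_ID": 1, "PRSN_GNDR_ID": "FEMALE", "DEATH_CNT": "0"},
    {"CRASH_ID": 2, "PRSN_GNDR_ID": "MALE", "DEATH_CNT": "0"},
]
print(count_male_death_crashes(sample))  # prints 1
```

In the actual application the same filter-then-count-distinct pattern would be expressed on a Spark DataFrame rather than a Python list.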
    ✍️ Project File Structure

    The basic project structure is shown below:

    CarCrashAnalysis-BCG
    ├─ .gitignore
    ├─ README.md
    ├─ config
    │  └─ config.json
    ├─ jobs
    │  ├─ __init__.py
    │  ├─ job.py
    │  ├─ jobbuilder.py
    │  ├─ loader.py
    │  ├─ logger.py
    │  ├─ settings.py
    │  └─ utils.py
    ├─ notebooks
    │  └─ workbook.ipynb
    ├─ requirements.txt
    ├─ resources
    │  ├─ logs
    │  ├─ processed
    │  │  ├─ Question_1
    │  │  │  ├─ ._SUCCESS.crc
    │  │  │  ├─ .part-00000-cfae5228-be9e-456b-82a7-1a8be3e75b47-c000.csv.crc
    │  │  │  ├─ _SUCCESS
    │  │  │  └─ part-00000-cfae5228-be9e-456b-82a7-1a8be3e75b47-c000.csv
    │  │  ├─ Question_2
    │  │  │  ├─ ._SUCCESS.crc
    │  │  │  ├─ .part-00000-f02dac60-a57f-4384-a945-87965281456b-c000.csv.crc
    │  │  │  ├─ _SUCCESS
    │  │  │  └─ part-00000-f02dac60-a57f-4384-a945-87965281456b-c000.csv
    │  │  ├─ Question_3
    │  │  │  ├─ ._SUCCESS.crc
    │  │  │  ├─ .part-00000-cad9913b-e1eb-4006-b8b2-dce7572f4a7e-c000.csv.crc
    │  │  │  ├─ _SUCCESS
    │  │  │  └─ part-00000-cad9913b-e1eb-4006-b8b2-dce7572f4a7e-c000.csv
    │  │  ├─ Question_4
    │  │  │  ├─ ._SUCCESS.crc
    │  │  │  ├─ .part-00000-fef5dd18-6b02-4ad9-9818-1b56d4cb72e2-c000.csv.crc
    │  │  │  ├─ _SUCCESS
    │  │  │  └─ part-00000-fef5dd18-6b02-4ad9-9818-1b56d4cb72e2-c000.csv
    │  │  ├─ Question_5
    │  │  │  ├─ ._SUCCESS.crc
    │  │  │  ├─ .part-00000-bb2305a2-89b6-4244-a48c-078be6506df1-c000.csv.crc
    │  │  │  ├─ _SUCCESS
    │  │  │  └─ part-00000-bb2305a2-89b6-4244-a48c-078be6506df1-c000.csv
    │  │  ├─ Question_6
    │  │  │  ├─ ._SUCCESS.crc
    │  │  │  ├─ .part-00000-7c3fae01-564b-4929-a972-ea92a2aca3ef-c000.csv.crc
    │  │  │  ├─ _SUCCESS
    │  │  │  └─ part-00000-7c3fae01-564b-4929-a972-ea92a2aca3ef-c000.csv
    │  │  ├─ Question_7
    │  │  │  ├─ ._SUCCESS.crc
    │  │  │  ├─ .part-00000-9292f6e7-c9fc-4314-9521-16cf969ee0da-c000.csv.crc
    │  │  │  ├─ _SUCCESS
    │  │  │  └─ part-00000-9292f6e7-c9fc-4314-9521-16cf969ee0da-c000.csv
    │  │  └─ Question_8
    │  │     ├─ ._SUCCESS.crc
    │  │     ├─ .part-00000-c7f26cb6-920e-478e-a2de-a5cded5eef32-c000.csv.crc
    │  │     ├─ _SUCCESS
    │  │     └─ part-00000-c7f26cb6-920e-478e-a2de-a5cded5eef32-c000.csv
    │  └─ raw
    │     ├─ Charges_use.csv
    │     ├─ Damages_use.csv
    │     ├─ Endorse_use.csv
    │     ├─ Primary_Person_use.csv
    │     ├─ Restrict_use.csv
    │     └─ Units_use.csv
    ├─ runner.py
    └─ tests
       └─ __init__.py
    
    

    🏁 Getting Started

    These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

    Prerequisites

    What you need to run the project:

    Download and install Spark. Find the latest release on the Spark download page.

    Configure SPARK_HOME Environment Variable.

    $SPARK_HOME = {Location of the spark-3.3.1-bin-hadoop3 directory}
    

    Check that Spark is installed properly by running:

    spark-shell
    

    Installing

    Clone the GitHub repo with the URL:

    git clone https://github.com/Vivek-Murali/CarCrashAnalysis-BCG
    

    🎈 Usage

    Change the values in the config.json file found in the config directory. The key definitions are as follows:

    1. resource (data resource) -> source_path: input directory from which all the CSVs are read.
    2. resource (data resource) -> destination_path: output directory to which the individual result CSVs are written.
    3. variables (ENV variables) -> APPDIR: path of the project's parent directory.
    4. functions (dependent variables) -> question_id: analysis question identifier, matching the analysis numbers above (e.g. question_id: 1 runs Analysis 1, the number of crashes in which the persons killed are male).
    5. functions (dependent variables) -> mode: the Spark write mode of the application (i.e. overwrite, append, ignore, errorifexists).
    6. version: version of the config file.
    7. app_name: the Spark application name; can be anything related to the project.

    The config.json file looks like this:

    {
        "resource": {
            "source_path": "resources/raw",
            "destination_path": "resources/processed"
        },
        "variables": {
            "APPDIR": "/home/sharpnel/Documents/CarCrashAnalysis-BCG"
        },
        "functions": {
            "question_id": 8,
            "mode": "overwrite"
        },
        "version": "v1.2.0",
        "app_name": "analytics"
    }
    

    Run the application by using the following command.

    spark-submit --master local[*] runner.py --config config/config.json
    

    Note: runner.py is the main runner file; it invokes the jobbuilder class to build and execute the analysis corresponding to the question_id set in the config.json file.
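    As a rough illustration, an entry point like runner.py might parse the --config flag and load the JSON file along these lines. This is a hedged sketch of one plausible implementation, not the actual contents of runner.py; the required-key check is an assumption.

```python
# Sketch: parse --config and load the JSON config; the real runner.py
# may structure this differently.
import argparse
import json

def load_config(path):
    """Load the JSON config and fail fast on missing top-level keys."""
    with open(path) as fh:
        cfg = json.load(fh)
    for key in ("resource", "variables", "functions"):
        if key not in cfg:
            raise KeyError(f"config is missing required key: {key}")
    return cfg

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Car crash analysis runner")
    parser.add_argument("--config", default="config/config.json",
                        help="Path to the JSON config file")
    return parser.parse_args(argv)
```

    From there, the loaded question_id would select which analysis the jobbuilder executes.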

    You can find a rough version of the analysis in the workbook.ipynb file in the notebooks directory.

    ⛏️ Built Using