This project is a fully automated end-to-end Data Engineering pipeline built as part of the Data Engineering Zoomcamp. It ingests, processes, models, and visualizes airplane crash data using modern cloud-native tools.
The objective of this pipeline is to build a reliable and scalable data platform that answers key analytical questions related to global airplane crashes. It helps track crash frequency, operator involvement, survival rates, aircraft models, and trends over time.
- Source: Hugging Face dataset: nateraw/airplane-crashes-and-fatalities
- Original Source: Likely web-scraped from PlaneCrashInfo.com
- Format: CSV
- Fields: `Date`, `Location`, `Operator`, `Type` (aircraft), `Aboard`, `Fatalities`, `Ground`, `Survivors`, `Summary`
The raw file is available at:
`s3://my-spark-stage-23-3-1998-v1-01/plane_crashes/raw_hf_airplane_crashes.csv`
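For a quick sanity check of the source, the dataset can also be pulled straight from Hugging Face. A minimal sketch, assuming the `datasets` and `pandas` packages are installed and the dataset exposes a `train` split; the column list is taken from the field list above:

```python
# Minimal sketch: load the raw crash data from Hugging Face and inspect its columns.
# Assumes the `datasets` and `pandas` packages are installed and a "train" split exists;
# the expected column names come from the field list above.
from datasets import load_dataset

ds = load_dataset("nateraw/airplane-crashes-and-fatalities", split="train")
df = ds.to_pandas()

expected = ["Date", "Location", "Operator", "Type", "Aboard",
            "Fatalities", "Ground", "Survivors", "Summary"]
print(df.columns.tolist())  # compare against `expected`
print(df.head())
```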
- Extracts raw CSV airplane crash data
- Transforms it with Spark (adds severity scoring, survival count, and normalizes formats)
- Stores it partitioned by year in Parquet format on Amazon S3
- Loads the processed data as an external Redshift Spectrum table
- Models it into fact/dimension structure via dbt
- Visualizes KPIs and trends in Amazon QuickSight
- Entire flow orchestrated using Prefect, from infrastructure provisioning to final model run
- Runs the full pipeline using a single command (`pipeline.py`)
- Tasks include: Terraform apply, pulling the data, the Spark job on EMR, Redshift schema/table creation, dbt model runs, and test jobs (see the sketch below)
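A minimal sketch of how such a flow could be wired together. The task names, helper scripts under `scripts/`, and commands below are illustrative assumptions, not the actual contents of `perfect/pipeline.py`:

```python
# Illustrative sketch only: shows how the orchestration steps described above could
# be chained in a Prefect flow. Task names, helper script paths, and commands are
# assumptions, not the actual contents of perfect/pipeline.py.
import subprocess
from prefect import flow, task


@task
def terraform_apply() -> None:
    # Provision S3, EMR, Redshift, and IAM non-interactively.
    subprocess.run(["terraform", "apply", "-auto-approve"], cwd="terraform", check=True)


@task
def ingest_raw_data() -> None:
    # Pull the Hugging Face CSV and stage it in S3 (hypothetical helper script).
    subprocess.run(["python", "scripts/ingest_raw_data.py"], check=True)


@task
def run_spark_job() -> None:
    # Submit the transformation job to EMR (hypothetical helper script).
    subprocess.run(["python", "scripts/submit_emr_job.py"], check=True)


@task
def create_spectrum_table() -> None:
    # Create the external schema/table over the Parquet output (hypothetical helper).
    subprocess.run(["python", "scripts/create_spectrum_table.py"], check=True)


@task
def run_dbt() -> None:
    # Build the fact/dimension models and run the dbt tests.
    subprocess.run(["dbt", "run"], cwd="aircrash_dwh", check=True)
    subprocess.run(["dbt", "test"], cwd="aircrash_dwh", check=True)


@flow(name="aircrash-pipeline")
def pipeline() -> None:
    terraform_apply()
    ingest_raw_data()
    run_spark_job()
    create_spectrum_table()
    run_dbt()


if __name__ == "__main__":
    pipeline()
```

Keeping each stage as its own `@task` gives per-step retries and visibility in the Prefect UI rather than one opaque script run.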
- Creates the S3 bucket, EMR cluster, Redshift cluster, and IAM roles
- Cleans and transforms the raw dataset (see the sketch below):
  - Standardizes the date format into `crash_date`
  - Adds the fields `survivors`, `is_fatal`, `crash_severity`
  - Partitions the data by `year`
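A minimal PySpark sketch of this transformation step. The input date format, the survivors derivation (`Aboard - Fatalities`), the severity thresholds, and the output prefix are assumptions and may differ from the actual job in `scripts/`:

```python
# Minimal PySpark sketch of the transformation step. Column names follow the raw
# field list above; the date format, severity thresholds, and output prefix are
# assumptions and may differ from the actual EMR job.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("aircrash-transform").getOrCreate()

raw = spark.read.csv(
    "s3://my-spark-stage-23-3-1998-v1-01/plane_crashes/raw_hf_airplane_crashes.csv",
    header=True,
)

clean = (
    raw
    # Standardize the date format into crash_date and derive the partition year.
    .withColumn("crash_date", F.to_date("Date", "MM/dd/yyyy"))
    .withColumn("year", F.year("crash_date"))
    # Derive survivors, is_fatal, and a simple crash_severity score (assumed rule).
    .withColumn("survivors", F.col("Aboard").cast("int") - F.col("Fatalities").cast("int"))
    .withColumn("is_fatal", F.col("Fatalities").cast("int") > 0)
    .withColumn(
        "crash_severity",
        F.when(F.col("Fatalities").cast("int") >= 100, "high")
         .when(F.col("Fatalities").cast("int") >= 10, "medium")
         .otherwise("low"),
    )
)

# Write Parquet partitioned by year, ready to be exposed via Redshift Spectrum.
clean.write.mode("overwrite").partitionBy("year").parquet(
    "s3://my-spark-stage-23-3-1998-v1-01/plane_crashes/processed/"
)
```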
- External table created over the Parquet files in S3
- Partitions added dynamically based on the folder paths (example DDL below)
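A hedged example of what the external schema and table DDL could look like, executed here with the `redshift_connector` package. The schema name, IAM role ARN, cluster endpoint, and column list are illustrative assumptions:

```python
# Illustrative sketch: create an external schema and a partitioned external table
# over the processed Parquet files. Schema name, IAM role ARN, endpoint, and the
# column list are assumptions; adjust them to the Terraform outputs and .env values.
import os

import redshift_connector  # assumes the redshift_connector package is installed

DDL_STATEMENTS = [
    """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'aircrash_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
    """,
    """
    CREATE EXTERNAL TABLE spectrum.plane_crashes (
        crash_date     DATE,
        location       VARCHAR(256),
        operator       VARCHAR(256),
        aircraft_type  VARCHAR(256),
        aboard         INT,
        fatalities     INT,
        ground         INT,
        survivors      INT,
        is_fatal       BOOLEAN,
        crash_severity VARCHAR(16),
        summary        VARCHAR(4096)
    )
    PARTITIONED BY (year INT)
    STORED AS PARQUET
    LOCATION 's3://my-spark-stage-23-3-1998-v1-01/plane_crashes/processed/'
    """,
    # Partitions are added per year folder, e.g. .../processed/year=2000/
    """
    ALTER TABLE spectrum.plane_crashes
    ADD IF NOT EXISTS PARTITION (year = 2000)
    LOCATION 's3://my-spark-stage-23-3-1998-v1-01/plane_crashes/processed/year=2000/'
    """,
]

conn = redshift_connector.connect(
    host=os.environ["REDSHIFT_HOST"],          # e.g. taken from tf_outputs.json / .env
    database="dev",
    user="awsuser",
    password=os.environ["REDSHIFT_PASSWORD"],  # kept out of git via .env
)
conn.autocommit = True  # external DDL must run outside a transaction block
cursor = conn.cursor()
for stmt in DDL_STATEMENTS:
    cursor.execute(stmt)
conn.close()
```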
- Models split into:
  - `fact_plane_crashes`
  - `dim_aircraft`, `dim_operator`, `dim_date`
- Connects directly to Redshift
- Shows crash trends, survival rates, worst aircraft types, etc.
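As an illustration of the kind of question the star schema answers, a query like the following could feed a "worst aircraft types" visual. The join key and column names are assumptions based on the model names above, and it can be run over the same `redshift_connector` connection shown earlier:

```python
# Illustrative analytical query over the modeled tables; the join key and column
# names are assumptions based on the model names above.
WORST_AIRCRAFT_SQL = """
SELECT
    d.aircraft_type,
    COUNT(*)          AS crashes,
    SUM(f.fatalities) AS total_fatalities,
    AVG(f.survivors::FLOAT / NULLIF(f.aboard, 0)) AS avg_survival_rate
FROM fact_plane_crashes f
JOIN dim_aircraft d ON f.aircraft_key = d.aircraft_key
GROUP BY d.aircraft_type
ORDER BY total_fatalities DESC
LIMIT 10;
"""
```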
| Tool | Role |
|---|---|
| Terraform | Infra provisioning (S3, EMR, Redshift, IAM) |
| Spark on EMR | Batch processing and transformation |
| AWS S3 | Raw and processed data lake storage |
| Redshift Spectrum | External table queries on S3 data |
| dbt | Data modeling into fact/dim tables |
| Prefect | Workflow orchestration and automation |
| QuickSight | Business intelligence and dashboards |
```bash
git clone https://github.com/YOUR_USERNAME/aircrash-data-pipeline.git
cd aircrash-data-pipeline

aws configure
# Add your AWS access key, secret, and region (e.g., us-west-2)

cp .env.example .env
# Edit it and fill in your Redshift password, bucket name, etc.

prefect deploy
prefect agent start &
python perfect/pipeline.py
```
```
.
├── terraform/       # Infra as code (S3, EMR, Redshift)
├── scripts/         # Python scripts (Spark, data ingestion)
├── aircrash_dwh/    # dbt project (models, seeds, macros)
├── perfect/         # Prefect orchestration pipeline
├── docs/images/     # Architecture + dashboard screenshots
├── .env.example     # Example env vars for local secrets
└── README.md
```
- `generate_profiles.py` auto-creates `~/.dbt/profiles.yml` from the Terraform output (see the sketch below)
- `.env` keeps secrets secure
- Orchestration covers the entire flow; no manual steps are required
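A minimal sketch of what that profile generation could look like. The `tf_outputs.json` keys, the profile name, and the target fields are assumptions; the real script may differ:

```python
# Illustrative sketch of generating ~/.dbt/profiles.yml from Terraform output.
# The tf_outputs.json keys, profile name, and target fields are assumptions.
import json
import os
from pathlib import Path

import yaml  # assumes PyYAML is installed

# Terraform outputs exported with: terraform output -json > tf_outputs.json
outputs = json.loads(Path("tf_outputs.json").read_text())

profiles = {
    "aircrash_dwh": {
        "target": "prod",
        "outputs": {
            "prod": {
                "type": "redshift",
                "host": outputs["redshift_endpoint"]["value"],
                "port": 5439,
                "user": "awsuser",
                "password": os.environ["REDSHIFT_PASSWORD"],  # kept out of git via .env
                "dbname": "dev",
                "schema": "analytics",
                "threads": 4,
            }
        },
    }
}

profile_path = Path.home() / ".dbt" / "profiles.yml"
profile_path.parent.mkdir(parents=True, exist_ok=True)
profile_path.write_text(yaml.safe_dump(profiles, sort_keys=False))
print(f"Wrote {profile_path}")
```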
Be sure your `.gitignore` excludes:

```
.terraform/
*.tfstate
*.pem
.env
*.pyc
__pycache__/
tf_outputs.json
dbt_packages/
target/
```
Project built by Hossam as part of the DataTalksClub Data Engineering Zoomcamp.