This project is a fully automated end-to-end Data Engineering pipeline built as part of the Data Engineering Zoomcamp. It ingests, processes, models, and visualizes airplane crash data using modern cloud-native tools.
The objective of this pipeline is to build a reliable and scalable data platform that answers key analytical questions related to global airplane crashes. It helps track crash frequency, operator involvement, survival rates, aircraft models, and trends over time.
- Source: Hugging Face dataset: nateraw/airplane-crashes-and-fatalities
- Original Source: Likely web-scraped from PlaneCrashInfo.com
- Format: CSV
- Fields: `Date`, `Location`, `Operator`, `Type` (aircraft), `Aboard`, `Fatalities`, `Ground`, `Survivors`, `Summary`
The raw file is available at:
`s3://my-spark-stage-23-3-1998-v1-01/plane_crashes/raw_hf_airplane_crashes.csv`
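For a quick sanity check of the source, the dataset can also be pulled straight from Hugging Face. A minimal sketch, assuming the `datasets` and `pandas` packages are installed and the dataset exposes a `train` split; the column list is taken from the field list above:

```python
# Minimal sketch: load the raw crash data from Hugging Face and inspect its columns.
# Assumes the `datasets` and `pandas` packages are installed and a "train" split exists;
# the expected column names come from the field list above.
from datasets import load_dataset

ds = load_dataset("nateraw/airplane-crashes-and-fatalities", split="train")
df = ds.to_pandas()

expected = ["Date", "Location", "Operator", "Type", "Aboard",
            "Fatalities", "Ground", "Survivors", "Summary"]
print(df.columns.tolist())  # compare against `expected`
print(df.head())
```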
- Extracts raw CSV airplane crash data
- Transforms it with Spark (adds severity scoring, survival count, and normalizes formats)
- Stores it partitioned by year in Parquet format on Amazon S3
- Loads the processed data as an external Redshift Spectrum table
- Models it into fact/dimension structure via dbt
- Visualizes KPIs and trends in Amazon QuickSight
- Entire flow orchestrated using Prefect, from infrastructure provisioning to final model run
- Runs the full pipeline using a single command (`pipeline.py`)
- Tasks include: Terraform apply, pulling the data, the Spark job on EMR, Redshift schema/table creation, dbt model runs, and test jobs (see the sketch below)
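A minimal sketch of how such a flow could be wired together. The task names, helper scripts under `scripts/`, and commands below are illustrative assumptions, not the actual contents of `perfect/pipeline.py`:

```python
# Illustrative sketch only: shows how the orchestration steps described above could
# be chained in a Prefect flow. Task names, helper script paths, and commands are
# assumptions, not the actual contents of perfect/pipeline.py.
import subprocess
from prefect import flow, task


@task
def terraform_apply() -> None:
    # Provision S3, EMR, Redshift, and IAM non-interactively.
    subprocess.run(["terraform", "apply", "-auto-approve"], cwd="terraform", check=True)


@task
def ingest_raw_data() -> None:
    # Pull the Hugging Face CSV and stage it in S3 (hypothetical helper script).
    subprocess.run(["python", "scripts/ingest_raw_data.py"], check=True)


@task
def run_spark_job() -> None:
    # Submit the transformation job to EMR (hypothetical helper script).
    subprocess.run(["python", "scripts/submit_emr_job.py"], check=True)


@task
def create_spectrum_table() -> None:
    # Create the external schema/table over the Parquet output (hypothetical helper).
    subprocess.run(["python", "scripts/create_spectrum_table.py"], check=True)


@task
def run_dbt() -> None:
    # Build the fact/dimension models and run the dbt tests.
    subprocess.run(["dbt", "run"], cwd="aircrash_dwh", check=True)
    subprocess.run(["dbt", "test"], cwd="aircrash_dwh", check=True)


@flow(name="aircrash-pipeline")
def pipeline() -> None:
    terraform_apply()
    ingest_raw_data()
    run_spark_job()
    create_spectrum_table()
    run_dbt()


if __name__ == "__main__":
    pipeline()
```

Keeping each stage as its own `@task` gives per-step retries and visibility in the Prefect UI rather than one opaque script run.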
- Creates the S3 bucket, EMR cluster, Redshift cluster, and IAM roles
- Cleans and transforms the raw dataset (see the sketch below):
  - Standardizes the date format into `crash_date`
  - Adds the fields `survivors`, `is_fatal`, `crash_severity`
  - Partitions the data by `year`
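A minimal PySpark sketch of this transformation step. The input date format, the survivors derivation (`Aboard - Fatalities`), the severity thresholds, and the output prefix are assumptions and may differ from the actual job in `scripts/`:

```python
# Minimal PySpark sketch of the transformation step. Column names follow the raw
# field list above; the date format, severity thresholds, and output prefix are
# assumptions and may differ from the actual EMR job.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("aircrash-transform").getOrCreate()

raw = spark.read.csv(
    "s3://my-spark-stage-23-3-1998-v1-01/plane_crashes/raw_hf_airplane_crashes.csv",
    header=True,
)

clean = (
    raw
    # Standardize the date format into crash_date and derive the partition year.
    .withColumn("crash_date", F.to_date("Date", "MM/dd/yyyy"))
    .withColumn("year", F.year("crash_date"))
    # Derive survivors, is_fatal, and a simple crash_severity score (assumed rule).
    .withColumn("survivors", F.col("Aboard").cast("int") - F.col("Fatalities").cast("int"))
    .withColumn("is_fatal", F.col("Fatalities").cast("int") > 0)
    .withColumn(
        "crash_severity",
        F.when(F.col("Fatalities").cast("int") >= 100, "high")
         .when(F.col("Fatalities").cast("int") >= 10, "medium")
         .otherwise("low"),
    )
)

# Write Parquet partitioned by year, ready to be exposed via Redshift Spectrum.
clean.write.mode("overwrite").partitionBy("year").parquet(
    "s3://my-spark-stage-23-3-1998-v1-01/plane_crashes/processed/"
)
```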
- External table created over the Parquet files in S3
- Partitions added dynamically based on the folder paths (example DDL below)
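A hedged example of what the external schema and table DDL could look like, executed here with the `redshift_connector` package. The schema name, IAM role ARN, cluster endpoint, and column list are illustrative assumptions:

```python
# Illustrative sketch: create an external schema and a partitioned external table
# over the processed Parquet files. Schema name, IAM role ARN, endpoint, and the
# column list are assumptions; adjust them to the Terraform outputs and .env values.
import os

import redshift_connector  # assumes the redshift_connector package is installed

DDL_STATEMENTS = [
    """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'aircrash_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
    CREATE EXTERNAL DATABASE IF NOT EXISTS
    """,
    """
    CREATE EXTERNAL TABLE spectrum.plane_crashes (
        crash_date     DATE,
        location       VARCHAR(256),
        operator       VARCHAR(256),
        aircraft_type  VARCHAR(256),
        aboard         INT,
        fatalities     INT,
        ground         INT,
        survivors      INT,
        is_fatal       BOOLEAN,
        crash_severity VARCHAR(16),
        summary        VARCHAR(4096)
    )
    PARTITIONED BY (year INT)
    STORED AS PARQUET
    LOCATION 's3://my-spark-stage-23-3-1998-v1-01/plane_crashes/processed/'
    """,
    # Partitions are added per year folder, e.g. .../processed/year=2000/
    """
    ALTER TABLE spectrum.plane_crashes
    ADD IF NOT EXISTS PARTITION (year = 2000)
    LOCATION 's3://my-spark-stage-23-3-1998-v1-01/plane_crashes/processed/year=2000/'
    """,
]

conn = redshift_connector.connect(
    host=os.environ["REDSHIFT_HOST"],          # e.g. taken from tf_outputs.json / .env
    database="dev",
    user="awsuser",
    password=os.environ["REDSHIFT_PASSWORD"],  # kept out of git via .env
)
conn.autocommit = True  # external DDL must run outside a transaction block
cursor = conn.cursor()
for stmt in DDL_STATEMENTS:
    cursor.execute(stmt)
conn.close()
```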
- Models split into:
  - `fact_plane_crashes`
  - `dim_aircraft`, `dim_operator`, `dim_date`
- Connects directly to Redshift
- Shows crash trends, survival rates, worst aircraft types, etc.
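As an illustration of the kind of question the star schema answers, a query like the following could feed a "worst aircraft types" visual. The join key and column names are assumptions based on the model names above, and it can be run over the same `redshift_connector` connection shown earlier:

```python
# Illustrative analytical query over the modeled tables; the join key and column
# names are assumptions based on the model names above.
WORST_AIRCRAFT_SQL = """
SELECT
    d.aircraft_type,
    COUNT(*)          AS crashes,
    SUM(f.fatalities) AS total_fatalities,
    AVG(f.survivors::FLOAT / NULLIF(f.aboard, 0)) AS avg_survival_rate
FROM fact_plane_crashes f
JOIN dim_aircraft d ON f.aircraft_key = d.aircraft_key
GROUP BY d.aircraft_type
ORDER BY total_fatalities DESC
LIMIT 10;
"""
```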
| Tool | Role |
|---|---|
| Terraform | Infra provisioning (S3, EMR, Redshift, IAM) |
| Spark on EMR | Batch processing and transformation |
| AWS S3 | Raw and processed data lake storage |
| Redshift Spectrum | External table queries on S3 data |
| dbt | Data modeling into fact/dim tables |
| Prefect | Workflow orchestration and automation |
| QuickSight | Business intelligence and dashboards |
```bash
git clone https://github.com/YOUR_USERNAME/aircrash-data-pipeline.git
cd aircrash-data-pipeline

aws configure
# Add your AWS access key, secret, and region (e.g., us-west-2)

cp .env.example .env
# Edit it and fill in your Redshift password, bucket name, etc.

prefect deploy
prefect agent start &
python perfect/pipeline.py
```
```
.
├── terraform/       # Infra as code (S3, EMR, Redshift)
├── scripts/         # Python scripts (Spark, data ingestion)
├── aircrash_dwh/    # dbt project (models, seeds, macros)
├── perfect/         # Prefect orchestration pipeline
├── docs/images/     # Architecture + dashboard screenshots
├── .env.example     # Example env vars for local secrets
└── README.md
```
- `generate_profiles.py` auto-creates `~/.dbt/profiles.yml` from the Terraform output (see the sketch below)
- `.env` keeps secrets secure
- Orchestration covers the entire flow; no manual steps are required
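A minimal sketch of what that profile generation could look like. The `tf_outputs.json` keys, the profile name, and the target fields are assumptions; the real script may differ:

```python
# Illustrative sketch of generating ~/.dbt/profiles.yml from Terraform output.
# The tf_outputs.json keys, profile name, and target fields are assumptions.
import json
import os
from pathlib import Path

import yaml  # assumes PyYAML is installed

# Terraform outputs exported with: terraform output -json > tf_outputs.json
outputs = json.loads(Path("tf_outputs.json").read_text())

profiles = {
    "aircrash_dwh": {
        "target": "prod",
        "outputs": {
            "prod": {
                "type": "redshift",
                "host": outputs["redshift_endpoint"]["value"],
                "port": 5439,
                "user": "awsuser",
                "password": os.environ["REDSHIFT_PASSWORD"],  # kept out of git via .env
                "dbname": "dev",
                "schema": "analytics",
                "threads": 4,
            }
        },
    }
}

profile_path = Path.home() / ".dbt" / "profiles.yml"
profile_path.parent.mkdir(parents=True, exist_ok=True)
profile_path.write_text(yaml.safe_dump(profiles, sort_keys=False))
print(f"Wrote {profile_path}")
```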
Be sure your `.gitignore` excludes:

```
.terraform/
*.tfstate
*.pem
.env
*.pyc
__pycache__/
tf_outputs.json
dbt_packages/
target/
```
Project built by Hossam as part of the DataTalksClub Data Engineering Zoomcamp.