UFO Sighting ELT Pipeline

About • Installation • Dashboard • ELT Diagram • Improvements

About

This project showcases an Extract, Load, Transform (ELT) pipeline built with Python, Apache Spark, Delta Lake, and Docker. The objective of the project is to scrape UFO sighting data from the National UFO Reporting Center (NUFORC) and process it through the Medallion architecture to create a star schema in the Gold layer.

The pipeline begins with Python-based web scraping, extracting UFO sighting data from NUFORC. The scraped data is then transformed using Apache Spark, a distributed big data processing framework that handles the cleansing, manipulation, and aggregation of the extracted data.
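The extract step can be sketched with just the standard library. This is a minimal, illustrative parser for a NUFORC-style report table; the column layout (date / city / state / shape / duration) is an assumption for illustration, not the project's actual code.

```python
from html.parser import HTMLParser

# Minimal sketch of the extract step. The five-column layout below is an
# assumption based on NUFORC-style report index pages, not the repo's scraper.
class SightingTableParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
        elif tag == "tr":
            self._row = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        # Only keep text that appears inside a table cell
        if self._in_cell and data.strip():
            self._row.append(data.strip())

sample = """
<table>
  <tr><td>6/1/2023</td><td>Phoenix</td><td>AZ</td><td>Light</td><td>5 minutes</td></tr>
</table>
"""
parser = SightingTableParser()
parser.feed(sample)
columns = ["date", "city", "state", "shape", "duration"]
records = [dict(zip(columns, row)) for row in parser.rows]
```

In the real pipeline the raw records would be landed as-is into the Bronze layer before any cleansing.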

To ensure reliability and scalability, the data is stored in Delta Lake, an open-source storage layer built on top of Apache Parquet and Apache Spark. Delta Lake provides ACID transactions, schema enforcement, and versioning capabilities, making it ideal for data pipeline workflows.

The project is containerized using Docker, allowing for easy deployment and reproducibility across different environments. Docker enables seamless packaging and distribution of the entire pipeline, ensuring consistent execution and dependency management.

The result is a well-organized ELT pipeline that follows the Medallion architecture principles, with Bronze, Silver, and Gold layers. The Bronze layer contains the raw, unprocessed data. The Silver layer represents the transformed and cleansed data, while the Gold layer consists of a star schema, enabling efficient querying and analysis.
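The Silver-to-Gold step amounts to splitting flat records into dimension tables plus a fact table keyed to them. A sketch in plain Python for brevity (the project does this with Spark DataFrames; the column names and dimension choices here are assumptions, not the actual schema):

```python
# Illustrative star-schema split: flat Silver records -> dimensions + fact rows.
# Column names and dimensions are hypothetical, not the repo's actual schema.
def to_star_schema(sightings):
    dim_location, dim_shape, facts = {}, {}, []
    for s in sightings:
        # Assign each distinct value a surrogate key on first sight
        loc_id = dim_location.setdefault((s["city"], s["state"]), len(dim_location) + 1)
        shape_id = dim_shape.setdefault(s["shape"], len(dim_shape) + 1)
        # Fact row references dimensions by key instead of repeating values
        facts.append({"date": s["date"], "location_id": loc_id, "shape_id": shape_id})
    return dim_location, dim_shape, facts

silver = [
    {"date": "6/1/2023", "city": "Phoenix", "state": "AZ", "shape": "Light"},
    {"date": "6/2/2023", "city": "Phoenix", "state": "AZ", "shape": "Disk"},
]
locations, shapes, facts = to_star_schema(silver)
```

Because repeated attribute values collapse into single dimension rows, queries in the Gold layer can filter and group on small dimension tables instead of scanning wide fact records.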

ELT Diagram

*(ELT pipeline diagram)*

Key Technologies:

  • Python: Web scraping and scripting language
  • Apache Spark: Big data processing and transformation framework
  • Delta Lake: Data storage layer with ACID transactions and versioning
  • Docker: Containerization platform for easy deployment and reproducibility
  • Tableau: Visual analytics platform

Application services at runtime:

  • One Spark driver
  • One Spark master
  • Two Spark worker nodes
  • Spark History Server
  • Jupyter Lab
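That topology could be expressed in a Docker Compose file along these lines. This is a hedged sketch: the service names, image name, and port mappings are illustrative assumptions, not copied from the repo's compose file.

```yaml
# Hypothetical sketch of the runtime services -- names and image are illustrative.
services:
  spark-master:
    image: custom-spark        # assumed image name
    ports: ["7070:8080"]       # master UI exposed on localhost:7070
  spark-worker-1:
    image: custom-spark
    depends_on: [spark-master]
  spark-worker-2:
    image: custom-spark
    depends_on: [spark-master]
  spark-history:
    image: custom-spark
    ports: ["18080:18080"]     # Spark History Server UI
  jupyter:
    image: custom-spark
    ports: ["8888:8888"]       # Jupyter Lab
```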

Installation

  1. Download Docker Desktop and start Docker
  2. Clone the repo
git clone https://github.com/jgrove90/ufo-deltalake.git
  3. Run start.sh to start the Spark application
sh start.sh
     NOTE: You may need to adjust the resources allocated to the master and worker nodes to match your system resources. These settings can be found in: ./src/spark/spark-defaults.conf
  4. Access application services via the web browser
    • Spark Master UI - http://localhost:7070/
    • Spark History Server - http://localhost:18080/
    • Jupyter Lab - http://localhost:8888/
  5. Run teardown.sh to remove the application from your system, including Docker images
sh teardown.sh
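The spark-defaults.conf file mentioned above uses standard Spark configuration properties. An illustrative fragment (the property names are standard Spark settings, but the values here are examples, not the repo's defaults):

```properties
# Illustrative resource settings -- tune to your machine; values are examples.
spark.driver.memory      2g
spark.executor.memory    2g
spark.executor.cores     2
spark.eventLog.enabled   true
spark.eventLog.dir       /opt/spark/spark-events
```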

Improvements

This project is over-engineered: Apache Spark is best suited for big data, but I wanted to explore Delta Lake, which was the main focus of the project. Any of the following packages could have performed the transformations without Apache Spark, most likely with a performance boost:

  • Pandas
  • Polars
  • Delta-rs
  • DuckDB

More features could have been added to the Silver table, such as:

  • Other astronomical events
  • Geocoded addresses using an API (attempted, but it would cost money)
  • Links to pictures
  • Slowly changing dimensions as updates are made to the website

Finally, a more in-depth statistical analysis could be performed using:

  • Jupyter Lab
  • Dashboards (might revisit this in PowerBI/Tableau)
