UFO Sighting ELT Pipeline

About • Installation • Dashboard • ELT Diagram • Improvements

About

This project showcases an Extract, Load, Transform (ELT) pipeline built with Python, Apache Spark, Delta Lake, and Docker. The objective of the project is to scrape UFO sighting data from the National UFO Reporting Center (NUFORC) and process it through the Medallion architecture to create a star schema in the Gold layer.

The pipeline begins with Python-based web scraping, extracting UFO sighting data from NUFORC. The scraped data is then transformed using Apache Spark, a distributed big data processing framework that handles the cleansing, manipulation, and aggregation of the extracted data.
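The extract step can be sketched with just the standard library. This is a minimal, illustrative parser for a NUFORC-style report table; the column layout (date / city / state / shape / duration) is an assumption for illustration, not the project's actual code.

```python
from html.parser import HTMLParser

# Minimal sketch of the extract step. The five-column layout below is an
# assumption based on NUFORC-style report index pages, not the repo's scraper.
class SightingTableParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
        elif tag == "tr":
            self._row = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        # Only keep text that appears inside a table cell
        if self._in_cell and data.strip():
            self._row.append(data.strip())

sample = """
<table>
  <tr><td>6/1/2023</td><td>Phoenix</td><td>AZ</td><td>Light</td><td>5 minutes</td></tr>
</table>
"""
parser = SightingTableParser()
parser.feed(sample)
columns = ["date", "city", "state", "shape", "duration"]
records = [dict(zip(columns, row)) for row in parser.rows]
```

In the real pipeline the raw records would be landed as-is into the Bronze layer before any cleansing.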

To ensure reliability and scalability, the data is stored in Delta Lake, an open-source storage layer built on top of Apache Parquet and Apache Spark. Delta Lake provides ACID transactions, schema enforcement, and versioning capabilities, making it ideal for data pipeline workflows.

The project is containerized using Docker, allowing for easy deployment and reproducibility across different environments. Docker enables seamless packaging and distribution of the entire pipeline, ensuring consistent execution and dependency management.

The result is a well-organized ELT pipeline that follows the Medallion architecture principles, with Bronze, Silver, and Gold layers. The Bronze layer contains the raw, unprocessed data. The Silver layer represents the transformed and cleansed data, while the Gold layer consists of a star schema, enabling efficient querying and analysis.
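The Silver-to-Gold step amounts to splitting flat records into dimension tables plus a fact table keyed to them. A sketch in plain Python for brevity (the project does this with Spark DataFrames; the column names and dimension choices here are assumptions, not the actual schema):

```python
# Illustrative star-schema split: flat Silver records -> dimensions + fact rows.
# Column names and dimensions are hypothetical, not the repo's actual schema.
def to_star_schema(sightings):
    dim_location, dim_shape, facts = {}, {}, []
    for s in sightings:
        # Assign each distinct value a surrogate key on first sight
        loc_id = dim_location.setdefault((s["city"], s["state"]), len(dim_location) + 1)
        shape_id = dim_shape.setdefault(s["shape"], len(dim_shape) + 1)
        # Fact row references dimensions by key instead of repeating values
        facts.append({"date": s["date"], "location_id": loc_id, "shape_id": shape_id})
    return dim_location, dim_shape, facts

silver = [
    {"date": "6/1/2023", "city": "Phoenix", "state": "AZ", "shape": "Light"},
    {"date": "6/2/2023", "city": "Phoenix", "state": "AZ", "shape": "Disk"},
]
locations, shapes, facts = to_star_schema(silver)
```

Because repeated attribute values collapse into single dimension rows, queries in the Gold layer can filter and group on small dimension tables instead of scanning wide fact records.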

ELT Diagram

*(ELT pipeline diagram)*

Key Technologies:

  • Python: Web scraping and scripting language
  • Apache Spark: Big data processing and transformation framework
  • Delta Lake: Data storage layer with ACID transactions and versioning
  • Docker: Containerization platform for easy deployment and reproducibility
  • Tableau: Visual analytics platform

Application services at runtime:

  • One Spark driver
  • One Spark master
  • Two Spark worker nodes
  • Spark History Server
  • Jupyter Lab
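That topology could be expressed in a Docker Compose file along these lines. This is a hedged sketch: the service names, image name, and port mappings are illustrative assumptions, not copied from the repo's compose file.

```yaml
# Hypothetical sketch of the runtime services -- names and image are illustrative.
services:
  spark-master:
    image: custom-spark        # assumed image name
    ports: ["7070:8080"]       # master UI exposed on localhost:7070
  spark-worker-1:
    image: custom-spark
    depends_on: [spark-master]
  spark-worker-2:
    image: custom-spark
    depends_on: [spark-master]
  spark-history:
    image: custom-spark
    ports: ["18080:18080"]     # Spark History Server UI
  jupyter:
    image: custom-spark
    ports: ["8888:8888"]       # Jupyter Lab
```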

Installation

  1. Download Docker Desktop and start Docker
  2. Clone the repo
git clone https://github.com/jgrove90/ufo-deltalake.git
  3. Run start.sh to start the Spark application
sh start.sh
     NOTE: You may need to adjust the resources allocated to the master and worker nodes to match your system resources. These settings can be found in: ./src/spark/spark-defaults.conf
  4. Access application services via the web browser
    • Spark Master UI - http://localhost:7070/
    • Spark History Server - http://localhost:18080/
    • Jupyter Lab - http://localhost:8888/
  5. Run teardown.sh to remove the application from your system, including Docker images
sh teardown.sh
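The spark-defaults.conf file mentioned above uses standard Spark configuration properties. An illustrative fragment (the property names are standard Spark settings, but the values here are examples, not the repo's defaults):

```properties
# Illustrative resource settings -- tune to your machine; values are examples.
spark.driver.memory      2g
spark.executor.memory    2g
spark.executor.cores     2
spark.eventLog.enabled   true
spark.eventLog.dir       /opt/spark/spark-events
```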

Improvements

This project is over-engineered: Apache Spark is best suited for big data, but I wanted to explore Delta Lake, which was the main focus of the project. Any of the following packages could have performed the transformations without Apache Spark, most likely with a performance boost:

  • Pandas
  • Polars
  • Delta-rs
  • DuckDB

More features could have been added to the Silver table, such as:

  • Other astronomical events
  • Geocoded addresses using an API (attempted, but it would cost money)
  • Links to pictures
  • Slowly changing dimensions as updates are made to the website

Finally, a more in-depth statistical analysis could be performed using:

  • Jupyter Lab
  • Dashboards (might revisit this in PowerBI/Tableau)
