Project 2
EdX(2U) & UT Data Analytics and Visualization Bootcamp
Cohort UTA-VIRT-DATA-PT-11-2024-U-LOLC
By:
Manny Guevara
Neel Agarwal
Rob LaPreze
Samora Machel
- Project Overview
- Deliverables
- Directory Structure
- System Requirements
- Installation & Setup
- ETL Process
- Schema Creation & Data Import
- Regex Approach (Optional)
- ERD Overview
- Limitations
- Future Work
- Usage Notes & Tips
- Credits & Citations
This project is designed to showcase an end-to-end Extract, Transform, and Load (ETL) process for crowdfunding campaign data as part of the edX/2U Data Analytics Bootcamp.
The objectives of this project are:
- Extract crowdfunding data from provided Excel spreadsheets.
- Transform and clean the data into a relational database-friendly format.
- Load the cleaned data into a PostgreSQL database using SQL scripts.
The project emphasizes reproducibility and scalability by providing optional automation through notebook-driven SQL generation, all while fulfilling the assignment rubric. It avoids GUI-based imports (like pgAdmin's CSV import tool) to maintain cross-environment compatibility, even though doing so would be a reasonable option as well.
According to the project rubric, the following are required:
- ETL_FINAL.ipynb: The main notebook that handles extraction, transformation, and CSV creation.
- crowdfunding_db_schema.sql: Defines the database schema for the transformed data. A rough draft is generated by schema_writer.ipynb.
- ERD.jpg: Visual representation of the Entity Relationship Diagram (ERD) of the database.
- schema_writer.ipynb: Automatically generates a rough-draft SQL schema and fully functional import statements to bulk-upload the CSV files.
- crowdfunding_db_import.sql: Optional script to bulk-load CSVs into the database using COPY commands. Not part of the repo; it is generated by schema_writer.ipynb when all cells are run.
- RAW_autoSchema.sql (Generated, NOT FOR USE): Referring to the directory structure below, this file is an unrefined, preliminary version of the schema produced automatically before manual corrections. It lacks necessary constraints, proper VARCHAR lengths, and tailored adjustments. It remains in the repository for historical and development reference only.
Crowdfunding_ETL/
│
├── ETL/
│ ├── Deliverables/
│ │ ├── crowdfunding_db_schema.sql (Required)
│ │ └── ETL_FINAL.ipynb (Required)
│ └── Extras/
│ ├── crowdfunding_db_import.sql (Generated/Optional)
│ ├── RAW_autoSchema.sql (Generated/NOT FOR USE)
│ └── schema_writer.ipynb (Optional)
│
├── Resources/
│ ├── Input/
│ │ ├── contacts.xlsx
│ │ └── crowdfunding.xlsx
│ └── Output/
│ ├── campaign.csv
│ ├── category.csv
│ ├── contacts.csv
│ └── subcategory.csv
│
├── .gitignore
├── ERD.jpg (Required)
└── README.md
To successfully run this project, the following environment is recommended:
- Operating System: macOS, Windows, or Linux
- Python Version: 3.9 or higher
- PostgreSQL: Version 13 or higher
- Required Python Libraries: pandas, numpy, pathlib (built-in), re (built-in), json (built-in), datetime (built-in)
- Jupyter Notebook: For running .ipynb files (install via pip install notebook if needed)
- PostgreSQL GUI (optional): pgAdmin 4 or any compatible interface
- Disk Space: Minimum 500 MB free
- Memory (RAM): Minimum 4 GB recommended
Ensure PostgreSQL is installed and running before executing the SQL scripts.
- Clone or download this repository.
- Ensure you have Python 3.x installed with pandas:
pip install pandas numpy
# You should have ipykernel already; if not, install it:
pip install ipykernel
- (Optional) Set up a virtual environment:
cd YOUR/PATH/TO/REPO/HERE        # Move to your local repo
python -m venv venv              # Create a new virtual env
source venv/bin/activate         # Activate it (Mac/Linux)
venv\Scripts\activate            # ...or activate it (Windows)
pip install -r requirements.txt  # Install logged dependencies
deactivate                       # Deactivate the virtual env when finished
- Now you're ready to move to the Jupyter Notebook view!
Tip
As you might've gathered, a helper file was written to automate SQL code generation. It can produce a SQL file that bulk-uploads the CSV files. Feel free to upload the data in whatever way is easiest, but to create this file, please run all cells in schema_writer.ipynb.
- Decide on the most efficient Python-based ETL tool.
- Install dependencies.
- Set up DataFrames so wide rows of data can be read on-screen.
- Read in crowdfunding.xlsx and contacts.xlsx as DataFrames.
- Take a quick look at data values, types, and structure before transformation to grasp what kind of database is being built (a brief sketch of these steps follows this list).
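A minimal sketch of this extraction step, assuming the workbooks live in Resources/Input/ as laid out in the directory structure (sheet layouts and the exact display options used in ETL_FINAL.ipynb may differ):

import pandas as pd

# Widen pandas display so long rows can be inspected on-screen
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

# Read the provided Excel workbooks into DataFrames (.xlsx reading requires openpyxl)
crowdfunding_info_df = pd.read_excel("Resources/Input/crowdfunding.xlsx")
contact_info_df = pd.read_excel("Resources/Input/contacts.xlsx")

# Quick look at values, dtypes, and structure before transforming
print(crowdfunding_info_df.head())
print(crowdfunding_info_df.dtypes)
print(contact_info_df.head())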
Create four CSVs (campaign.csv, contacts.csv, category.csv, and subcategory.csv) by cleaning, editing, and normalizing the data.
- Split "category & sub-category" into separate "category" and "subcategory" columns (a condensed sketch follows this list):
  - crowdfunding.xlsx holds combined info that needed to be methodically separated.
  - Create new identifiers for each entry in the new tables so they can be linked back to the main table.
  - Move each into its own new table along with its unique identifiers.
- Clean the crowdfunding (main) DataFrame and perform major edits:
  - removing unnecessary sets from tables
  - renaming/reordering columns to be more informative and match conventions
  - merging to insert foreign key designations
  - standardizing data types for the pandas ETL and the SQL schema
- Mitigate bad input formatting to make contacts.xlsx usable:
  - The original data was stored incorrectly as dictionary-style strings (i.e., {} curly brackets) within Excel, rendering it unreadable by normal means.
  - Select one of two methods to parse the DataFrame created from contacts.xlsx (pandas or regular expressions).
  - While both methods were completed for learning's sake and to measure efficiency, the pandas method was selected based on that comparison.
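A condensed sketch of the category/subcategory split and foreign-key merge described above, assuming the combined values are separated by a "/" and that IDs are simple "cat"/"subcat" prefixes (the exact formats used in ETL_FINAL.ipynb may differ):

import pandas as pd

# Split the combined column into two new columns on the "/" separator
crowdfunding_info_df[["category", "subcategory"]] = (
    crowdfunding_info_df["category & sub-category"].str.split("/", n=1, expand=True)
)

# Build lookup tables of unique values, each with its own identifier
categories = crowdfunding_info_df["category"].unique()
subcategories = crowdfunding_info_df["subcategory"].unique()
category_df = pd.DataFrame({
    "category_id": ["cat" + str(i) for i in range(1, len(categories) + 1)],
    "category": categories,
})
subcategory_df = pd.DataFrame({
    "subcategory_id": ["subcat" + str(i) for i in range(1, len(subcategories) + 1)],
    "subcategory": subcategories,
})

# Merge the new IDs back into the main DataFrame as foreign keys
campaign_df = (
    crowdfunding_info_df
    .merge(category_df, on="category", how="left")
    .merge(subcategory_df, on="subcategory", how="left")
)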
- Create/connect to crowdfunding_db in PostgreSQL.
- Add the schema to the database using crowdfunding_db_schema.sql.
- Load the data into the database, either by:
  - Using PostgreSQL's import tool in the pgAdmin GUI, or
  - Using psql: open schema_writer.ipynb and run all cells. Afterwards, a crowdfunding_db_import.sql file will be generated that bulk-imports the CSVs when run in pgAdmin 4 or psql.
While the basic requirements for this project specify only the inclusion of a schema file, converting a pandas DataFrame into a PostgreSQL schema can be arduous and error-prone, so a helper notebook was written to do most of the heavy lifting and data validation. Constraint editing, rounding up VARCHAR lengths, and fixing the DATE types in particular were done after the schema file was built automatically. However, the bulk-import statements that come from this helper file can be used as-is and may even be easier.
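To illustrate how those bulk-import statements can be produced, here is a rough sketch of the kind of loop schema_writer.ipynb relies on, assuming the CSVs sit in Resources/Output/ (the actual notebook's paths, ordering, and formatting may differ):

from pathlib import Path

output_dir = Path("Resources/Output")                       # folder holding the four generated CSVs
import_sql = Path("ETL/Extras/crowdfunding_db_import.sql")  # destination for the generated script

# Load order matters: parent tables must be populated before campaign
tables = ["category", "subcategory", "contacts", "campaign"]

statements = []
for table in tables:
    csv_path = (output_dir / f"{table}.csv").resolve()
    statements.append(f"COPY {table}\nFROM '{csv_path}'\nDELIMITER ',' CSV HEADER;\n")

import_sql.write_text("\n".join(statements))
print(f"Wrote {len(tables)} COPY statements to {import_sql}")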
Note
The VARCHAR lengths for variable fields (like first_name, last_name, email, company_name, and description) were determined by the maximum string lengths in the dataset. These lengths were manually rounded up to accommodate longer entries in future datasets, preventing data truncation issues.
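For reference, those maximum lengths can be recovered quickly with pandas before rounding them up by hand; the DataFrame names below (campaign_df, contacts_df) are hypothetical placeholders and should be adjusted to match the notebook:

# Longest string per text column, to inform VARCHAR sizing (DataFrame names are placeholders)
for col in ["company_name", "description"]:
    print(col, campaign_df[col].astype(str).str.len().max())

for col in ["first_name", "last_name", "email"]:
    print(col, contacts_df[col].astype(str).str.len().max())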
Note the lack of primary and foreign key declarations, as well as the need to enclose description in double quotes ("") so it is read properly as a column name and not as a keyword. This is an example of the auto-generated SQL code created as a proof of concept, to speed up schema creation and to ensure it is done without errors. For more information on RAW_autoSchema.sql, please click here.
CREATE TABLE campaign (
cf_id INT
,contact_id INT
,company_name VARCHAR(33) NOT NULL
,description VARCHAR(53)
,goal DEC NOT NULL
,pledged DEC NOT NULL
,outcome VARCHAR(10) NOT NULL
,backers_count INT NOT NULL
,country CHAR(2) NOT NULL
,currency CHAR(3) NOT NULL
,launch_date CHAR(10) NOT NULL
,end_date CHAR(10)
,category_id CHAR(4)
,subcategory_id VARCHAR(8)
);
This is a snippet of the finalized schema code, after the final edits have taken place. Keys are now declared properly, some VARCHAR lengths have been rounded up (although they could have been rounded up even more, since they hold personal identification material), and launch_date and end_date have been changed to the PostgreSQL DATE type.
CREATE TABLE campaign (
cf_id INT
,contact_id INT
,company_name VARCHAR(40) NOT NULL
,"description" VARCHAR(60)
,goal DEC NOT NULL
,pledged DEC NOT NULL
,outcome VARCHAR(10) NOT NULL
,backers_count INT NOT NULL
,country CHAR(2) NOT NULL
,currency CHAR(3) NOT NULL
,launch_date DATE NOT NULL
,end_date DATE
,category_id CHAR(4)
,subcategory_id VARCHAR(8)
,CONSTRAINT pk_cf_id
PRIMARY KEY (cf_id)
,CONSTRAINT fk_contact_id
FOREIGN KEY (contact_id)
REFERENCES contacts(contact_id)
,CONSTRAINT fk_cat_id
FOREIGN KEY (category_id)
REFERENCES category(category_id)
,CONSTRAINT fk_subcat_id
FOREIGN KEY (subcategory_id)
REFERENCES subcategory(subcategory_id)
);
Note
Importing can happily be done using the GUI, but other options have been provided! Please run all cells in schema_writer.ipynb for an auto-generated SQL import file called crowdfunding_db_import.sql.
COPY campaign
FROM '/your/local/path/Resources/Output/campaign.csv'
DELIMITER ',' CSV HEADER;
-- Optional:
\i crowdfunding_db_schema.sql
\i crowdfunding_db_import.sql
An alternative method explored involved using regular expressions to extract fields from semi-structured text. This was ultimately replaced by pandas-based logic for efficiency.
import re

# Patterns for pulling fields out of each dictionary-style string
pattern_id = r'\d{4}'                # four-digit contact id
pattern_email = r'"email": "(.+)"'   # capture the email value
pattern_name = r'(\w+\s\w+)'         # first and last name

for row in contact_info_df['raw_data_column']:
    found_id = re.search(pattern_id, row).group()
    found_email = re.search(pattern_email, row).group(1)
    found_name = re.search(pattern_name, row).group()
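The pandas-based method that was ultimately selected is not shown above. A minimal sketch of that approach, assuming each raw row is a JSON-style dictionary string whose keys include contact_id, name, and email (key names are inferred from the final contacts.csv columns and are an assumption):

import json
import pandas as pd

# Convert each dictionary-style string into a real dict, then into columns
parsed = [json.loads(row) for row in contact_info_df["raw_data_column"]]
contacts_df = pd.DataFrame(parsed)

# Split the full name into first/last columns, matching the final contacts.csv layout
contacts_df[["first_name", "last_name"]] = contacts_df["name"].str.split(" ", n=1, expand=True)
contacts_df = contacts_df.drop(columns=["name"])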
This database schema follows normalization principles to minimize redundancy, maintain data integrity, and ensure scalability. By splitting data into tables like category, subcategory, contacts, and campaign, we:
- Reduce repeated data entries.
- Simplify updates to shared information.
- Improve database performance.
For example:
- Categories and subcategories are separated to prevent repeated text entries in multiple campaigns.
- Contacts are stored independently to link multiple campaigns to a single contact entry.
- campaign.contact_id → contacts.contact_id
- campaign.category_id → category.category_id
- campaign.subcategory_id → subcategory.subcategory_id
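Those relationships can be sanity-checked on the generated CSVs before loading. A small sketch (not part of the required deliverables) using the files in Resources/Output/:

import pandas as pd

campaign = pd.read_csv("Resources/Output/campaign.csv")
contacts = pd.read_csv("Resources/Output/contacts.csv")
category = pd.read_csv("Resources/Output/category.csv")
subcategory = pd.read_csv("Resources/Output/subcategory.csv")

# Every foreign key in campaign should resolve to a primary key in its parent table
checks = {
    "contact_id": campaign["contact_id"].isin(contacts["contact_id"]).all(),
    "category_id": campaign["category_id"].isin(category["category_id"]).all(),
    "subcategory_id": campaign["subcategory_id"].isin(subcategory["subcategory_id"]).all(),
}
print(checks)  # expect True for all three if the ERD relationships hold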
- VARCHAR Length Sensitivity: While the maximum observed string lengths in the dataset were manually rounded up to define VARCHAR limits for fields like first_name, last_name, email, company_name, and description, future datasets with significantly longer values may require further schema modifications to avoid truncation.
- Manual Schema Refinements: Although schema_writer.ipynb automates SQL generation, elements like primary and foreign keys, certain constraints like NOT NULL, and buffers for VARCHAR lengths were applied manually or outside the automated process. Future iterations could integrate these features directly into the schema generator.
- Limited Data Validation: The ETL process performs cleaning and transformation, but deeper validation (such as checking for duplicate records, malformed emails, or invalid date ranges) is minimal. Adding robust data validation steps could help ensure higher data quality.
- Static File Paths: The generated import SQL file includes file paths that may need to be manually adjusted depending on the user's environment. This may limit portability across systems unless dynamic path handling is integrated.
- Regex Solution Not Integrated: While an alternative regex-based contact parser was created, it remains outside the main workflow. Incorporating it as a selectable option within the ETL pipeline could provide flexibility for future datasets with more irregular formatting.
- Scalability: This project was developed with relatively small sample datasets. Performance with much larger datasets (millions of records) has not been tested and may require optimization of memory handling and database indexing.
- Normalization Trade-offs: The schema is normalized to reduce redundancy, but highly normalized schemas can sometimes lead to more complex queries and slower performance in certain analytics workflows. Denormalization strategies may be considered depending on usage needs.
- Automate the inclusion of schema constraints (primary/foreign keys, uniqueness) directly within schema_writer.ipynb.
- Introduce enhanced validation during the ETL process, such as email pattern checks, duplicate detection, and date validations (a brief sketch follows this list).
- Add support for dynamic file path handling in the SQL import script to improve cross-system portability.
- Explore potential denormalization strategies to optimize read-heavy workloads and large-scale reporting.
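A hedged sketch of what those extra validation checks could look like in pandas; the email pattern and DataFrame names (contacts_df, campaign_df) are illustrative placeholders, not part of the current pipeline:

import pandas as pd

# Illustrative email pattern; production validation may need a stricter rule
email_ok = contacts_df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
print("Malformed emails:", (~email_ok).sum())

# Duplicate detection on the primary key
print("Duplicate contact_ids:", contacts_df["contact_id"].duplicated().sum())

# Date-range validation: a campaign should not end before it launches
launch = pd.to_datetime(campaign_df["launch_date"])
end = pd.to_datetime(campaign_df["end_date"])
print("Invalid date ranges:", (end < launch).sum())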
- Path Fixes: You can update file paths if you move files; however, the files are organized for clean runs as-is.
- Manual Editing: You can manually revise crowdfunding_db_schema.sql if you need additional constraints like PRIMARY KEY or FOREIGN KEY.
- Enhancement: Run schema_writer.ipynb for easy PostgreSQL file uploading; ignore the extra RAW_autoSchema.sql file that comes with it.
- Optional Tools: schema_writer.ipynb and the import script are purely optional. They demonstrate an automated approach but aren't required per the rubric.
- Collaborators:
- Manny Guevara
- Neel Agarwal
- Rob LaPreze
- Samora Machel
- Starter Code & Data: Provided by UT/edX/2U Data Analytics Bootcamp.
- README.md: Created using OpenAI's ChatGPT LLM, prompted with prior READMEs, all the deliverables, and the rubric provided by edX/2U.
- PostgreSQL Docs: postgresql.org/docs
- pandas Docs: pandas.pydata.org/docs