Project 2
EdX(2U) & UT Data Analytics and Visualization Bootcamp
Cohort UTA-VIRT-DATA-PT-11-2024-U-LOLC
By:
Manny Guevara
Neel Agarwal
Rob LaPreze
Samora Machel
- Project Overview
- Deliverables
- Directory Structure
- System Requirements
- Installation & Setup
- ETL Process
- Schema Creation & Data Import
- Regex Approach (Optional)
- ERD Overview
- Limitations
- Future Work
- Usage Notes & Tips
- Credits & Citations
This project is designed to showcase an end-to-end Extract, Transform, and Load (ETL) process for crowdfunding campaign data as part of the edX/2U Data Analytics Bootcamp.
The objectives of this project are:
- Extract crowdfunding data from provided Excel spreadsheets.
- Transform and clean the data into a relational database-friendly format.
- Load the cleaned data into a PostgreSQL database using SQL scripts.
The project emphasizes reproducibility and scalability by providing optional automation through notebook-driven SQL generation, all while fulfilling the assignment rubric. It avoids GUI-based imports (like pgAdmin's CSV import tool) to maintain cross-environment compatibility, even though doing so would be a reasonable option as well.
According to the project rubric, the following are required:
- ETL_FINAL.ipynb: The main notebook that handles extraction, transformation, and CSV creation.
- crowdfunding_db_schema.sql: Defines the database schema for the transformed data. A rough draft is generated by schema_writer.ipynb.
- ERD.jpg: Visual representation of the Entity Relationship Diagram (ERD) of the database.
- schema_writer.ipynb: Automatically generates a rough-draft SQL schema and fully functional import statements to bulk-upload the CSV files.
- crowdfunding_db_import.sql: Optional script to bulk-load CSVs into the database using COPY commands. Not part of the repo; it is generated by schema_writer.ipynb when all cells are run.
- RAW_autoSchema.sql (Generated, NOT FOR USE): Referring to the directory structure below, this file is an unrefined, preliminary version of the schema produced automatically before manual corrections. It lacks necessary constraints, proper VARCHAR lengths, and tailored adjustments. It remains in the repository for historical and development reference only.
Crowdfunding_ETL/
│
├── ETL/
│ ├── Deliverables/
│ │ ├── crowdfunding_db_schema.sql (Required)
│ │ └── ETL_FINAL.ipynb (Required)
│ └── Extras/
│ ├── crowdfunding_db_import.sql (Generated/Optional)
│ ├── RAW_autoSchema.sql (Generated/NOT FOR USE)
│ └── schema_writer.ipynb (Optional)
│
├── Resources/
│ ├── Input/
│ │ ├── contacts.xlsx
│ │ └── crowdfunding.xlsx
│ └── Output/
│ ├── campaign.csv
│ ├── category.csv
│ ├── contacts.csv
│ └── subcategory.csv
│
├── .gitignore
├── ERD.jpg (Required)
└── README.md
To successfully run this project, the following environment is recommended:
- Operating System: macOS, Windows, or Linux
- Python Version: 3.9 or higher
- PostgreSQL: Version 13 or higher
- Required Python Libraries: pandas, numpy, pathlib (built-in), re (built-in), json (built-in), datetime (built-in)
- Jupyter Notebook: For running .ipynb files (install via pip install notebook if needed)
- PostgreSQL GUI (optional): pgAdmin 4 or any compatible interface
- Disk Space: Minimum 500 MB free
- Memory (RAM): Minimum 4 GB recommended
Ensure PostgreSQL is installed and running before executing the SQL scripts.
- Clone or download this repository.
- Ensure you have Python 3.x installed with pandas:
pip install pandas numpy
# You should have ipykernel already; if not, install it:
pip install ipykernel
- (Optional) Set up a virtual environment:
cd YOUR/PATH/TO/REPO/HERE        # Move to your local repo
python -m venv venv              # Create a new virtual env
source venv/bin/activate         # Activate it (Mac/Linux)
venv\Scripts\activate            # ...or activate it (Windows)
pip install -r requirements.txt  # Install logged dependencies
deactivate                       # Deactivate the virtual env when finished
- Now you're ready to move to the Jupyter Notebook view!
Tip
As you might've gathered, a helper file was written to automate SQL code generation. It can produce a SQL file that bulk-uploads the CSV files. Feel free to upload the data in whatever way is easiest, but to create this file, please run all cells in schema_writer.ipynb.
- Decide on the most efficient Python-based ETL tool.
- Install dependencies.
- Set up DataFrames so wide rows of data can be read on-screen.
- Read in crowdfunding.xlsx and contacts.xlsx as DataFrames.
- Take a quick look at data values, types, and structure before transformation to grasp what kind of database is being built (a brief sketch of these steps follows this list).
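A minimal sketch of this extraction step, assuming the workbooks live in Resources/Input/ as laid out in the directory structure (sheet layouts and the exact display options used in ETL_FINAL.ipynb may differ):

import pandas as pd

# Widen pandas display so long rows can be inspected on-screen
pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", None)

# Read the provided Excel workbooks into DataFrames (.xlsx reading requires openpyxl)
crowdfunding_info_df = pd.read_excel("Resources/Input/crowdfunding.xlsx")
contact_info_df = pd.read_excel("Resources/Input/contacts.xlsx")

# Quick look at values, dtypes, and structure before transforming
print(crowdfunding_info_df.head())
print(crowdfunding_info_df.dtypes)
print(contact_info_df.head())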
Create four CSVs (campaign.csv, contacts.csv, category.csv, and subcategory.csv) by cleaning, editing, and normalizing the data.
- Split "category & sub-category" into separate "category" and "subcategory" columns (a condensed sketch follows this list):
  - crowdfunding.xlsx holds combined info that needed to be methodically separated.
  - Create new identifiers for each entry in the new tables so they can be linked back to the main table.
  - Move each into its own new table along with its unique identifiers.
- Clean the crowdfunding (main) DataFrame and perform major edits:
  - removing unnecessary sets from tables
  - renaming/reordering columns to be more informative and match conventions
  - merging to insert foreign key designations
  - standardizing data types for the pandas ETL and the SQL schema
- Mitigate bad input formatting to make contacts.xlsx usable:
  - The original data was stored incorrectly as dictionary-style strings (i.e., {} curly brackets) within Excel, rendering it unreadable by normal means.
  - Select one of two methods to parse the DataFrame created from contacts.xlsx (pandas or regular expressions).
  - While both methods were completed for learning's sake and to measure efficiency, the pandas method was selected based on that comparison.
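A condensed sketch of the category/subcategory split and foreign-key merge described above, assuming the combined values are separated by a "/" and that IDs are simple "cat"/"subcat" prefixes (the exact formats used in ETL_FINAL.ipynb may differ):

import pandas as pd

# Split the combined column into two new columns on the "/" separator
crowdfunding_info_df[["category", "subcategory"]] = (
    crowdfunding_info_df["category & sub-category"].str.split("/", n=1, expand=True)
)

# Build lookup tables of unique values, each with its own identifier
categories = crowdfunding_info_df["category"].unique()
subcategories = crowdfunding_info_df["subcategory"].unique()
category_df = pd.DataFrame({
    "category_id": ["cat" + str(i) for i in range(1, len(categories) + 1)],
    "category": categories,
})
subcategory_df = pd.DataFrame({
    "subcategory_id": ["subcat" + str(i) for i in range(1, len(subcategories) + 1)],
    "subcategory": subcategories,
})

# Merge the new IDs back into the main DataFrame as foreign keys
campaign_df = (
    crowdfunding_info_df
    .merge(category_df, on="category", how="left")
    .merge(subcategory_df, on="subcategory", how="left")
)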
- Create/connect to crowdfunding_db in PostgreSQL.
- Add the schema to the database using crowdfunding_db_schema.sql.
- Load the data into the database, either by:
  - Using PostgreSQL's import tool in the pgAdmin GUI, or
  - Using psql: open schema_writer.ipynb and run all cells. Afterwards, a crowdfunding_db_import.sql file will be generated that bulk-imports the CSVs when run in pgAdmin 4 or psql.
While the basic requirements for this project specify only the inclusion of a schema file, converting a pandas DataFrame into a PostgreSQL schema can be arduous and error-prone, so a helper notebook was written to do most of the heavy lifting and data validation. Constraint editing, rounding up VARCHAR lengths, and fixing the DATE types in particular were done after the schema file was built automatically. However, the bulk-import statements that come from this helper file can be used as-is and may even be easier.
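To illustrate how those bulk-import statements can be produced, here is a rough sketch of the kind of loop schema_writer.ipynb relies on, assuming the CSVs sit in Resources/Output/ (the actual notebook's paths, ordering, and formatting may differ):

from pathlib import Path

output_dir = Path("Resources/Output")                       # folder holding the four generated CSVs
import_sql = Path("ETL/Extras/crowdfunding_db_import.sql")  # destination for the generated script

# Load order matters: parent tables must be populated before campaign
tables = ["category", "subcategory", "contacts", "campaign"]

statements = []
for table in tables:
    csv_path = (output_dir / f"{table}.csv").resolve()
    statements.append(f"COPY {table}\nFROM '{csv_path}'\nDELIMITER ',' CSV HEADER;\n")

import_sql.write_text("\n".join(statements))
print(f"Wrote {len(tables)} COPY statements to {import_sql}")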
Note
The VARCHAR lengths for variable fields (like first_name, last_name, email, company_name, and description) were determined by the maximum string lengths in the dataset. These lengths were manually rounded up to accommodate longer entries in future datasets, preventing data truncation issues.
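For reference, those maximum lengths can be recovered quickly with pandas before rounding them up by hand; the DataFrame names below (campaign_df, contacts_df) are hypothetical placeholders and should be adjusted to match the notebook:

# Longest string per text column, to inform VARCHAR sizing (DataFrame names are placeholders)
for col in ["company_name", "description"]:
    print(col, campaign_df[col].astype(str).str.len().max())

for col in ["first_name", "last_name", "email"]:
    print(col, contacts_df[col].astype(str).str.len().max())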
Note the lack of primary and foreign key declarations, as well as the need to enclose description in double quotes ("") so it is read properly as a column name and not as a keyword. This is an example of the auto-generated SQL code created as a proof of concept, to speed up schema creation and to ensure it is done without errors. For more information on RAW_autoSchema.sql, please click here.
CREATE TABLE campaign (
cf_id INT
,contact_id INT
,company_name VARCHAR(33) NOT NULL
,description VARCHAR(53)
,goal DEC NOT NULL
,pledged DEC NOT NULL
,outcome VARCHAR(10) NOT NULL
,backers_count INT NOT NULL
,country CHAR(2) NOT NULL
,currency CHAR(3) NOT NULL
,launch_date CHAR(10) NOT NULL
,end_date CHAR(10)
,category_id CHAR(4)
,subcategory_id VARCHAR(8)
);
This is a snippet of the finalized schema code, after the final edits have taken place. Keys are now declared properly, some VARCHAR lengths have been rounded up (although they could have been rounded up even more, since they hold personal identification material), and launch_date and end_date have been changed to the PostgreSQL DATE type.
CREATE TABLE campaign (
cf_id INT
,contact_id INT
,company_name VARCHAR(40) NOT NULL
,"description" VARCHAR(60)
,goal DEC NOT NULL
,pledged DEC NOT NULL
,outcome VARCHAR(10) NOT NULL
,backers_count INT NOT NULL
,country CHAR(2) NOT NULL
,currency CHAR(3) NOT NULL
,launch_date DATE NOT NULL
,end_date DATE
,category_id CHAR(4)
,subcategory_id VARCHAR(8)
,CONSTRAINT pk_cf_id
PRIMARY KEY (cf_id)
,CONSTRAINT fk_contact_id
FOREIGN KEY (contact_id)
REFERENCES contacts(contact_id)
,CONSTRAINT fk_cat_id
FOREIGN KEY (category_id)
REFERENCES category(category_id)
,CONSTRAINT fk_subcat_id
FOREIGN KEY (subcategory_id)
REFERENCES subcategory(subcategory_id)
);
Note
Importing can happily be done using the GUI, but other options have been provided! Please run all cells in schema_writer.ipynb for an auto-generated SQL import file called crowdfunding_db_import.sql.
COPY campaign
FROM '/your/local/path/Resources/Output/campaign.csv'
DELIMITER ',' CSV HEADER;
-- Optional:
\i crowdfunding_db_schema.sql
\i crowdfunding_db_import.sql
An alternative method explored involved using regular expressions to extract fields from semi-structured text. This was ultimately replaced by pandas-based logic for efficiency.
import re

# Patterns for pulling fields out of each dictionary-style string
pattern_id = r'\d{4}'                # four-digit contact id
pattern_email = r'"email": "(.+)"'   # capture the email value
pattern_name = r'(\w+\s\w+)'         # first and last name

for row in contact_info_df['raw_data_column']:
    found_id = re.search(pattern_id, row).group()
    found_email = re.search(pattern_email, row).group(1)
    found_name = re.search(pattern_name, row).group()
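The pandas-based method that was ultimately selected is not shown above. A minimal sketch of that approach, assuming each raw row is a JSON-style dictionary string whose keys include contact_id, name, and email (key names are inferred from the final contacts.csv columns and are an assumption):

import json
import pandas as pd

# Convert each dictionary-style string into a real dict, then into columns
parsed = [json.loads(row) for row in contact_info_df["raw_data_column"]]
contacts_df = pd.DataFrame(parsed)

# Split the full name into first/last columns, matching the final contacts.csv layout
contacts_df[["first_name", "last_name"]] = contacts_df["name"].str.split(" ", n=1, expand=True)
contacts_df = contacts_df.drop(columns=["name"])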
This database schema follows normalization principles to minimize redundancy, maintain data integrity, and ensure scalability. By splitting data into tables like category, subcategory, contacts, and campaign, we:
- Reduce repeated data entries.
- Simplify updates to shared information.
- Improve database performance.
For example:
- Categories and subcategories are separated to prevent repeated text entries in multiple campaigns.
- Contacts are stored independently to link multiple campaigns to a single contact entry.
- campaign.contact_id → contacts.contact_id
- campaign.category_id → category.category_id
- campaign.subcategory_id → subcategory.subcategory_id
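Those relationships can be sanity-checked on the generated CSVs before loading. A small sketch (not part of the required deliverables) using the files in Resources/Output/:

import pandas as pd

campaign = pd.read_csv("Resources/Output/campaign.csv")
contacts = pd.read_csv("Resources/Output/contacts.csv")
category = pd.read_csv("Resources/Output/category.csv")
subcategory = pd.read_csv("Resources/Output/subcategory.csv")

# Every foreign key in campaign should resolve to a primary key in its parent table
checks = {
    "contact_id": campaign["contact_id"].isin(contacts["contact_id"]).all(),
    "category_id": campaign["category_id"].isin(category["category_id"]).all(),
    "subcategory_id": campaign["subcategory_id"].isin(subcategory["subcategory_id"]).all(),
}
print(checks)  # expect True for all three if the ERD relationships hold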
- VARCHAR Length Sensitivity: While the maximum observed string lengths in the dataset were manually rounded up to define VARCHAR limits for fields like first_name, last_name, email, company_name, and description, future datasets with significantly longer values may require further schema modifications to avoid truncation.
- Manual Schema Refinements: Although schema_writer.ipynb automates SQL generation, elements like primary and foreign keys, certain constraints like NOT NULL, and buffers for VARCHAR lengths were applied manually or outside the automated process. Future iterations could integrate these features directly into the schema generator.
- Limited Data Validation: The ETL process performs cleaning and transformation, but deeper validation (such as checking for duplicate records, malformed emails, or invalid date ranges) is minimal. Adding robust data validation steps could help ensure higher data quality.
- Static File Paths: The generated import SQL file includes file paths that may need to be manually adjusted depending on the user's environment. This may limit portability across systems unless dynamic path handling is integrated.
- Regex Solution Not Integrated: While an alternative regex-based contact parser was created, it remains outside the main workflow. Incorporating it as a selectable option within the ETL pipeline could provide flexibility for future datasets with more irregular formatting.
- Scalability: This project was developed with relatively small sample datasets. Performance with much larger datasets (millions of records) has not been tested and may require optimization of memory handling and database indexing.
- Normalization Trade-offs: The schema is normalized to reduce redundancy, but highly normalized schemas can sometimes lead to more complex queries and slower performance in certain analytics workflows. Denormalization strategies may be considered depending on usage needs.
- Automate the inclusion of schema constraints (primary/foreign keys, uniqueness) directly within schema_writer.ipynb.
- Introduce enhanced validation during the ETL process, such as email pattern checks, duplicate detection, and date validations (a brief sketch follows this list).
- Add support for dynamic file path handling in the SQL import script to improve cross-system portability.
- Explore potential denormalization strategies to optimize read-heavy workloads and large-scale reporting.
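A hedged sketch of what those extra validation checks could look like in pandas; the email pattern and DataFrame names (contacts_df, campaign_df) are illustrative placeholders, not part of the current pipeline:

import pandas as pd

# Illustrative email pattern; production validation may need a stricter rule
email_ok = contacts_df["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
print("Malformed emails:", (~email_ok).sum())

# Duplicate detection on the primary key
print("Duplicate contact_ids:", contacts_df["contact_id"].duplicated().sum())

# Date-range validation: a campaign should not end before it launches
launch = pd.to_datetime(campaign_df["launch_date"])
end = pd.to_datetime(campaign_df["end_date"])
print("Invalid date ranges:", (end < launch).sum())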
- Path Fixes: You can update file paths if you move files; however, the files are organized for clean runs as-is.
- Manual Editing: You can manually revise crowdfunding_db_schema.sql if you need additional constraints like PRIMARY KEY or FOREIGN KEY.
- Enhancement: Run schema_writer.ipynb for easy PostgreSQL file uploading; ignore the extra RAW_autoSchema.sql file that comes with it.
- Optional Tools: schema_writer.ipynb and the import script are purely optional. They demonstrate an automated approach but aren't required per the rubric.
- Collaborators:
- Manny Guevara
- Neel Agarwal
- Rob LaPreze
- Samora Machel
- Starter Code & Data: Provided by UT/edX/2U Data Analytics Bootcamp.
- README.md: Created using OpenAI's ChatGPT LLM, prompted with prior READMEs, all the deliverables, and the rubric provided by edX/2U.
- PostgreSQL Docs: postgresql.org/docs
- pandas Docs: pandas.pydata.org/docs