This project demonstrates a comprehensive SQL data warehousing solution, built from scratch using Azure Data Studio and SQL Server Express. It simulates an enterprise data pipeline by ingesting raw data from source systems, performing structured transformations, and delivering business-ready data for reporting and analytics using the Medallion Architecture. It reflects industry best practices in data engineering, ETL design, and SQL-based data modeling.
This project involves:
- Data Architecture: Designing a modern data warehouse using the Medallion Architecture (Bronze, Silver, and Gold layers).
- ETL Pipelines: Extracting, transforming, and loading data from source systems into the warehouse.
- Data Modeling: Developing fact and dimension tables optimized for analytical queries.
Develop a modern data warehouse using SQL Server to consolidate sales data, enabling analytical reporting and informed decision-making.
- Data Sources: Import data from two source systems (ERP and CRM) provided as CSV files.
- Data Quality: Cleanse and resolve data quality issues prior to analysis.
- Integration: Combine both sources into a single, user-friendly data model designed for analytical queries.
- Scope: Focus on the latest dataset only; historization of data is not required.
- Documentation: Provide clear documentation of the data model to support both business stakeholders and analytics teams.
Business Use Cases:
- Customer segmentation and behavior analysis
- Product performance tracking
- Sales trends
This project utilizes the following tools and technologies for building, managing, and analyzing the SQL data warehouse:
The project follows a Medallion Architecture consisting of three layers: Bronze, Silver, and Gold:
- Bronze Layer: Stores raw data as-is from the source systems. Data is ingested from CSV files into the SQL Server database.
- Silver Layer: Includes data cleansing, standardization, and normalization processes to prepare data for analysis.
- Gold Layer: Houses business-ready data modeled into a star schema for reporting and analytics.
This project demonstrates practical data profiling and cleansing, transforming messy raw data (Bronze) into clean, trusted analytics-ready data (Silver). The table below shows a breakdown of key data quality checks and transformations performed:
| Table | Issue | Records Affected | Resolution |
|---|---|---|---|
| crm_cust_info | Duplicates | 5 | Duplicates removed; kept latest by cst_create_date |
| crm_cust_info | Null customer IDs | 3 | Nulls excluded |
| crm_cust_info | Unwanted spaces in names | 32 | Trimmed first & last names |
| crm_cust_info | Inconsistent genders (F, M, NULL) | 4,577 (NULL) | Standardized to Male/Female; NULL → N/A |
| crm_cust_info | Inconsistent marital status (S, M, NULL) | 6 (NULL) | Standardized to Single/Married; NULL → N/A |
| crm_prd_info | Null product costs | 2 | Set to 0 |
| crm_prd_info | Invalid product lines (M, R, S, T, NULL) | 17 (NULL) | Mapped codes to descriptive lines; NULL → N/A |
| crm_prd_info | Invalid date orders | 200 | Recalculated end dates based on next start date |
| crm_sales_details | Null/invalid order dates | 19 | Fixed or set to NULL |
| crm_sales_details | Data inconsistencies in sales, quantity, and price | 35 | Recalculated sales & price as needed |
| erp_cust_az12 | Invalid birthdates | 31 | Invalid dates set to NULL |
| erp_cust_az12 | Gender inconsistencies | 1,484 | Standardized to Male/Female; blanks → N/A |
| erp_loc_a101 | Country synonyms & blanks | ~8,385 | Merged synonyms (US, USA → United States; DE → Germany); blanks → N/A |
| erp_px_cat_g1v2 | Carriage returns | varied | Cleaned with REPLACE() |
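The deduplication and standardization pattern behind several of the fixes above can be sketched roughly as follows. This is an illustrative example, not the project's exact script; the column names (`cst_id`, `cst_create_date`, `cst_gndr`, etc.) are assumptions:

```sql
-- Illustrative Bronze -> Silver load: deduplicate customers (keep latest by
-- cst_create_date), trim names, and standardize gender codes.
-- Column names are assumed for illustration.
INSERT INTO silver.crm_cust_info (cst_id, cst_firstname, cst_lastname, cst_gndr)
SELECT
    cst_id,
    TRIM(cst_firstname),                      -- remove unwanted spaces
    TRIM(cst_lastname),
    CASE UPPER(TRIM(cst_gndr))                -- standardize gender codes
        WHEN 'F' THEN 'Female'
        WHEN 'M' THEN 'Male'
        ELSE 'N/A'                            -- NULL/unknown -> N/A
    END
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY cst_id
               ORDER BY cst_create_date DESC  -- rank duplicates, latest first
           ) AS rn
    FROM bronze.crm_cust_info
    WHERE cst_id IS NOT NULL                  -- exclude null customer IDs
) ranked
WHERE rn = 1;                                 -- keep only the latest record
```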
Total records impacted:
- 18k+ CRM customers
- 18k+ ERP customers
- 60k+ sales transactions
- 13 country variations unified
The Gold Layer produces a clean, trusted star schema for analytics:
| Gold View | Description |
|---|---|
| gold.dim_customers | Combines CRM & ERP data into a unified customer dimension. Adds surrogate customer_key. |
| gold.dim_products | Enriches CRM product info with ERP product categories. Adds surrogate product_key. |
| gold.fact_sales | Combines sales transactions with product & customer dimensions for a consistent fact table. |

Key design: Uses `ROW_NUMBER()` to generate surrogate keys, fallback logic for missing values, and joins for dimension enrichment.
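The surrogate-key and enrichment pattern can be sketched like this. The real view lives in `scripts/gold/gold_ddl.sql`; the join keys and column names below are assumptions for illustration:

```sql
-- Illustrative sketch of a Gold dimension view: surrogate key via ROW_NUMBER(),
-- ERP enrichment via LEFT JOINs, and COALESCE as fallback for missing values.
-- Column and join-key names are assumed for illustration.
CREATE VIEW gold.dim_customers AS
SELECT
    ROW_NUMBER() OVER (ORDER BY ci.cst_id) AS customer_key,  -- surrogate key
    ci.cst_id                              AS customer_id,
    ci.cst_firstname                       AS first_name,
    ci.cst_lastname                        AS last_name,
    COALESCE(ca.gen, 'N/A')                AS gender,        -- fallback value
    COALESCE(la.cntry, 'N/A')              AS country
FROM silver.crm_cust_info ci
LEFT JOIN silver.erp_cust_az12 ca ON ci.cst_id = ca.cid      -- ERP enrichment
LEFT JOIN silver.erp_loc_a101  la ON ci.cst_id = la.cid;
```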
```
SQL-Data-Warehouse/
│
├── datasets/                   # Raw datasets used for the project (ERP and CRM data)
│
├── docs/                       # Project documentation and architecture details
│   ├── data_architecture.png   # Image file for the project's architecture
│   ├── data_catalog.md         # Catalog of datasets, including field descriptions and metadata
│   ├── data_flow.png           # Image file for the data flow diagram
│   ├── data_integration.png    # Image file for the data integration diagram
│   ├── data_model.png          # Image file for data model (star schema)
│   └── naming-conventions.md   # Consistent naming guidelines for tables, columns, and files
│
├── scripts/                    # SQL scripts for ETL and transformations
│   ├── bronze/                 # Scripts for extracting and loading raw data
│   ├── silver/                 # Scripts for cleaning and transforming data
│   ├── gold/                   # Scripts for creating analytical models
│   └── database_init.sql       # Script for creating database and schemas
│
├── tests/                      # Test scripts and quality files
│
├── README.md                   # Project overview and instructions
├── LICENSE                     # License information for the repository
└── .gitignore                  # Files and directories to be ignored by Git
```
1. Clone the Repository
   ```shell
   git clone https://github.com/gloryodeyemi/SQL-Data-Warehouse.git
   cd SQL-Data-Warehouse
   ```
2. Set Up SQL Server Environment
   - Install SQL Server Express and Azure Data Studio (if not already installed).
3. Initialize Database
   - Run `scripts/database_init.sql` to initialize the database and schemas.
4. Run Bronze Layer
   - Create Bronze tables using `scripts/bronze/bronze_ddl.sql`.
   - Load Bronze tables:
     ```sql
     EXEC bronze.load_bronze_proc;
     ```
5. Run Silver Layer
   - Create Silver tables using `scripts/silver/silver_ddl.sql`.
   - Load Silver tables:
     ```sql
     EXEC silver.load_silver_proc;
     ```
6. Run Gold Layer
   - Create the Gold views using `scripts/gold/gold_ddl.sql` to generate business-ready data for analytics and reporting.
7. Explore Data
   - Use the star schema in the Gold layer for analytical queries and reporting.
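As an example of the kind of analytical query the star schema supports (illustrative; the measure and dimension column names such as `sales_amount`, `country`, and `category` are assumptions):

```sql
-- Total sales and distinct customers by country and product category,
-- joining the fact table to both dimensions via their surrogate keys.
-- Column names are assumed for illustration.
SELECT
    c.country,
    p.category,
    SUM(f.sales_amount)            AS total_sales,
    COUNT(DISTINCT f.customer_key) AS customers
FROM gold.fact_sales f
JOIN gold.dim_customers c ON f.customer_key = c.customer_key
JOIN gold.dim_products  p ON f.product_key  = p.product_key
GROUP BY c.country, p.category
ORDER BY total_sales DESC;
```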
Data quality check scripts in the `tests/` folder ensure data consistency, accuracy, and standardization by checking for:
- Null or duplicate primary keys.
- Unwanted spaces in string fields.
- Data standardization and consistency.
- Invalid date ranges and orders.
- Data consistency between related fields.
- Uniqueness of surrogate keys in dimension tables.
- Referential integrity between fact and dimension tables.
- Validation of relationships in the data model for analytical purposes.
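Checks of this kind typically take the form of queries that return zero rows when the data is clean. A sketch in that spirit (not the project's exact test scripts; column names are assumptions):

```sql
-- Duplicate or null primary keys in the Silver customer table.
-- Should return no rows if the data is clean.
SELECT cst_id, COUNT(*) AS cnt
FROM silver.crm_cust_info
GROUP BY cst_id
HAVING COUNT(*) > 1 OR cst_id IS NULL;

-- Referential integrity: fact rows with no matching customer dimension row.
-- Should return no rows if every fact row joins to a dimension.
SELECT f.customer_key
FROM gold.fact_sales f
LEFT JOIN gold.dim_customers c ON f.customer_key = c.customer_key
WHERE c.customer_key IS NULL;
```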
This project lays the foundation for a robust and scalable data warehouse. Future enhancements could include:
- SQL-Based Analytics: Develop advanced SQL queries to extract business insights such as:
  - Customer segmentation
  - Sales trends
  - Product performance
  - Revenue by country
- Integration with BI Tools: Connect the Gold layer to Business Intelligence tools like Power BI, Tableau, or Metabase to create interactive dashboards and self-service analytics for stakeholders.
- Automation & Scheduling: Use SQL Server Agent or external orchestration tools (e.g., Airflow, Azure Data Factory) to automate ETL pipelines and data refreshes.
- Role-Based Access Control (RBAC): Enforce security policies and access levels depending on user roles (data analyst, data engineer, etc.).
- Data Export APIs: Build export mechanisms for downstream systems and data consumers.
This project is licensed under the MIT License. You are free to use, modify, and share this project with proper attribution.
Hi there! I'm Glory Odeyemi, a Data Engineer & Analyst!
Let's stay in touch! Feel free to connect with me on the following platforms:


