Shop online for groceries and get your order delivered to your doorstep in minutes. Enjoy instant delivery with Blinkit. Blinkit primarily delivers groceries, fresh fruits, vegetables, meat, stationery, bakery items, personal care, baby care and pet care products, snacks, flowers, etc.
The data is available in 8 CSV files, including:
- customers.csv
- orderdetails.csv
- orders.csv
This project focused on addressing key business challenges for Blinkit, including revenue optimization, customer retention, and operational efficiency. The main goals were:
- Revenue Insights: Pinpoint top-performing products and cities generating the highest revenue.
- Customer Analysis: Understand customer behavior by ranking customers based on spending and identifying inactive customers.
- Order Trends: Uncover patterns in order statuses (delivered, canceled) and customer behavior, like placing multiple orders in a day.
- Operational Insights: Optimize service delivery and improve operational efficiency by analyzing location-specific purchase data.
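For instance, the customer-spending ranking above can be done with a window function. A minimal sketch, assuming the schema implied by the CSV files (a customers table with customer_id and customer_name, and an orders table with an order_total column; all names here are assumptions):

```sql
-- Rank customers by total spend (table and column names are assumed)
SELECT
    c.customer_id,
    c.customer_name,
    SUM(o.order_total) AS total_spent,
    RANK() OVER (ORDER BY SUM(o.order_total) DESC) AS spend_rank
FROM customers c
JOIN orders o
    ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.customer_name;
```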
IMDb (an initialism for Internet Movie Database) is an online database of information related to films, television series, podcasts, home videos, video games, and streaming content online – including cast, production crew and personal biographies, plot summaries, trivia, ratings, and fan and critical reviews.
The dataset used in this analysis can be found in this repository file: imdb_movies.csv.
This project analyzes the IMDb Movie Dataset to uncover trends, popular genres, and factors influencing movie success. Using Python and libraries like Pandas, NumPy, Matplotlib and Seaborn, the analysis delivers actionable insights through comprehensive data exploration and visualization.
The task description for this analysis can be found in this repository file: Task Covered.
- Project Setup and Data Loading
- Data Overview and Cleaning
- Univariate and Bivariate Analysis
- Genre-Specific Analysis
- Yearly and Decadal Trends
- Insights and Summary
- Documentation
This project analyzed the highest-paying IT jobs in India, focusing on job positions, locations, salaries, education, and experience levels. It provides insights into salary trends, correlations, and distributions, helping to understand the IT sector's job market and the factors influencing compensation.
The dataset used in this analysis can be found in this repository file: position_salary.csv
This project provides key insights into the IT job market in India, analyzing factors like job positions, salary trends, and the impact of education and experience on compensation. It helps professionals understand which roles offer the highest pay and the essential skills required for those positions. The analysis focuses on job positions, locations, salaries, educational qualifications, and experience levels.
- Project Setup and Data Loading
- Exploratory Data Analysis
- Data Cleaning
- Solved Basic Level Questions and Interpretation
- Solved Intermediate Level Questions and Interpretation
Microsoft Excel is a spreadsheet editor developed by Microsoft.
This section covers my understanding of Excel and its capabilities. (PS: I used data I created myself to practice the different Excel formulas.)
- Excel Formulas: SUM, SUMIF, SUMIFS, COUNT, COUNTIF, COUNTIFS, MAX, MIN, CONCATENATE, RIGHT, LEFT, LEN, NETWORKDAYS
- Conditional Formulas: IF, IFS, Conditional Formatting
- Date Formatting: Date-to-Text, Text-to-Date
- XLOOKUP & VLOOKUP
- Pivot Table
This project is a comprehensive SQL-based analysis of the mining industry, ideal for beginners and intermediate learners.
It centers on analyzing the mining industry using a real-world dataset from the World Mining Commodities dataset repository. The work involves exploring mining companies, their operational data, and country-level mining statistics to generate insights and develop SQL proficiency.
To analyze global mining trends, understand commodity production levels, and evaluate company performance using SQL.
The datasets used in this analysis can be found in these repository files: i. 116_world_mining_companies_clean.csv ii. world_mining_commodities_clean.csv
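As a taste of the queries involved, here is a minimal sketch for commodity production levels (the column names commodity and production_tonnes are assumptions about the cleaned dataset):

```sql
-- Top 10 commodities by total reported production (assumed columns)
SELECT TOP 10
    commodity,
    SUM(production_tonnes) AS total_production
FROM world_mining_commodities_clean
GROUP BY commodity
ORDER BY total_production DESC;
```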
I analyzed 30,000+ records from Vrinda Store, an online clothing retailer, to develop a data-driven annual sales report. This project presents insights from the analysis of Vrinda Store's 2022 data, integrating sales performance, customer demographics, and operational metrics.
The dataset used in this analysis can be found in this repository file: Vrinda Store Data Analysis Excel.xlsx
The objective is to create a comprehensive annual sales report to guide strategic decisions and enhance sales in 2023.
- Data Cleaning and Preprocessing: Removed duplicates, corrected errors, and structured sales data for analysis
- Data Analysis: Identified sales trends, top-performing products, and customer preferences
- Data Visualization & Reports: Created an interactive dashboard using Excel (pivot tables, charts, & advanced formulas)
- SQL: CTEs, GROUP BY, aggregate functions, and rank functions (see the sketch below)
- Insights & Documentation: Provided actionable insights for data-driven decision-making
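A minimal sketch of that SQL step, assuming a vrinda_sales table with state and amount columns (both assumptions):

```sql
-- Rank states by total sales with a CTE and a rank function (assumed schema)
WITH state_sales AS (
    SELECT state, SUM(amount) AS total_sales
    FROM vrinda_sales
    GROUP BY state
)
SELECT
    state,
    total_sales,
    RANK() OVER (ORDER BY total_sales DESC) AS sales_rank
FROM state_sales;
```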
- Adults are the largest consumer group, followed by the young and the elderly.
- Sales decline month over month, signaling a need for proactive strategies.
- Amazon tops the charts as the best-selling partner.
- Maharashtra emerges as the top-performing state with sales of ₹2,990,221.
- March recorded the highest revenue of ₹1,928,066 with 2,819 orders, indicating seasonal trends.
- Order Status: 28,641 orders delivered successfully, while only 844 were cancelled and 517 were refunded, showcasing high delivery efficiency
This project involves leveraging SQL queries and building a Power BI dashboard to analyze pizza sales data, aiming to uncover valuable insights that drive strategic decision-making and optimize business operations.
The dataset used in this analysis can be found in this repository file: pizza_sales.csv
Objective:
The objective of this project is to analyze pizza 🍕 sales data to identify trends and uncover key metrics and patterns in the sales data, providing actionable insights that can help increase sales, by leveraging SQL queries and building a Power BI dashboard.
- Total revenue: $817,860.05
- Most expensive pizza: The Greek Pizza
- Top Category: Classic
- Busiest Hour: 12:00
- Average Order Value: $17
- Highest Revenue Contribution: Classic Pizza
- Large pizzas are the most commonly ordered size, accounting for 18,526 orders.
- The Classic Deluxe Pizza is the most ordered type, totaling 2,453 orders.
- Classic pizza dominates total revenue with a contribution of 26.91%.
- The Thai Chicken Pizza emerges as a top revenue generator.
- Average pizzas sold per day stand at 138.
- The Thai Chicken Pizza ($43,434.25), Barbecue Chicken Pizza ($42,768) and California Chicken Pizza ($41,409.50) generate the highest revenue.
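The category revenue-share figure, for example, can be computed with a window function. A sketch assuming pizza_sales has pizza_category and total_price columns (assumed names):

```sql
-- Revenue contribution of each pizza category (assumed column names)
SELECT
    pizza_category,
    SUM(total_price) AS revenue,
    ROUND(SUM(total_price) * 100.0 / SUM(SUM(total_price)) OVER (), 2) AS pct_of_total
FROM pizza_sales
GROUP BY pizza_category
ORDER BY revenue DESC;
```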
The analysis explores customer churn patterns, focusing on various factors such as payment methods, contract types, tenure, and demographic attributes. The goal is to identify which factors are most strongly associated with higher churn rates to guide customer retention strategies.
The dataset used in this analysis can be found in this repository file: Customer Churn.csv
● Contract Type and Churn:
- Customers on month-to-month contracts exhibit the highest churn rate, with 42% of such customers likely to churn.
- In contrast, customers on one-year and two-year contracts have churn rates of 11% and 3%, respectively.
- Implication: Longer contract periods serve as a strong retention tool, as customers with extended commitments are far less likely to leave.
● Payment Methods and Churn:
- Customers paying via electronic checks show the highest churn rate at 45%, while those using credit cards, bank transfers, or mailed checks have significantly lower churn rates, averaging around 15-18%.
- Implication: The convenience, security, and trust issues related to electronic payments might be contributing factors. Encouraging customers to switch to more stable payment methods could reduce churn.
● Churn by Tenure:
- Customers with less than one year of tenure are the most likely to churn, with a 50% churn rate. Those with 1-3 years of tenure show a decreasing churn trend at 35%, while customers who have been with the company for more than three years have a churn rate of just 15%.
- Implication: Engaging customers early in their journey, especially within the first year, is critical for retention.
● Churn by Internet Service Type:
- Customers using Fiber Optic services show a higher churn rate of 30%, compared to DSL customers with a churn rate of 20%.
- Implication: This could be due to increased competition or dissatisfaction with service quality. Understanding customer satisfaction with service speed and reliability may help retain fiber optic users.
● Senior Citizens and Churn:
- The analysis reveals that senior citizens (aged 65+) have a churn rate of 41%, compared to a 26% churn rate among non-senior citizens.
- Implication: Special retention programs and targeted customer service for senior customers may help reduce churn in this demographic.
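Each of these segment rates reduces to a grouped aggregation. A minimal sketch for the contract-type breakdown, assuming the dataset is loaded into a customer_churn table with Contract and Churn ('Yes'/'No') columns (assumed names and values):

```sql
-- Churn rate by contract type (table and column names are assumptions)
SELECT
    Contract,
    COUNT(*) AS customers,
    ROUND(100.0 * SUM(CASE WHEN Churn = 'Yes' THEN 1 ELSE 0 END) / COUNT(*), 1) AS churn_rate_pct
FROM customer_churn
GROUP BY Contract
ORDER BY churn_rate_pct DESC;
```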
- Promote Long-Term Contracts: Offer incentives for customers to commit to longer contracts to reduce churn.
- Address Payment Method Concerns: Implement campaigns encouraging customers to switch from electronic checks to more reliable payment methods.
- Customer Engagement in Early Tenure: Focus on improving the customer experience within the first year, as churn is highest in this period.
- Special Senior Citizen Retention Programs: Create personalized offers or assistance programs to retain the senior demographic.
Using Excel, I designed a dynamic and interactive dashboard providing a deep dive into Blinkit's Grocery sales data. Here's what the dashboard highlights:
Excel Worksheet: BlinkIT Grocery Data Excel
Read Documentation: Blinkit Grocery Sales Analysis Documentation
- Focus on High-Selling Categories: Increase marketing efforts for Fruits and Vegetables and Snack Foods, which together contribute nearly 30% of total sales.
- Expand Regular Fat Content Offerings: As this category constitutes 64.6% of total sales, diversifying products within this segment can further boost revenue.
- Tier 1 Locations and High-Size Outlets: Invest in these high-performing areas and outlet types to maximize returns.
- Stabilize Seasonal Trends: Identify reasons behind sales fluctuations and implement strategies like discounts or promotional campaigns to sustain consistent revenue across all years.
- Customer Retention: Maintain the high average customer rating of 4.0 by focusing on quality and timely delivery, ensuring continued customer satisfaction.
This is an in-depth analysis of electric vehicle (EV) adoption across different dimensions, including EV types, vehicle make, model, and state-wise distribution. With 149,771 total EVs (2011–2023), the dataset highlights significant trends and insights about the growing EV market.
Download Dataset from kaggle: Dataset
- Battery Electric Vehicles (BEVs) account for 116,474 units, making up 77.8% of the total EVs, whereas Plug-in Hybrid Electric Vehicles (PHEVs) contribute 33,297 units, or 22.2% of the total.
- EV adoption has grown significantly since 2011. Total vehicle registrations rose from 0.8K in 2011 to 37.1K in 2023, representing a 4,538% increase.
- Tesla dominates the market with 68,939 vehicles (57.6%) of the total EVs.
- 41.87% (62,711 vehicles) are CAFV eligible, and 11.77% (17,634 vehicles) are not CAFV eligible, indicating room for improvement in meeting CAFV standards.
- Diversify Model Offerings:
- Encourage other manufacturers to innovate and compete with Tesla to diversify market options.
- Promote more affordable models for increased accessibility.
- Range Optimization:
- Invest in R&D to improve EV range and address consumer concerns about distance limitations.
- Highlight vehicles with exceptional range to attract new customers.
I used data wrangling to convert raw data from its initial format to a format better suited for analysis and future model development.
- Handling missing values
- Correct data formatting
- Normalize data
- Identify missing values
- Deal with missing values
- Correct data format
- Data Normalization (centering/scaling)
- Binning
- Indicator Variable
Dataset: Download Dataset here
In this task, I developed several models that predict the price of a car from its variables or features. A prediction is only an estimate, but it gives an objective idea of how much a car should cost, and a model helps us understand the exact relationship between the different variables and how they are used to predict the result.
- Develop prediction models
- Linear Regression and Multiple Linear Regression
- Model Evaluation using Visualization
- Polynomial Regression
- Pipeline
- Measures for In-Sample Evaluation
- Prediction and Decision Making
I delved deep into the fascinating world of festival sales to uncover valuable insights, from analyzing gender and occupation preferences to exploring the impact of marital status and age group on product categories, building a complete picture of Diwali shopping trends.
The first part of this project is Excel: I used Excel to clean my dataset and make it ready for visualization. Through this visualization, one can easily see which products are the heroes, and which age groups, occupations, and customer locations drive purchases.
The second part of this project uses SQL: I revisited my SQL knowledge to clean the dataset and make it ready for analysis. Objective: to clean and prepare the raw Diwali sales dataset for analysis by addressing missing values, duplicates, and other inconsistencies.
- Target Marketing Campaigns:
- Focus on female customers and the 18–35 age group.
- Offer exclusive promotions for IT and healthcare professionals.
- Geographical Expansion:
- Strengthen presence in top-performing states while improving reach in underperforming regions.
- Product Strategy:
- Stock up on high-demand products like food and clothing during festive seasons.
- Promote low-performing categories like Tupperware and stationery through bundled offers.
- Loyalty Programs:
- Develop loyalty initiatives for top customers to ensure retention and increased lifetime value.
Using a variety of machine learning models such as Random Forest, K-Nearest Neighbors, Decision Tree and others, I analyzed the dataset and identified the Decision Tree model as the best-performing algorithm with an impressive accuracy of 88.9%.
Key Highlights:
- Data Analysis: Segmented the dataset into categorical and continuous variables for targeted analysis.
- Visualization Insights: Leveraged donut charts to analyze categorical data and highlight trends.
- Model Evaluation: Tested multiple models and compared their performance to select the most accurate predictor.
- Technical Stack: Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-Learn
The goal of analyzing transaction data is to get important insights regarding customer behavior, product popularity, sales patterns, and operational efficiencies. The purpose is to optimize inventory management, improve decision-making processes, and find possible cross-selling opportunities.
- Data collection, cleaning, and preparation
- Analyze monthly, daily, and hourly sales patterns with PivotTables and Power Pivot (see the SQL sketch after the tools list).
- Determine high-performing days and times.
- Develop compelling reports and visualization
- Create an Interactive dashboard
- Excel
- Power BI
- MS SQL
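A minimal sketch of the hourly-pattern query in MS SQL, assuming a transactions table with a transaction_time column (table and column names are assumptions):

```sql
-- Transaction volume per hour of day, to surface the morning rush (assumed schema)
SELECT
    DATEPART(hour, transaction_time) AS hour_of_day,
    COUNT(*) AS transactions
FROM transactions
GROUP BY DATEPART(hour, transaction_time)
ORDER BY hour_of_day;
```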
- Top Sellers: Coffee leads with 25% of total sales, followed by Bakery and Tea.
- Store Performance: Our Astoria and Hell’s Kitchen locations are top performers, each contributing over 34% of total sales.
- Monthly Trends: Sales show steady month-over-month growth.
- Customer Habits: Peak sales times are between 8 AM and 11 AM on weekdays, aligning with the morning coffee rush.
- Product Trends: Coffee beans and branded products saw the highest month-on-month growth, while sustainably grown coffee remains our top-seller.
Peak transaction times, high-performing days, and areas for improvement are identified. By leveraging these findings, you can enhance customer experiences, refine inventory management, and boost overall sales efficiency for a thriving coffee shop venture.
This dashboard provides a comprehensive analysis of the bank’s loan performance, highlighting key metrics such as total loan applications, loan amounts, amount received, interest rates, and debt-to-income ratio (DTI).
- Total Loan Applications: 38.6K, with 4.3K applications received MTD (Month-to-Date), reflecting 6.9% growth.
- Total Loan Amount Funded: $435.8M, with $54.0M funded MTD, indicating a 13.0% increase.
- Total Amount Received: $473.1M, with $58.1M received MTD, showing a 15.8% increase.
- Average Interest Rate: 12.05%, slightly lower than the MTD rate of 12.36%.
- Average Debt-to-Income Ratio (DTI): 13.33%, with an MTD figure of 13.67% (-2.7% MoM).
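MTD figures like these are typically computed with a date filter on the issue date. A hedged sketch, assuming a bank_loan_data table with issue_date and loan_amount columns (names are assumptions) and December 2021 as the current month:

```sql
-- Month-to-date funded amount (assumed schema; month hard-coded for illustration)
SELECT SUM(loan_amount) AS mtd_funded_amount
FROM bank_loan_data
WHERE YEAR(issue_date) = 2021
  AND MONTH(issue_date) = 12;
```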
- Good Loan Issued: Strong Performance
- 85.9% of total loans are classified as good loans, demonstrating high-quality lending practices.
- 33.2K good loan applications resulted in a total funded amount of $370.2M, with $435.8M successfully received.
- Key Strengths:
- Strong repayment trends.
- Efficient fund disbursement and collection.
- High proportion of fully paid loans.
- Bad Loan Issued: Areas of Concern
- 14.1% of loans are bad loans, indicating some degree of credit risk.
- 5.3K applications resulted in a funded amount of $65.5M, but only $37.3M has been recovered, leading to potential losses.
- Key Concerns:
- Higher non-performing loan (NPL) ratio in certain segments.
- Potential need for stricter credit risk assessments.
- Reduce Bad Loan Issuance (14.1%)
- Implement stricter credit checks and borrower risk assessments.
- Strengthen loan approval processes to minimize high-risk lending.
- Offer financial literacy programs to borrowers.
- Improve Loan Recovery & Reduce Charge-Offs
- Enhance debt collection strategies.
- Introduce flexible repayment options for struggling borrowers.
- Use predictive analytics to detect potential defaults early.
- Optimize Interest Rates & DTI Ratio
- Adjust interest rates based on borrower credit scores and risk factors.
- Maintain DTI ratio below 13% to ensure borrower affordability.
- Enhance Loan Portfolio Diversification
- Expand lending to low-risk segments with stable income sources.
- Diversify loan products to cater to different borrower profiles.
- Increase Efficiency in Loan Disbursement & Collection
- Automate loan approval and repayment tracking for faster processing.
- Leverage AI-based fraud detection to prevent high-risk lending.
I've developed an interactive Tableau & Power BI dashboard that focuses on employee attrition trends. This dashboard provides a deep dive into attrition data, enabling HR teams to identify key factors contributing to turnover and take proactive steps to improve retention strategies.
Tableau:
This dashboard provides a clear and concise view for HR teams, aiding in decision-making processes related to promotions, retrenchments, and workforce distribution.
This project involves analyzing a Spotify dataset with various attributes about albums, tracks, streams, views, likes, artists, and other components using SQL. It covers the end-to-end process of normalizing a denormalized dataset, performing SQL queries of varying complexity (easy, medium, and advanced), and optimizing query performance. The primary goal of the project is to practice advanced SQL skills and generate valuable insights from the dataset.
- Retrieve the names of all tracks that have more than 1 billion streams.
- List all albums along with their respective artists.
- Get the total number of comments for tracks where licensed = TRUE.
- Find all tracks that belong to the album type single.
- Count the total number of tracks by each artist.
- Calculate the average danceability of tracks in each album.
- Find the top 5 tracks with the highest energy values.
- List all tracks along with their views and likes where official_video = TRUE.
- For each album, calculate the total views of all associated tracks.
- Retrieve the track names that have been streamed on Spotify more than YouTube.
- Find the top 3 most-viewed tracks for each artist using window functions.
- Write a query to find tracks where the liveness score is above the average.
- Use a WITH clause to calculate the difference between the highest and lowest energy values for tracks in each album.
- Find tracks where the energy-to-liveness ratio is greater than 1.2.
- Calculate the cumulative sum of likes for tracks ordered by the number of views, using window functions.
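As an illustration, the top-3-most-viewed-tracks question can be solved with DENSE_RANK(); the table name spotify and the columns artist, track, and views are assumptions about the normalized dataset:

```sql
-- Top 3 most-viewed tracks per artist using a window function (assumed schema)
WITH ranked_tracks AS (
    SELECT
        artist,
        track,
        SUM(views) AS total_views,
        DENSE_RANK() OVER (PARTITION BY artist ORDER BY SUM(views) DESC) AS view_rank
    FROM spotify
    GROUP BY artist, track
)
SELECT artist, track, total_views
FROM ranked_tracks
WHERE view_rank <= 3;
```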
- Data Extraction: Retrieved customer and transaction data from an SQL database.
- Data Preparation: Imported the cleaned data into Power BI for visualization.
- Data Visualization: Created interactive dashboards in Power BI to visualize key metrics and trends.
- Revenue increased by 28.8%
- Total Transaction Amount & Count rose by 35% & 12.8%
- Customer count grew by 12.8%
- Overall revenue: $56.5M
- Total interest: $8M
- Total transaction amount: $45.5M
- Male customers contributed $30.9M in revenue, while female customers contributed $25.6M
- Blue & Silver credit cards accounted for 93% of overall transactions
- TX, NY, & CA contributed 68% to the total transactions
- Overall Activation rate: 57.5%
- Overall Delinquent rate: 6.06%
- Quarterly Trends: Stable revenue across quarters, with a slight decrease in Q3 and Q4.
- Expenditure Types: Highest revenue from bills, entertainment, fuel, groceries, and travel.
- Demographics: Higher revenue from graduates, high-school-educated customers, businessmen, and white-collar workers.
- Card Types: Blue cardholders contribute the most revenue.
- Transaction Methods: Swipe transactions generate the most revenue.
- Customer Job: Businessmen contribute the most revenue, followed by white-collar workers.
- Developed an interactive dashboard using transaction and customer data from a SQL database, to provide real-time insights.
- Streamlined data processing & analysis to monitor key performance metrics and trends.
- Shared actionable insights with stakeholders based on dashboard findings to support decision-making process.
The Overview section should provide a snapshot of the overall HR metrics, including:
- Display the total number of hired employees, active employees, and terminated employees.
- Visualize the total number of hired and terminated employees over the years.
- Present a breakdown of total employees by department and job titles.
- Compare total employees between headquarters (HQ) and branches (New York is the HQ).
- Show the distribution of employees by city and state.
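A hedged sketch of the headcount metric, assuming an hr table where terminated employees carry a non-NULL termdate (table and column names are assumptions):

```sql
-- Hired vs. active vs. terminated employees (assumed schema)
SELECT
    COUNT(*) AS total_hired,
    SUM(CASE WHEN termdate IS NULL THEN 1 ELSE 0 END) AS active_employees,
    SUM(CASE WHEN termdate IS NOT NULL THEN 1 ELSE 0 END) AS terminated_employees
FROM hr;
```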
The Demographics section should offer insights into the composition of the workforce, including:
- Present the gender ratio in the company.
- Visualize the distribution of employees across age groups and education levels.
- Show the total number of employees within each age group.
- Show the total number of employees within each education level.
- Present the correlation between employees' educational backgrounds and their performance ratings.
The income analysis section should focus on salary-related metrics, including:
- Compare salaries across different education levels for both genders to identify any discrepancies or patterns.
- Present how age correlates with salary for employees in each department.
A comprehensive collection of SQL scripts for data exploration, analytics, and reporting. These scripts cover various analyses such as database exploration, measures and metrics, time-based trends, cumulative analytics, segmentation, and more.
This repository contains advanced SQL queries (CTEs, window functions, subqueries, and reporting scripts) designed to help data analysts and BI professionals quickly explore, segment, and analyze data within a relational database. Each script focuses on a specific analytical theme and demonstrates best practices for SQL queries.
Cumulative Analysis
Purpose:
- To calculate running totals or moving averages for key metrics.
- To track performance over time cumulatively.
- Useful for growth analysis or identifying long-term trends.
SQL Functions Used:
- Window Functions: SUM() OVER(), AVG() OVER()
===============================================================================
-- Calculate the total sales per month and the running total of sales over time
SELECT
order_date,
total_sales,
SUM(total_sales) OVER (ORDER BY order_date) AS running_total_sales,
AVG(avg_price) OVER (ORDER BY order_date) AS moving_average_price
FROM
(
SELECT
DATETRUNC(month, order_date) AS order_date,
SUM(sales_amount) AS total_sales,
AVG(price) AS avg_price
FROM DataWarehouseAnalytic..[gold.fact_sales]
WHERE order_date IS NOT NULL
GROUP BY DATETRUNC(month, order_date)
) AS monthly_sales;
Performance Analysis (Year-over-Year, Month-over-Month)
Purpose:
- To measure the performance of products, customers, or regions over time.
- For benchmarking and identifying high-performing entities.
- To track yearly trends and growth.
SQL Functions Used:
- LAG(): Accesses data from previous rows.
- AVG() OVER(): Computes average values within partitions.
- CASE: Defines conditional logic for trend analysis.
===============================================================================
/* Analyze the yearly performance of products by comparing their sales
to both the average sales performance of the product and the previous year's sales */
WITH yearly_product_sales AS (
SELECT
YEAR(f.order_date) AS order_year,
p.product_name,
SUM(f.sales_amount) AS current_sales
FROM DataWarehouseAnalytic..[gold.fact_sales] f
LEFT JOIN DataWarehouseAnalytic..[gold.dim_products] p
ON f.product_key = p.product_key
WHERE f.order_date IS NOT NULL
GROUP BY
YEAR(f.order_date),
p.product_name
)
SELECT
order_year,
product_name,
current_sales,
AVG(current_sales) OVER (PARTITION BY product_name) AS avg_sales,
current_sales - AVG(current_sales) OVER (PARTITION BY product_name) AS diff_avg,
CASE
WHEN current_sales - AVG(current_sales) OVER (PARTITION BY product_name) > 0 THEN 'Above Avg'
WHEN current_sales - AVG(current_sales) OVER (PARTITION BY product_name) < 0 THEN 'Below Avg'
ELSE 'Avg'
END AS avg_change,
-- Year-over-Year Analysis
LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) AS py_sales,
current_sales - LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) AS diff_py,
CASE
WHEN current_sales - LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) > 0 THEN 'Increase'
WHEN current_sales - LAG(current_sales) OVER (PARTITION BY product_name ORDER BY order_year) < 0 THEN 'Decrease'
ELSE 'No Change'
END AS py_change
FROM yearly_product_sales
ORDER BY product_name, order_year;
Data Segmentation Analysis
Purpose:
- To group data into meaningful categories for targeted insights.
- For customer segmentation, product categorization, or regional analysis.
SQL Functions Used:
- CASE: Defines custom segmentation logic.
- GROUP BY: Groups data into segments.
===============================================================================
/*Segment products into cost ranges and count how many products fall into each segment*/
WITH product_segments AS (
SELECT
product_key,
product_name,
cost,
CASE
WHEN cost < 100 THEN 'Below 100'
WHEN cost BETWEEN 100 AND 500 THEN '100-500'
WHEN cost BETWEEN 500 AND 1000 THEN '500-1000'
ELSE 'Above 1000'
END AS cost_range
FROM DataWarehouseAnalytic..[gold.dim_products]
)
SELECT
cost_range,
COUNT(product_key) AS total_products
FROM product_segments
GROUP BY cost_range
ORDER BY total_products DESC;
/*Group customers into three segments based on their spending behavior:
- VIP: Customers with at least 12 months of history and spending more than €5,000.
- Regular: Customers with at least 12 months of history but spending €5,000 or less.
- New: Customers with a lifespan less than 12 months.
And find the total number of customers by each group
*/
WITH customer_spending AS (
SELECT
c.customer_key,
SUM(f.sales_amount) AS total_spending,
MIN(order_date) AS first_order,
MAX(order_date) AS last_order,
DATEDIFF(month, MIN(order_date), MAX(order_date)) AS lifespan
FROM DataWarehouseAnalytic..[gold.fact_sales] f
LEFT JOIN DataWarehouseAnalytic..[gold.dim_customers] c
ON f.customer_key = c.customer_key
GROUP BY c.customer_key
)
SELECT
customer_segment,
COUNT(customer_key) AS total_customers
FROM (
SELECT
customer_key,
CASE
WHEN lifespan >= 12 AND total_spending > 5000 THEN 'VIP'
WHEN lifespan >= 12 AND total_spending <= 5000 THEN 'Regular'
ELSE 'New'
END AS customer_segment
FROM customer_spending
) AS segmented_customers
GROUP BY customer_segment
ORDER BY total_customers DESC;
Part-to-Whole Analysis
/*
Purpose:
- To compare performance or metrics across dimensions or time periods.
- To evaluate differences between categories.
- Useful for A/B testing or regional comparisons.
SQL Functions Used:
- SUM(), AVG(): Aggregates values for comparison.
- Window Functions: SUM() OVER() for total calculations.
===============================================================================
*/
-- Which categories contribute the most to overall sales?
WITH category_sales AS (
SELECT
p.category,
SUM(f.sales_amount) AS total_sales
FROM DataWarehouseAnalytic..[gold.fact_sales] f
LEFT JOIN DataWarehouseAnalytic..[gold.dim_products] p
ON p.product_key = f.product_key
GROUP BY p.category
)
SELECT
category,
total_sales,
SUM(total_sales) OVER () AS overall_sales,
ROUND((CAST(total_sales AS FLOAT) / SUM(total_sales) OVER ()) * 100, 2) AS percentage_of_total
FROM category_sales
ORDER BY total_sales DESC;
Customer Report
Purpose:
- This report consolidates key customer metrics and behaviors
Highlights:
1. Gathers essential fields such as names, ages, and transaction details.
2. Segments customers into categories (VIP, Regular, New) and age groups.
3. Aggregates customer-level metrics:
- total orders
- total sales
- total quantity purchased
- total products
- lifespan (in months)
4. Calculates valuable KPIs:
- recency (months since last order)
- average order value
- average monthly spend
===============================================================================
CREATE VIEW gold.report_customers AS
WITH base_query AS(
/*---------------------------------------------------------------------------
1) Base Query: Retrieves core columns from tables
---------------------------------------------------------------------------*/
SELECT
f.order_number,
f.product_key,
f.order_date,
f.sales_amount,
f.quantity,
c.customer_key,
c.customer_number,
CONCAT(c.first_name, ' ', c.last_name) AS customer_name,
DATEDIFF(year, c.birthdate, GETDATE()) AS age
FROM DataWarehouseAnalytic..[gold.fact_sales] f
LEFT JOIN DataWarehouseAnalytic..[gold.dim_customers] c
ON c.customer_key = f.customer_key
WHERE order_date IS NOT NULL)
, customer_aggregation AS (
/*---------------------------------------------------------------------------
2) Customer Aggregations: Summarizes key metrics at the customer level
---------------------------------------------------------------------------*/
SELECT
customer_key,
customer_number,
customer_name,
age,
COUNT(DISTINCT order_number) AS total_orders,
SUM(sales_amount) AS total_sales,
SUM(quantity) AS total_quantity,
COUNT(DISTINCT product_key) AS total_products,
MAX(order_date) AS last_order_date,
DATEDIFF(month, MIN(order_date), MAX(order_date)) AS lifespan
FROM base_query
GROUP BY
customer_key,
customer_number,
customer_name,
age
)
SELECT
customer_key,
customer_number,
customer_name,
age,
CASE
WHEN age < 20 THEN 'Under 20'
WHEN age between 20 and 29 THEN '20-29'
WHEN age between 30 and 39 THEN '30-39'
WHEN age between 40 and 49 THEN '40-49'
ELSE '50 and above'
END AS age_group,
CASE
WHEN lifespan >= 12 AND total_sales > 5000 THEN 'VIP'
WHEN lifespan >= 12 AND total_sales <= 5000 THEN 'Regular'
ELSE 'New'
END AS customer_segment,
last_order_date,
DATEDIFF(month, last_order_date, GETDATE()) AS recency,
total_orders,
total_sales,
total_quantity,
total_products,
lifespan,
-- Compute average order value (AOV)
CASE WHEN total_orders = 0 THEN 0
ELSE total_sales / total_orders
END AS avg_order_value,
-- Compute average monthly spend
CASE WHEN lifespan = 0 THEN total_sales
ELSE total_sales / lifespan
END AS avg_monthly_spend
FROM customer_aggregation