A comprehensive Big Data project demonstrating real-time ingestion, processing, and visualization of high-velocity sales data.
This project simulates a high-traffic e-commerce platform (specializing in TV sales) and builds an end-to-end streaming data pipeline. The goal was to master the architecture of modern real-time data systems, moving from raw event generation to actionable business intelligence dashboards in milliseconds.
The system generates realistic transaction data, streams it through Apache Kafka (Confluent Cloud), ingests it into Apache Druid running on AWS EC2 for sub-second queries, and visualizes KPIs on Grafana.
The pipeline consists of four main stages:
- Ingestion Source (Python): A custom script simulates user behavior and purchases based on a real-world dataset (`TV_DATASET_USA.csv`). It enriches each event with geolocation, payment status, and user-agent details before serializing it to JSON.
- Message Broker (Kafka): Data is streamed to a dedicated topic on Confluent Cloud, ensuring durability and decoupling the producer from consumers.
- Real-Time Analytics (Druid): An Apache Druid cluster deployed on an AWS EC2 instance consumes the Kafka stream and indexes data in real time for sub-second queries.
- Visualization (Grafana): A dashboard connects to Druid via SQL/JSON to display live metrics such as "Revenue per Second", "Top Selling Brands", and "Failed Transactions".
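The ingestion stage above can be sketched roughly as follows. This is a minimal illustration, not the actual producer script: the field names (`total_amount`, `payment_status`, etc.) are assumptions inferred from the dashboard metrics mentioned above.

```python
import json
import random
import uuid
from datetime import datetime, timezone

# Illustrative brand pool; the real script draws from TV_DATASET_USA.csv.
BRANDS = ["Samsung", "LG", "Sony", "TCL"]

def generate_event() -> dict:
    """Simulate one enriched TV purchase event (field names are hypothetical)."""
    return {
        "event_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "brand": random.choice(BRANDS),
        "total_amount": round(random.uniform(199.0, 2499.0), 2),
        # Enrichment fields described above: payment status, geo, user agent.
        "payment_status": random.choice(["SUCCESS", "FAILED"]),
        "geo": {"country": "US", "state": random.choice(["CA", "NY", "TX"])},
        "user_agent": "Mozilla/5.0 (simulated)",
    }

if __name__ == "__main__":
    # Each event is serialized to JSON before being sent to Kafka.
    print(json.dumps(generate_event(), indent=2))
```
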
- Python 3.8+
- AWS Account (for EC2)
- Confluent Cloud Account
- An Apache Druid instance running on your EC2 server
- A Grafana instance connected to Druid
- Clone the repo:

  ```bash
  git clone https://github.com/hachemboudoukha/bestbuy-streaming-pipeline.git
  cd bestbuy-streaming-pipeline
  ```

- Install Dependencies:

  ```bash
  pip install pandas confluent-kafka
  ```

- Configure Credentials: copy the example environment file:

  ```bash
  cp .env.example .env
  ```

  Then open `.env` and fill in your Confluent Cloud details (`BOOTSTRAP_SERVERS`, `SASL_USERNAME`, `SASL_PASSWORD`).

- Start the Stream:

  ```bash
  python scripts/kafka_producer.py
  ```

- Visualize: open your Grafana instance (connected to your Druid EC2 server) and watch the metrics update in real time!
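For reference, mapping the `.env` credentials onto a confluent-kafka producer looks roughly like this. It is a sketch of what `scripts/kafka_producer.py` might do, not the script itself; the topic name `ecommerce-topic-1` comes from the SQL example below, and the placeholder event is illustrative.

```python
import json
import os

def build_kafka_config(env: dict) -> dict:
    """Map .env credentials onto librdkafka settings for Confluent Cloud."""
    return {
        "bootstrap.servers": env["BOOTSTRAP_SERVERS"],
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "PLAIN",
        "sasl.username": env["SASL_USERNAME"],
        "sasl.password": env["SASL_PASSWORD"],
    }

def main() -> None:
    # Imported here so the config helper above works without the dependency.
    from confluent_kafka import Producer

    producer = Producer(build_kafka_config(dict(os.environ)))
    event = {"brand": "Samsung", "total_amount": 499.99}  # placeholder event
    producer.produce("ecommerce-topic-1", json.dumps(event).encode("utf-8"))
    producer.flush()  # block until the broker acknowledges delivery

# Call main() (with the .env values exported) to start producing.
```
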
The system supports powerful SQL queries on the stream:
```sql
-- Calculate real-time revenue by brand for the last hour
SELECT
  "brand",
  SUM("total_amount") AS "total_sales"
FROM "ecommerce-topic-1"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY "brand"
ORDER BY "total_sales" DESC
```
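Such queries can also be run programmatically against Druid's HTTP SQL endpoint (`/druid/v2/sql/`), which accepts a JSON body with a `query` field. A minimal sketch, assuming the Druid router listens on its default port 8888 and `<your-ec2-host>` stands in for your instance's address:

```python
import json
import urllib.request

# Assumed endpoint; replace <your-ec2-host> with your EC2 instance address.
DRUID_SQL_URL = "http://<your-ec2-host>:8888/druid/v2/sql/"

def build_sql_request(query: str, url: str = DRUID_SQL_URL) -> urllib.request.Request:
    """Package a Druid SQL query as a JSON POST request."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )

QUERY = """
SELECT "brand", SUM("total_amount") AS "total_sales"
FROM "ecommerce-topic-1"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY "brand"
ORDER BY "total_sales" DESC
"""

# To execute against a reachable Druid cluster:
# with urllib.request.urlopen(build_sql_request(QUERY)) as resp:
#     rows = json.loads(resp.read())  # list of row objects
```
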