A comprehensive Big Data project demonstrating real-time ingestion, processing, and visualization of high-velocity sales data.
This project simulates a high-traffic e-commerce platform (specializing in TV sales) and builds an end-to-end streaming data pipeline. The goal was to master the architecture of modern real-time data systems, moving from raw event generation to actionable business intelligence dashboards in milliseconds.
The system generates realistic transaction data, streams it through Apache Kafka (Confluent Cloud), ingests it into Apache Druid running on AWS EC2 for sub-second queries, and visualizes KPIs on Grafana.
The pipeline consists of four main stages:
- Ingestion Source (Python): A custom script simulates user behavior and purchases based on a real-world dataset (`TV_DATASET_USA.csv`). It enriches each event with geolocation, payment status, and user-agent details before serializing it to JSON.
- Message Broker (Kafka): Data is streamed to a dedicated topic on Confluent Cloud, ensuring durability and decoupling the producer from consumers.
- Real-Time Analytics (Druid): An Apache Druid cluster deployed on an AWS EC2 instance consumes the Kafka stream and indexes data in real time for sub-second queries.
- Visualization (Grafana): A dashboard connects to Druid via SQL/JSON to display live metrics such as "Revenue per Second", "Top Selling Brands", and "Failed Transactions".
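The ingestion stage above can be sketched roughly as follows. This is a minimal illustration, not the actual producer script: the field names (`total_amount`, `payment_status`, etc.) are assumptions inferred from the dashboard metrics mentioned above.

```python
import json
import random
import uuid
from datetime import datetime, timezone

# Illustrative brand pool; the real script draws from TV_DATASET_USA.csv.
BRANDS = ["Samsung", "LG", "Sony", "TCL"]

def generate_event() -> dict:
    """Simulate one enriched TV purchase event (field names are hypothetical)."""
    return {
        "event_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "brand": random.choice(BRANDS),
        "total_amount": round(random.uniform(199.0, 2499.0), 2),
        # Enrichment fields described above: payment status, geo, user agent.
        "payment_status": random.choice(["SUCCESS", "FAILED"]),
        "geo": {"country": "US", "state": random.choice(["CA", "NY", "TX"])},
        "user_agent": "Mozilla/5.0 (simulated)",
    }

if __name__ == "__main__":
    # Each event is serialized to JSON before being sent to Kafka.
    print(json.dumps(generate_event(), indent=2))
```
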
- Python 3.8+
- AWS Account (for EC2)
- Confluent Cloud Account
- An Apache Druid instance running on your EC2 server
- A Grafana instance connected to Druid
- Clone the repo:

  ```bash
  git clone https://github.com/hachemboudoukha/bestbuy-streaming-pipeline.git
  cd bestbuy-streaming-pipeline
  ```

- Install Dependencies:

  ```bash
  pip install pandas confluent-kafka
  ```

- Configure Credentials: copy the example environment file:

  ```bash
  cp .env.example .env
  ```

  Then open `.env` and fill in your Confluent Cloud details (`BOOTSTRAP_SERVERS`, `SASL_USERNAME`, `SASL_PASSWORD`).

- Start the Stream:

  ```bash
  python scripts/kafka_producer.py
  ```

- Visualize: open your Grafana instance (connected to your Druid EC2 server) and watch the metrics update in real time!
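For reference, mapping the `.env` credentials onto a confluent-kafka producer looks roughly like this. It is a sketch of what `scripts/kafka_producer.py` might do, not the script itself; the topic name `ecommerce-topic-1` comes from the SQL example below, and the placeholder event is illustrative.

```python
import json
import os

def build_kafka_config(env: dict) -> dict:
    """Map .env credentials onto librdkafka settings for Confluent Cloud."""
    return {
        "bootstrap.servers": env["BOOTSTRAP_SERVERS"],
        "security.protocol": "SASL_SSL",
        "sasl.mechanisms": "PLAIN",
        "sasl.username": env["SASL_USERNAME"],
        "sasl.password": env["SASL_PASSWORD"],
    }

def main() -> None:
    # Imported here so the config helper above works without the dependency.
    from confluent_kafka import Producer

    producer = Producer(build_kafka_config(dict(os.environ)))
    event = {"brand": "Samsung", "total_amount": 499.99}  # placeholder event
    producer.produce("ecommerce-topic-1", json.dumps(event).encode("utf-8"))
    producer.flush()  # block until the broker acknowledges delivery

# Call main() (with the .env values exported) to start producing.
```
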
The system supports powerful SQL queries on the stream:
```sql
-- Calculate real-time revenue by brand for the last hour
SELECT
  "brand",
  SUM("total_amount") AS "total_sales"
FROM "ecommerce-topic-1"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY "brand"
ORDER BY "total_sales" DESC
```
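Such queries can also be run programmatically against Druid's HTTP SQL endpoint (`/druid/v2/sql/`), which accepts a JSON body with a `query` field. A minimal sketch, assuming the Druid router listens on its default port 8888 and `<your-ec2-host>` stands in for your instance's address:

```python
import json
import urllib.request

# Assumed endpoint; replace <your-ec2-host> with your EC2 instance address.
DRUID_SQL_URL = "http://<your-ec2-host>:8888/druid/v2/sql/"

def build_sql_request(query: str, url: str = DRUID_SQL_URL) -> urllib.request.Request:
    """Package a Druid SQL query as a JSON POST request."""
    payload = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )

QUERY = """
SELECT "brand", SUM("total_amount") AS "total_sales"
FROM "ecommerce-topic-1"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY "brand"
ORDER BY "total_sales" DESC
"""

# To execute against a reachable Druid cluster:
# with urllib.request.urlopen(build_sql_request(QUERY)) as resp:
#     rows = json.loads(resp.read())  # list of row objects
```
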