
Teen Phone Addiction - Spark Streaming Pipeline

Built with Apache Spark, Kafka, Scala, PostgreSQL, and Docker.

A real-time data pipeline analyzing teen phone usage patterns and addiction risks. This project ingests CSV data, streams it through Kafka, processes and enriches it with Apache Spark Structured Streaming, and stores the resulting insights in PostgreSQL.

Architecture

The pipeline consists of the following components orchestrated via Docker Compose:

  1. Zookeeper: Manages the Kafka cluster state.
  2. Kafka: Serves as the message broker for real-time data.
  3. Spark Producer: Reads raw CSV data and publishes messages to Kafka.
  4. Spark Consumer: Consumes the Kafka stream, processes and enriches it with Spark Structured Streaming, and writes the insights to PostgreSQL (see the sketch after this list).
  5. PostgreSQL: Persistent storage for the processed analytics data.
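
The consumer is the core of the pipeline, so a minimal sketch of its shape is worth showing. This is illustrative only: the topic name (teen-phone-data), broker address (kafka:9092), and message schema are assumptions, while the table name and credentials follow the values used elsewhere in this README.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    object ConsumerSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("teen-phone-consumer")
          .getOrCreate()

        // Assumed message schema -- the real one mirrors the CSV columns in data/.
        val schema = new StructType()
          .add("age", IntegerType)
          .add("gender", StringType)
          .add("daily_usage_hours", DoubleType)
          .add("sleep_hours", DoubleType)

        // Subscribe to the topic the producer publishes to (names assumed).
        val raw = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "teen-phone-data")
          .load()

        // Kafka delivers bytes; decode the value and parse the JSON payload.
        val parsed = raw
          .selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).as("record"))
          .select("record.*")

        // Structured Streaming has no built-in JDBC sink, so each micro-batch
        // is appended to PostgreSQL through foreachBatch.
        val query = parsed.writeStream
          .foreachBatch { (batch: DataFrame, _: Long) =>
            batch.write
              .format("jdbc")
              .option("url", "jdbc:postgresql://postgres:5432/teen_addiction_db")
              .option("dbtable", "teen_phone_data")
              .option("user", "postgres")
              .option("password", "postgrespw")
              .mode("append")
              .save()
          }
          .start()

        query.awaitTermination()
      }
    }

The foreachBatch pattern is what makes the JDBC write possible: Spark ships no streaming JDBC sink, so each micro-batch is flushed as a regular batch write.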

Getting Started

Prerequisites

  • Docker and Docker Compose (all services run in containers)
  • Git, to clone the repository

Installation & Run

  1. Clone the repository:

    git clone <repository-url>
    cd spark-streaming-end-to-end
  2. Start the services: This command will build the Spark applications and start all containers in the background.

    docker-compose up --build -d
  3. Check the logs: Monitor the producer and consumer to see the pipeline in action.

    # View Producer logs (sending data)
    docker-compose logs -f spark-producer
    
    # View Consumer logs (processing and saving data)
    docker-compose logs -f spark-consumer
  4. Verify data in PostgreSQL: Connect to the Postgres database to query the results.

    docker exec -it postgres psql -U postgres -d teen_addiction_db
    
    # Inside psql shell:
    SELECT * FROM teen_phone_data LIMIT 10;
  5. Stop the application:

    docker-compose down

Project Structure

.
├── config/              # (Deprecated/Internal) Config files
├── data/
│   └── *.csv           # Source dataset
├── sql/
│   └── schema.sql      # Database initialization script
├── src/main/scala/
│   ├── producer/       # Producer application code
│   └── consumer/       # Consumer application code
├── build.sbt           # Scala Build Tool configuration
├── Dockerfile          # Multi-stage Docker build for Spark apps
└── docker-compose.yaml # Orchestration of all services
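
The build.sbt declares the Spark, Kafka connector, and PostgreSQL JDBC dependencies. A minimal sketch of its shape (version numbers here are illustrative, not the project's pinned ones):

    // Sketch of build.sbt -- versions are illustrative.
    name := "spark-streaming-end-to-end"
    version := "0.1.0"
    scalaVersion := "2.12.18" // Spark 3.x is published for Scala 2.12/2.13

    val sparkVersion = "3.5.1"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"            % sparkVersion % Provided,
      "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion,
      "org.postgresql"    % "postgresql"           % "42.7.3" // JDBC driver for the sink
    )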

Data & Processing

The pipeline computes a Risk Score and a Health Category (Low/Moderate/High) for each record, based on the following factors (an illustrative sketch follows the list):

  • Daily usage hours
  • Sleep hours (calculating sleep deficit)
  • Physical exercise
  • Bedtime screen usage
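
The exact weights and cut-offs live in the consumer code; the sketch below only illustrates how such a scoring step looks with Spark SQL functions. Column names, weights, and thresholds are assumptions, not the project's real values.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Illustrative scoring step -- weights, thresholds, and column names are
    // assumptions, not the values used by the actual consumer.
    def withRiskScore(df: DataFrame): DataFrame =
      df.withColumn("sleep_deficit", greatest(lit(8.0) - col("sleep_hours"), lit(0.0)))
        .withColumn("risk_score",
          col("daily_usage_hours") * 1.0 +
          col("sleep_deficit") * 1.5 +
          col("bedtime_screen_hours") * 2.0 -
          col("exercise_hours") * 0.5)
        .withColumn("health_category",
          when(col("risk_score") < 5, "Low")
            .when(col("risk_score") < 10, "Moderate")
            .otherwise("High"))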

Visualization (Power BI)

To visualize the real-time insights processed by the pipeline, you can connect Power BI to the PostgreSQL database.

1. Connection Settings

Connect Power BI Desktop to PostgreSQL using the following credentials:

  • Server: localhost (if running locally)
  • Port: 5432
  • Database: teen_addiction_db
  • Authentication: Database
  • User: postgres
  • Password: postgrespw (or check your .env)

2. Recommended Data Source

We have prepared a Materialized View optimized for visualization:

  • Table/View: teen_addiction_summary
  • Connectivity Mode:
    • DirectQuery: For real-time updates (recommended); note the refresh caveat and sketch below.
    • Import: For better performance with static snapshots.
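
A PostgreSQL materialized view is a stored snapshot, so DirectQuery only sees rows present at the last refresh. A minimal sketch of one way to keep it fresh, assuming the consumer calls a helper like this after each micro-batch (the helper and its wiring are hypothetical):

    import java.sql.DriverManager

    // Hypothetical helper: refresh the summary view so DirectQuery dashboards
    // see the latest micro-batch (uses the in-network host name "postgres").
    def refreshSummaryView(): Unit = {
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://postgres:5432/teen_addiction_db", "postgres", "postgrespw")
      try conn.createStatement().execute("REFRESH MATERIALIZED VIEW teen_addiction_summary")
      finally conn.close()
    }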

3. Key Metrics & Visuals

  • Addiction Trends: Line chart using time_bucket (X-axis) and avg_risk_score (Y-axis).
  • Demographics: Pie chart for Gender distribution.
  • Risk Analysis: Stacked bar chart for Health_Category by Age.
  • KPI Cards: Displaying Avg Daily Usage and Avg Sleep Hours.
