
Teen Phone Addiction - Spark Streaming Pipeline

Built with Apache Spark, Kafka, Scala, PostgreSQL, and Docker.

A real-time data pipeline analyzing teen phone usage patterns and addiction risks. This project ingests CSV data, streams it through Kafka, processes and enriches it with Apache Spark Structured Streaming, and stores the resulting insights in PostgreSQL.

Architecture

The pipeline consists of the following components orchestrated via Docker Compose:

  1. Zookeeper: Manages the Kafka cluster state.
  2. Kafka: Serves as the message broker for real-time data.
  3. Spark Producer: Reads raw CSV data and publishes messages to Kafka.
  4. Spark Consumer: Consumes the Kafka stream, processes and enriches it with Spark Structured Streaming, and writes the insights to PostgreSQL (see the sketch after this list).
  5. PostgreSQL: Persistent storage for the processed analytics data.
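
The consumer is the core of the pipeline, so a minimal sketch of its shape is worth showing. This is illustrative only: the topic name (teen-phone-data), broker address (kafka:9092), and message schema are assumptions, while the table name and credentials follow the values used elsewhere in this README.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    object ConsumerSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("teen-phone-consumer")
          .getOrCreate()

        // Assumed message schema -- the real one mirrors the CSV columns in data/.
        val schema = new StructType()
          .add("age", IntegerType)
          .add("gender", StringType)
          .add("daily_usage_hours", DoubleType)
          .add("sleep_hours", DoubleType)

        // Subscribe to the topic the producer publishes to (names assumed).
        val raw = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "teen-phone-data")
          .load()

        // Kafka delivers bytes; decode the value and parse the JSON payload.
        val parsed = raw
          .selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).as("record"))
          .select("record.*")

        // Structured Streaming has no built-in JDBC sink, so each micro-batch
        // is appended to PostgreSQL through foreachBatch.
        val query = parsed.writeStream
          .foreachBatch { (batch: DataFrame, _: Long) =>
            batch.write
              .format("jdbc")
              .option("url", "jdbc:postgresql://postgres:5432/teen_addiction_db")
              .option("dbtable", "teen_phone_data")
              .option("user", "postgres")
              .option("password", "postgrespw")
              .mode("append")
              .save()
          }
          .start()

        query.awaitTermination()
      }
    }

The foreachBatch pattern is what makes the JDBC write possible: Spark ships no streaming JDBC sink, so each micro-batch is flushed as a regular batch write.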

Getting Started

Prerequisites

  • Docker and Docker Compose (all services run in containers)
  • Git, to clone the repository

Installation & Run

  1. Clone the repository:

    git clone <repository-url>
    cd spark-streaming-end-to-end
  2. Start the services: This command will build the Spark applications and start all containers in the background.

    docker-compose up --build -d
  3. Check the logs: Monitor the producer and consumer to see the pipeline in action.

    # View Producer logs (sending data)
    docker-compose logs -f spark-producer
    
    # View Consumer logs (processing and saving data)
    docker-compose logs -f spark-consumer
  4. Verify data in PostgreSQL: Connect to the Postgres database to query the results.

    docker exec -it postgres psql -U postgres -d teen_addiction_db
    
    # Inside psql shell:
    SELECT * FROM teen_phone_data LIMIT 10;
  5. Stop the application:

    docker-compose down

Project Structure

.
├── config/              # (Deprecated/Internal) Config files
├── data/
│   └── *.csv           # Source dataset
├── sql/
│   └── schema.sql      # Database initialization script
├── src/main/scala/
│   ├── producer/       # Producer application code
│   └── consumer/       # Consumer application code
├── build.sbt           # Scala Build Tool configuration
├── Dockerfile          # Multi-stage Docker build for Spark apps
└── docker-compose.yaml # Orchestration of all services
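
The build.sbt declares the Spark, Kafka connector, and PostgreSQL JDBC dependencies. A minimal sketch of its shape (version numbers here are illustrative, not the project's pinned ones):

    // Sketch of build.sbt -- versions are illustrative.
    name := "spark-streaming-end-to-end"
    version := "0.1.0"
    scalaVersion := "2.12.18" // Spark 3.x is published for Scala 2.12/2.13

    val sparkVersion = "3.5.1"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"            % sparkVersion % Provided,
      "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion,
      "org.postgresql"    % "postgresql"           % "42.7.3" // JDBC driver for the sink
    )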

Data & Processing

The pipeline computes a Risk Score and a Health Category (Low/Moderate/High) for each record, based on the following factors (an illustrative sketch follows the list):

  • Daily usage hours
  • Sleep hours (calculating sleep deficit)
  • Physical exercise
  • Bedtime screen usage
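
The exact weights and cut-offs live in the consumer code; the sketch below only illustrates how such a scoring step looks with Spark SQL functions. Column names, weights, and thresholds are assumptions, not the project's real values.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Illustrative scoring step -- weights, thresholds, and column names are
    // assumptions, not the values used by the actual consumer.
    def withRiskScore(df: DataFrame): DataFrame =
      df.withColumn("sleep_deficit", greatest(lit(8.0) - col("sleep_hours"), lit(0.0)))
        .withColumn("risk_score",
          col("daily_usage_hours") * 1.0 +
          col("sleep_deficit") * 1.5 +
          col("bedtime_screen_hours") * 2.0 -
          col("exercise_hours") * 0.5)
        .withColumn("health_category",
          when(col("risk_score") < 5, "Low")
            .when(col("risk_score") < 10, "Moderate")
            .otherwise("High"))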

Visualization (Power BI)

To visualize the real-time insights processed by the pipeline, you can connect Power BI to the PostgreSQL database.

1. Connection Settings

Connect Power BI Desktop to PostgreSQL using the following credentials:

  • Server: localhost (if running locally)
  • Port: 5432
  • Database: teen_addiction_db
  • Authentication: Database
  • User: postgres
  • Password: postgrespw (or check your .env)

2. Recommended Data Source

We have prepared a Materialized View optimized for visualization:

  • Table/View: teen_addiction_summary
  • Connectivity Mode:
    • DirectQuery: For real-time updates (recommended); note the refresh caveat and sketch below.
    • Import: For better performance with static snapshots.
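
A PostgreSQL materialized view is a stored snapshot, so DirectQuery only sees rows present at the last refresh. A minimal sketch of one way to keep it fresh, assuming the consumer calls a helper like this after each micro-batch (the helper and its wiring are hypothetical):

    import java.sql.DriverManager

    // Hypothetical helper: refresh the summary view so DirectQuery dashboards
    // see the latest micro-batch (uses the in-network host name "postgres").
    def refreshSummaryView(): Unit = {
      val conn = DriverManager.getConnection(
        "jdbc:postgresql://postgres:5432/teen_addiction_db", "postgres", "postgrespw")
      try conn.createStatement().execute("REFRESH MATERIALIZED VIEW teen_addiction_summary")
      finally conn.close()
    }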

3. Key Metrics & Visuals

  • Addiction Trends: Line chart using time_bucket (X-axis) and avg_risk_score (Y-axis).
  • Demographics: Pie chart for Gender distribution.
  • Risk Analysis: Stacked bar chart for Health_Category by Age.
  • KPI Cards: Displaying Avg Daily Usage and Avg Sleep Hours.
