A real-time data pipeline analyzing teen phone usage patterns and addiction risks. This project ingests CSV data, streams it through Kafka, processes/enriches it with Apache Spark Structured Streaming, and stores the insights in PostgreSQL.
The pipeline consists of the following components orchestrated via Docker Compose:
- Zookeeper: Manages the Kafka cluster state.
- Kafka: Serves as the message broker for real-time data.
- Spark Producer: Reads raw CSV data and publishes messages to Kafka.
- Spark Consumer: Consumes messages from Kafka, enriches them with Spark Structured Streaming, and writes the resulting analytics to PostgreSQL (sketched below).
- PostgreSQL: Persistent storage for the processed analytics data.
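
For orientation, here is a minimal sketch of the consumer side. The real implementation lives in `src/main/scala/consumer/`; the topic name, the simplified schema, and the broker/database addresses below are illustrative assumptions based on the service names in this README, not the project's exact values.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

object ConsumerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("teen-phone-consumer").getOrCreate()

    // Simplified schema for the sketch; the real one mirrors the CSV columns.
    val schema = StructType(Seq(
      StructField("Age", IntegerType),
      StructField("Gender", StringType),
      StructField("Daily_Usage_Hours", DoubleType),
      StructField("Sleep_Hours", DoubleType)
    ))

    // Read the raw JSON messages the producer publishes to Kafka.
    val rows = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka:9092") // assumed broker address
      .option("subscribe", "teen_phone_data")          // assumed topic name
      .load()
      .select(from_json(col("value").cast("string"), schema).as("r"))
      .select("r.*")

    // Persist each micro-batch to PostgreSQL over JDBC.
    rows.writeStream
      .foreachBatch { (batch: DataFrame, _: Long) =>
        batch.write
          .format("jdbc")
          .option("driver", "org.postgresql.Driver")
          .option("url", "jdbc:postgresql://postgres:5432/teen_addiction_db")
          .option("dbtable", "teen_phone_data")
          .option("user", "postgres")
          .option("password", "postgrespw")
          .mode("append")
          .save()
      }
      .start()
      .awaitTermination()
  }
}
```

`foreachBatch` is used here because Structured Streaming has no built-in streaming JDBC sink; each micro-batch is written with the regular batch JDBC writer.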
Prerequisites: Docker and Docker Compose installed on your machine.

- Clone the repository:

  ```bash
  git clone <repository-url>
  cd spark-streaming-end-to-end
  ```
- Start the services. This command builds the Spark applications and starts all containers in the background:

  ```bash
  docker-compose up --build -d
  ```
- Check the logs. Monitor the producer and consumer to see the pipeline in action:

  ```bash
  # View producer logs (sending data)
  docker-compose logs -f spark-producer

  # View consumer logs (processing and saving data)
  docker-compose logs -f spark-consumer
  ```
- Verify the data in PostgreSQL. You can connect to the Postgres database to query the results:

  ```bash
  docker exec -it postgres psql -U postgres -d teen_addiction_db
  ```

  Inside the psql shell:

  ```sql
  SELECT * FROM teen_phone_data LIMIT 10;
  ```
- Stop the application:

  ```bash
  docker-compose down
  ```
Project structure:

```
.
├── config/               # (Deprecated/Internal) Config files
├── data/
│   └── *.csv             # Source dataset
├── sql/
│   └── schema.sql        # Database initialization script
├── src/main/scala/
│   ├── producer/         # Producer application code
│   └── consumer/         # Consumer application code
├── build.sbt             # Scala Build Tool configuration
├── Dockerfile            # Multi-stage Docker build for Spark apps
└── docker-compose.yaml   # Orchestration of all services
```
The pipeline calculates a Risk Score and a Health Category (Low/Moderate/High) based on the following factors (an illustrative sketch follows the list):
- Daily usage hours
- Sleep hours (used to derive a sleep deficit)
- Physical exercise
- Bedtime screen usage
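
The exact formula lives in the consumer code; purely as an illustration, with invented weights, thresholds, and an assumed 8-hour sleep target, the scoring could look like this:

```scala
// Illustrative only: the weights, thresholds, and 8-hour sleep target
// below are assumptions for this sketch, not the pipeline's actual formula.
case class Usage(
  dailyUsageHours: Double,
  sleepHours: Double,
  exerciseHoursPerWeek: Double,
  bedtimeUsageHours: Double
)

def riskScore(u: Usage): Double = {
  val sleepDeficit = math.max(0.0, 8.0 - u.sleepHours) // hours short of 8h
  u.dailyUsageHours * 1.0 +
    sleepDeficit * 1.5 +
    u.bedtimeUsageHours * 2.0 -      // bedtime screen time weighted heaviest here
    u.exerciseHoursPerWeek * 0.5     // exercise offsets risk
}

def healthCategory(score: Double): String =
  if (score < 5.0) "Low"
  else if (score < 10.0) "Moderate"
  else "High"
```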
To visualize the real-time insights processed by the pipeline, you can connect Power BI to the PostgreSQL database.
Connect Power BI Desktop to PostgreSQL using the following credentials:

- Server: `localhost` (if running locally)
- Port: `5432`
- Database: `teen_addiction_db`
- Authentication: Database
- User: `postgres`
- Password: `postgrespw` (or check your `.env`)
We have prepared a Materialized View optimized for visualization:

- Table/View: `teen_addiction_summary`
- Connectivity Mode:
  - DirectQuery: for real-time updates (recommended).
  - Import: for better performance with static snapshots.
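
One practical note: a PostgreSQL materialized view only reflects newly streamed rows after a `REFRESH MATERIALIZED VIEW`, so even a DirectQuery dashboard is only as fresh as the last refresh. A minimal Scala/JDBC sketch of refreshing and spot-checking the view, reusing the credentials above (the refresh cadence is up to you):

```scala
import java.sql.DriverManager

object RefreshSummary {
  def main(args: Array[String]): Unit = {
    // Connection details match the Power BI settings above.
    val conn = DriverManager.getConnection(
      "jdbc:postgresql://localhost:5432/teen_addiction_db", "postgres", "postgrespw")
    try {
      val stmt = conn.createStatement()
      // Fold newly streamed rows into the summary view.
      stmt.execute("REFRESH MATERIALIZED VIEW teen_addiction_summary")
      // Spot-check the refreshed view.
      val rs = stmt.executeQuery(
        "SELECT time_bucket, avg_risk_score FROM teen_addiction_summary LIMIT 5")
      while (rs.next()) println(s"${rs.getString(1)}  ${rs.getDouble(2)}")
    } finally conn.close()
  }
}
```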
Suggested visuals:

- Addiction Trends: line chart using `time_bucket` (X-axis) and `avg_risk_score` (Y-axis).
- Demographics: pie chart of the `Gender` distribution.
- Risk Analysis: stacked bar chart of `Health_Category` by `Age`.
- KPI Cards: displaying `Avg Daily Usage` and `Avg Sleep Hours`.