Technologies used: Apache Kafka, Spark Structured Streaming, Confluent Cloud, Databricks, Delta Lake, Spark NLP
All details of the project are described HERE.
The aim of the Starbucks Twitter Sentiment Analysis project is to build an end-to-end Twitter data streaming pipeline for brand sentiment analysis.
- Set up the Virtual Environment
pip install virtualenv
virtualenv --version # test your installation
virtualenv ccloud-venv
source ccloud-venv/bin/activate # activate the environment
As in the previous post, we need Twitter API credentials. After obtaining them, save the credentials in a .env file. Make sure to add the .env file to .gitignore so it is never committed.
# .env
CONSUMER_KEY = "<api key>"
CONSUMER_SECRET = "<api secret>"
ACCESS_TOKEN_KEY = "<access key>"
ACCESS_TOKEN_SECRET = "<access secret>"
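As a quick check that the credentials are picked up correctly, the producer can load them with python-dotenv and build a Twitter client. This is a minimal sketch, assuming the python-dotenv and tweepy (v4) packages are installed; it is not the project's actual code.

# check_credentials.py (illustrative)
import os
import tweepy
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

client = tweepy.Client(
    consumer_key=os.getenv("CONSUMER_KEY"),
    consumer_secret=os.getenv("CONSUMER_SECRET"),
    access_token=os.getenv("ACCESS_TOKEN_KEY"),
    access_token_secret=os.getenv("ACCESS_TOKEN_SECRET"),
)
print(client.get_me())  # should return your own Twitter account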
Confluent Cloud is a resilient, scalable streaming data service based on Apache Kafka®, delivered as a fully managed service. It lets users manage cluster resources easily.
First, create a free Confluent Cloud account and create a Kafka cluster in Confluent Cloud. I created a Basic cluster with single-zone availability on the AWS cloud provider.
From the navigation menu, click Topics, and on the Topics page, click Create topic. I set the topic name to tweet_data with 2 partitions; once created on the Kafka cluster, the topic is available to producers and consumers.
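If you prefer to script this step instead of using the UI, the confluent-kafka Python client's AdminClient can create the same topic. This is only a sketch; it assumes you already have the bootstrap server and API key/secret collected in the next two steps.

# create_topic.py (illustrative alternative to the UI)
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({
    "bootstrap.servers": "<HOST>:9092",
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "<API_KEY>",
    "sasl.password": "<API_SECRET>",
})

# 2 partitions; replication factor 3 is the Confluent Cloud default
futures = admin.create_topics([NewTopic("tweet_data", num_partitions=2, replication_factor=3)])
for topic, future in futures.items():
    future.result()  # raises if the topic could not be created
    print(f"Created topic {topic}")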
From the navigation menu, click API keys under Data Integration. If no API key is available, click Add key to create a new one (API_KEY, API_SECRET) and make sure to save it somewhere safe.
From the navigation menu, click Cluster settings under Cluster Overview. The Identification block contains the Bootstrap server information. Make sure to save it somewhere safe. It should look similar to pkc-w12qj.ap-southeast-1.aws.confluent.cloud:9092
HOST = pkc-w12qj.ap-southeast-1.aws.confluent.cloud
vi $HOME/.confluent/python.config
Press i and copy and paste the contents below:
#kafka
bootstrap.servers={HOST}:9092
security.protocol=SASL_SSL
sasl.mechanisms=PLAIN
sasl.username={API_KEY}
sasl.password={API_SECRET}
Then, replace HOST, API_KEY, and API_SECRET with the values from Step 3. Press Esc, then type :wq to save the file and quit.
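The producer later reads this file into a dictionary that can be passed straight to confluent_kafka.Producer. A minimal parsing helper in the spirit of Confluent's examples might look like the following; the name read_ccloud_config is not part of any library, just an illustration.

# config_helper.py (illustrative)
def read_ccloud_config(config_file):
    """Parse a librdkafka-style key=value file into a config dict."""
    conf = {}
    with open(config_file) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                key, value = line.split("=", 1)
                conf[key] = value.strip()
    return conf

# Usage:
# from confluent_kafka import Producer
# producer = Producer(read_ccloud_config("/root/.confluent/librdkafka.config"))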
Check HERE for the procedure for creating a Databricks cluster.
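On the Databricks side, the pipeline reads the tweet_data topic with Spark Structured Streaming and lands it in Delta Lake before Spark NLP scores the sentiment. The exact notebook code is in the linked post; this is only a sketch of the read side, assuming the same bootstrap server and API key/secret as above.

# Databricks notebook sketch (illustrative)
raw_tweets = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "<HOST>:9092")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("kafka.sasl.jaas.config",
            'org.apache.kafka.common.security.plain.PlainLoginModule required '
            'username="<API_KEY>" password="<API_SECRET>";')
    .option("subscribe", "tweet_data")
    .option("startingOffsets", "earliest")
    .load()
)

# Kafka values arrive as bytes; cast to string and persist to a Delta table
(raw_tweets.selectExpr("CAST(value AS STRING) AS tweet_json", "timestamp")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/tweet_data")
    .toTable("bronze_tweets"))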
# Dockerfile
FROM python:3.7-slim
# Install the producer's Python dependencies
COPY requirements.txt /tmp/requirements.txt
RUN pip3 install -U -r /tmp/requirements.txt
# Copy the producer code and run it against the mounted Confluent config
COPY producer/ /producer
CMD [ "python3", "producer/producer.py", \
      "-f", "/root/.confluent/librdkafka.config", \
      "-t", "<your-kafka-topic-name>" ]
# cd <your-project_folder>
# source ./ccloud-venv/bin/activate
bash run.sh
Click here to check the presentation file