
This project takes keywords and a page count from the user, fetches tweets matching those keywords from Twitter, extracts the hashtags from those tweets, and passes the hashtags to Spark for processing. It then launches a Flask server on localhost:5001 to show the data in a visual dashboard.


runskmr/Spark-tweet

 
 


Spark tweet analysis

This is our project for the Cloud Computing course. It takes keywords and a page count from the user, fetches tweets matching those keywords from Twitter, extracts the hashtags from those tweets, and passes the hashtags to Spark for processing. It then launches a Flask server on localhost:5001 to show the data in a visual dashboard.

Objective

To build a program with Apache Spark, a real-time stream-processing analytics platform, that processes tweets matching a given keyword, identifies the trending hashtags among them, and presents the top hashtags in a real-time dashboard.
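As a rough illustration of what "top hashtags" means here, the sketch below counts hashtags in plain Python. It is a stand-in for the Spark job, not the project's code: the function name top_hashtags and the choice to ignore case are assumptions made for the example.

```python
from collections import Counter


def top_hashtags(hashtags, n=10):
    """Return the n most frequent hashtags as (hashtag, count) pairs.

    Hashtags are lower-cased first so that '#Bitcoin' and '#bitcoin'
    count as the same tag (an assumption for this sketch).
    """
    counts = Counter(tag.lower() for tag in hashtags)
    return counts.most_common(n)


sample = ["#Bitcoin", "#crypto", "#bitcoin", "#gaming", "#BITCOIN", "#crypto"]
print(top_hashtags(sample, n=2))  # [('#bitcoin', 3), ('#crypto', 2)]
```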

Limitations

  • 450 queries per 15 minutes (enforced by Twitter API v2; see the Twitter API v2 documentation).
  • 500K queries per month (enforced by Twitter API v2; see the Twitter API v2 documentation).
  • We cannot fetch general tweets from Twitter. Tweets must be requested by keyword (enforced by Twitter API v2).
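The 15-minute limit translates into a minimum spacing between requests: 900 seconds / 450 requests = 2 seconds. The sketch below shows one way a client could pace itself to stay under that limit. The Throttle class and its parameters are hypothetical and are not taken from the project's code.

```python
import time

WINDOW_SECONDS = 15 * 60        # the 15-minute window from the list above
MAX_REQUESTS_PER_WINDOW = 450   # the per-window limit from the list above


def min_interval(max_requests=MAX_REQUESTS_PER_WINDOW, window=WINDOW_SECONDS):
    """Seconds to leave between requests so the window limit is never hit."""
    return window / max_requests


class Throttle:
    """Hypothetical helper: call wait() before each API request."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                # Too soon after the previous request: sleep off the rest.
                self._sleep(remaining)
                now = self._clock()
        self._last = now


print(min_interval())  # 2.0
```

A caller would create one Throttle(min_interval()) and call wait() before every request.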

Getting API keys from Twitter

The dataset for this project is tweets from Twitter, so we need access to the Twitter API.

  • Go to the developer portal dashboard.
  • Sign in with your developer account.
  • Create a new project. Give it a name, a use case based on the goal you want to achieve, and a description.
  • Choose ‘create a new App instead’ and give your App a name to create a new App.
  • If everything succeeds, you should see a page containing your keys and tokens. We will use the Bearer token to access the API.
  • Create a new file named keys.txt and put the Bearer token in it in the format below.
    token:<your_token_here>
    Make sure there are no spaces between token and the colon, or between the colon and <your_token_here>.
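To make the keys.txt format concrete, here is a minimal sketch of how such a file could be read. The function read_bearer_token is hypothetical, and the project's own parsing may differ. The sketch rejects whitespace around the colon, mirroring the rule above.

```python
from pathlib import Path


def read_bearer_token(path="keys.txt"):
    """Return the value of the 'token:<value>' line in the given file.

    Hypothetical helper: strict about spaces around ':' to match the
    format described in the README.
    """
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        key, sep, value = line.strip().partition(":")
        if key != "token" or not sep:
            continue  # not a well-formed token line, keep looking
        if not value or value != value.strip():
            raise ValueError("expected 'token:<your_token>' with no spaces around ':'")
        return value
    raise ValueError(f"no 'token:<your_token>' line found in {path}")


# Usage (assumes keys.txt exists in the current directory):
#   headers = {"Authorization": f"Bearer {read_bearer_token()}"}
```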

How the project works

  • First, we retrieve tweets from Twitter using the Twitter API v2.
  • The tweets are selected by keywords that the user specifies (see the "Running the Application" section).
  • The data is processed with PySpark, and the hashtags are separated from the tweets.
  • The tweets are then sent to Spark through a TCP socket.
  • Apache Spark processes the trending hashtags.
  • A Flask web app displays the data in a visual representation.
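The hashtag-extraction and TCP hand-off steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's code: the regular expression, the function names, and the one-hashtag-per-line wire format are all made up for the example. The producer listens and the consumer connects, which is the usual arrangement when Spark reads from a text socket, but the README does not say how the project does it. The consumer here is a plain Python client standing in for Spark.

```python
import re
import socket
import threading

HASHTAG_RE = re.compile(r"#\w+")  # '#' followed by letters, digits or '_'


def extract_hashtags(text):
    """Return all hashtags found in one tweet's text, in order."""
    return HASHTAG_RE.findall(text)


def demo(tweets):
    """Send the hashtags of each tweet over a local TCP socket, one per line,
    and return what the consuming side received."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def producer():
        conn, _ = server.accept()
        with conn:  # closing the connection signals end-of-stream
            for tweet in tweets:
                for tag in extract_hashtags(tweet):
                    conn.sendall((tag + "\n").encode("utf-8"))

    thread = threading.Thread(target=producer)
    thread.start()

    received = []
    with socket.create_connection(("127.0.0.1", port)) as client:
        with client.makefile("r", encoding="utf-8") as reader:
            for line in reader:
                received.append(line.rstrip("\n"))

    thread.join()
    server.close()
    return received


print(demo(["Buy #Bitcoin now! #crypto", "no tags here", "#gaming rocks"]))
# ['#Bitcoin', '#crypto', '#gaming']
```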

Running the Application

First steps...

  • The installed Java version must be compatible with PySpark. This project was written against PySpark 3.2.0, and Java 11 is the version known to work with it (the Spark 3.2 documentation lists Java 8 and 11 as supported). Check your Java version with java --version, and make sure only a compatible Java version is installed.
  • git clone https://github.com/HritwikSinghal/Spark-tweet.git
  • cd Spark-tweet
  • pip install -r ./requirements.txt
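If you would rather check the Java version from Python before starting Spark, something like the following could work. Both function names are hypothetical. The parser handles the output styles of java -version and java --version, including the legacy "1.8" numbering used by Java 8.

```python
import re
import subprocess


def java_major_version(version_output):
    """Extract the major Java version from version output.

    Handles 'openjdk version "11.0.12" ...', 'openjdk 11.0.12 ...' and the
    legacy 'java version "1.8.0_292"' style, where the major version is 8.
    """
    lines = version_output.strip().splitlines()
    match = re.search(r"(\d+)(?:\.(\d+))?", lines[0]) if lines else None
    if not match:
        raise ValueError("could not find a Java version number")
    major = int(match.group(1))
    if major == 1 and match.group(2):
        return int(match.group(2))  # '1.8' means Java 8
    return major


def installed_java_major():
    """Run 'java -version' (which prints to stderr) and parse the result."""
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return java_major_version(result.stderr or result.stdout)


print(java_major_version('openjdk version "11.0.12" 2021-07-20'))  # 11
```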

Now...

1. Automatic run

Run run.sh if you want the defaults, which are:

  • keywords = "corona bitcoin gaming Android climate cricket"
  • pages = 15 (per keyword)

Note that this opens a browser window and kills the app after 4 minutes. This does not happen with a manual run, and you can modify run.sh to change this behaviour.
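As a quick sanity check, the defaults can be compared against the 450-queries-per-15-minutes limit listed under Limitations. The arithmetic below assumes that each page of each keyword costs exactly one API query, which the README does not state.

```python
DEFAULT_KEYWORDS = "corona bitcoin gaming Android climate cricket"
DEFAULT_PAGES = 15            # pages per keyword
RATE_LIMIT_PER_WINDOW = 450   # queries per 15 minutes


def requests_needed(keywords, pages):
    """Total queries, assuming one query per page per keyword."""
    return len(keywords.split()) * pages


needed = requests_needed(DEFAULT_KEYWORDS, DEFAULT_PAGES)
print(needed, needed <= RATE_LIMIT_PER_WINDOW)  # 90 True
```

Under that assumption, one default run uses 6 × 15 = 90 queries, a fifth of the per-window limit.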

2. Manual run

Run the programs in the order below. NOTE: Run every step in a new terminal.

  1. Flask application:
     python3 ./app.py
  2. Spark application:
     export PYSPARK_PYTHON=python3
     export SPARK_LOCAL_HOSTNAME=localhost
     python3 ./spark_app.py
  3. Twitter application:
     python3 ./twitter_app.py -p <no_of_pages> -k <"keywords">

Replace <"keywords"> with the keywords you want to search for (the keywords must be in quotes, like "corona bitcoin gaming Android"), and <no_of_pages> with the number of pages you want from Twitter for each keyword.
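For illustration, the -p and -k options could be parsed with argparse as sketched below. Only the two flag names come from this README. The function name, the help texts, and the decision to split the quoted keyword string on whitespace are assumptions, and twitter_app.py may parse its arguments differently.

```python
import argparse


def parse_args(argv=None):
    """Parse '-p <no_of_pages> -k "<keywords>"' into pages and a keyword list."""
    parser = argparse.ArgumentParser(
        description="Fetch tweets for the given keywords (sketch)."
    )
    parser.add_argument("-p", dest="pages", type=int, required=True,
                        help="number of pages to fetch per keyword")
    parser.add_argument("-k", dest="keywords", required=True,
                        help='space-separated keywords in quotes, e.g. "corona bitcoin"')
    args = parser.parse_args(argv)
    # The quoted string arrives as one argument; split it into keywords.
    args.keywords = args.keywords.split()
    return args


args = parse_args(["-p", "15", "-k", "corona bitcoin gaming Android"])
print(args.pages, args.keywords)
# 15 ['corona', 'bitcoin', 'gaming', 'Android']
```

The quotes matter because the shell would otherwise pass each keyword as a separate argument.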

Visual representation

You can view the real-time data as a visual dashboard at either of the URLs below.

http://localhost:5001/

or

http://127.0.0.1:5001/
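If the page does not load, it can help to confirm that something is actually listening on port 5001. The helper below is a hypothetical convenience and not part of the project. It only checks that a TCP connection can be opened, not that the dashboard renders.

```python
import socket


def is_listening(host="127.0.0.1", port=5001, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# Usage: call is_listening() after starting app.py; False suggests the
# Flask server is not running (or is bound to a different port).
```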

Stopping the application

Run killall python3 in a new terminal. Note that this stops every running python3 process, not only this project's.

Final Output

TBD


Languages

  • Python 64.6%
  • HTML 30.0%
  • CSS 2.8%
  • Shell 2.6%