In this repository, I show how to analyze geo-coded social media data posted in Hong Kong. The general procedure is as follows:
- Tweet filtering. For more information, please check the corresponding Jupyter notebooks.
- Tweet text preprocessing (a rough cleaning sketch follows this list)
  - Please check the clean the text sample notebook to see how to get the raw Chinese tweet text
  - Please check the tweet cleaning notebook to see how we clean, translate, and preprocess the tweets for this work
- Generate tweet representations using FastText word embeddings based on sentiment140. To see how the word embedding model is trained on sentiment140, please check the train_word_vectors_from_sentiment140 folder; to generate the representation for each tweet of our own dataset, please visit the emoji2vec notebook or the code get_tweet_representation.py. A minimal embedding sketch also follows this list.
- Manually label the sentiment of 5000 tweets randomly sampled from our tweet dataset
- Build sentiment analysis classifiers and conduct cross-validation (a cross-validation sketch follows this list)
- Cross-sectional analysis and longitudinal analysis
- Difference-in-difference analysis (a toy regression sketch follows this list)
- Result visualization (word cloud, topic modelling, etc.)
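The exact preprocessing steps live in the notebooks referenced above. As a rough illustration only, here is a minimal cleaning sketch; the regex patterns are my assumptions, not the notebooks' actual pipeline (which also handles translating the Chinese text):

```python
import re

def clean_tweet(text):
    """Minimal cleaning sketch: strip URLs, mentions, and extra whitespace.
    The real pipeline (see the tweet cleaning notebook) also translates
    Chinese tweets to English before further preprocessing."""
    text = re.sub(r'https?://\S+', '', text)   # remove URLs
    text = re.sub(r'@\w+', '', text)           # remove @mentions
    text = re.sub(r'#', '', text)              # keep hashtag words, drop '#'
    text = re.sub(r'\s+', ' ', text).strip()   # collapse whitespace
    return text.lower()

print(clean_tweet("Check this out @user https://t.co/xyz  #HongKong"))
# -> "check this out hongkong"
```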
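For the tweet representation step, a common approach, and a plausible reading of this pipeline (the real code is in train_word_vectors_from_sentiment140 and get_tweet_representation.py), is to train FastText on the sentiment140 corpus and average word vectors per tweet. The toy corpus and hyperparameters below are placeholders:

```python
import numpy as np
from gensim.models import FastText

# Placeholder corpus standing in for tokenized sentiment140 tweets.
corpus = [["happy", "day"], ["sad", "news", "today"], ["great", "happy", "news"]]

# Train a small FastText model; the real hyperparameters will differ.
# (gensim >= 4.0 API; older versions use size= instead of vector_size=.)
model = FastText(sentences=corpus, vector_size=100, window=5, min_count=1, epochs=10)

def tweet_vector(tokens, model):
    """Average the word vectors of a tweet's tokens. FastText's subword
    information lets it produce vectors even for out-of-vocabulary words."""
    vecs = [model.wv[t] for t in tokens]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

print(tweet_vector(["happy", "news"], model).shape)  # (100,)
```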
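For the classifier step, a minimal scikit-learn cross-validation sketch is shown below. Logistic regression is a stand-in, not necessarily the classifier used in this work, and X/y are random placeholders for the embedded tweets and the 5000 manual sentiment labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data: one 100-d embedding vector per labelled tweet,
# plus a binary sentiment label for each of the 5000 sampled tweets.
rs = np.random.RandomState(0)
X = rs.normal(size=(5000, 100))
y = rs.randint(0, 2, size=5000)

# 5-fold cross-validation of a stand-in classifier.
clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
```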
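For the difference-in-difference step, the standard design regresses the outcome on a treatment indicator, a period indicator, and their interaction, whose coefficient is the DiD estimate. The toy data and column names below are hypothetical, not the repository's actual variables:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: 'treated' marks tweets in a treatment area,
# 'post' marks the period after the event, 'sentiment' is the outcome.
df = pd.DataFrame({
    'sentiment': [0.2, 0.3, 0.1, 0.4, 0.5, 0.9, 0.2, 0.3],
    'treated':   [0,   0,   1,   1,   0,   1,   0,   1],
    'post':      [0,   1,   0,   1,   0,   1,   1,   0],
})

# The coefficient on treated:post is the difference-in-difference estimate.
model = smf.ols('sentiment ~ treated + post + treated:post', data=df).fit()
print(model.params['treated:post'])
```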
In this project, I use Python 3.5 to analyze the tweets. You can install all relevant packages by running the following command in the command line:
pip install -r requirements.txt
However, in the transit_non_transit_comparison folder, you need the ArcPy package to do the geographical analysis. This package is only supported in Python 2 and can only be imported after installing ArcGIS.
To be continued...