In this repository, I show how to analyze geo-coded social media data posted in Hong Kong. The general procedure is as follows:
1. Tweet filtering. For more information, please check the following Jupyter notebooks:
2. Tweet text preprocessing
   - See the clean the text sample notebook for how to get the raw Chinese tweet text
   - See the tweet cleaning notebook for how we clean, translate, and preprocess the tweets for this work
3. Generate tweet representations using FastText word embeddings trained on sentiment140
4. Manually label the sentiment of 5000 tweets randomly sampled from our tweet dataset
5. Build sentiment analysis classifiers and conduct cross-validation. To see how the word embedding model is trained on sentiment140, check the train_word_vectors_from_sentiment140 folder. To generate the representation for each tweet in our own dataset, see the emoji2vec notebook or the code
6. Cross-sectional analysis and longitudinal analysis
7. Difference-in-difference analysis
8. Result visualization (word cloud, topic modelling, etc.)
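
The difference-in-difference step in the procedure above is only named, so here is a minimal sketch of the basic estimator for readers who have not met it. It is not the code used in this repository: the function name `did_estimate`, the group labels, and the sentiment scores are all made up for illustration. The estimate is the change in the treated group minus the change in the control group.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Basic difference-in-difference estimate from four groups of scores.

    Each argument is a list of per-tweet sentiment scores (hypothetical here).
    """
    # Change over time in the treated group
    treated_change = mean(treated_post) - mean(treated_pre)
    # Change over time in the control group (the assumed counterfactual trend)
    control_change = mean(control_post) - mean(control_pre)
    # The difference of the two changes is the DiD estimate
    return treated_change - control_change


# Made-up sentiment scores, for illustration only
treated_pre = [0.10, 0.20, 0.30]
treated_post = [0.40, 0.50, 0.60]
control_pre = [0.10, 0.10, 0.10]
control_post = [0.20, 0.20, 0.20]

effect = did_estimate(treated_pre, treated_post, control_pre, control_post)
print(round(effect, 3))
```

This simple version only compares group means. A regression-based version with an interaction term gives the same point estimate and also provides standard errors.
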
In this project, I use Python 3.5 to analyze the tweets. You can install all required packages by running the following command in the command line:
pip install -r requirements.txt
However, the code in the transit_non_transit_comparison folder needs the ArcPy package for the geographical analysis. ArcPy is supported only on Python 2+ and can be imported only after ArcGIS has been installed.
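
Because ArcPy comes with ArcGIS rather than from requirements.txt, it can help to check whether the current interpreter can import it before running the geographical analysis. The following is a small sketch of such a check. The helper name `arcpy_available` is hypothetical and is not part of this repository.

```python
def arcpy_available():
    """Return True if this interpreter can import arcpy, False otherwise."""
    try:
        import arcpy  # noqa: F401  (only present in an ArcGIS Python environment)
    except ImportError:
        return False
    return True


if arcpy_available():
    print("arcpy found: the geographical analysis scripts can be run.")
else:
    print("arcpy not found: run these scripts with the Python that ArcGIS installs.")
```
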
To be continued...