
Commit 8eb8484

update README
1 parent c60fc30 commit 8eb8484

File tree

2 files changed: +195 −103 lines

readme.md

+80 −103

# Twitter and Foursquare data collection and processing modules

This is a "renovated" collection of Python code used in a research paper on topic modelling of texts ("shouts") from a location-based social network (Foursquare). Data is gathered initially from the Twitter Streaming API as tweets, which are then cross-referenced through the Foursquare/Swarm API to retrieve the original shout text. Shouts are then filtered, processed, tokenized, and analyzed with the LDA (Latent Dirichlet Allocation) algorithm. The resulting LDA model can be visualized using several techniques. The code assumes input data of shouts/check-ins, each associated with a user and a "venue"; a venue is a place with a name, coordinates, and a venue_id. The visualization techniques are primarily used to compare venues to each other.

With this restructured code, users can choose to run the LDA algorithm on an arbitrary collection of texts, or use the original workflow involving Foursquare venue information and its visualization techniques.

## Requirements

1. Python version 2.7
2. Other dependencies listed in requirements.txt
3. Twitter API access credentials for data scraping (if you need to scrape Twitter data)
4. Foursquare API access token for check-in referencing (if you work with Foursquare data)

## Installation

1. Clone the repository or download the whole directory, and change the working directory to the project root.
2. Use pip or virtualenv to install the dependencies listed in requirements.txt.
3. Create a "twitterAuth.json" file containing the credentials needed for the Twitter Streaming API (see the format below in the Configurations section).

## Typical Usage

Some typical scenarios that this module can be used for are listed here with their corresponding scripts.

1. Scrape Twitter data from the Streaming API. (`twitterStreamClient.py`)
2. Filter the Twitter data further (e.g. by language or location). (`filter.py`)
3. Cross-reference the shout tweets with the Foursquare API to retrieve venue information. (`twitter2foursquare.py`)
4. Run the LDA algorithm on a collection of texts. (`myDriver.py` or a customized driver)
5. Run the LDA algorithm on the Foursquare shouts and venue information. (`myDriver.py` or a customized driver)
6. Visualize the learnt LDA model with `pyLDAvis`. (`myDriver.py` or a customized driver)
7. Visualize the Foursquare data with various techniques. (`myDriver.py` or a customized driver)

## Configurations

### Database

Change the name of the SQLite database file (`.sqlite`) directly in `twitterLda/sqlite_models.py`. The default is `10-27.sqlite`, the original database used in the student's paper.
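
For illustration, the line to edit looks something like this (a hypothetical excerpt; the actual variable name in `twitterLda/sqlite_models.py` may differ):

```python
# Hypothetical excerpt from twitterLda/sqlite_models.py:
# point this at your own database file.
DATABASE_FILE = '10-27.sqlite'
```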

### Project path

Some paths of the project folder structure are defined in `twitterLda/projectPath.py`. Configure them according to your needs.

### Twitter Streaming API filter

The `filter` endpoint of the Streaming API is used to scrape data. The parameters used to filter the tweets should be configured directly in `twitterStreamClient.py`.
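
Conceptually, the filter amounts to parameters like these (names illustrative; the original scraper tracked the terms "4sq", "foursquare", and "swarmapp" and kept only English tweets):

```python
# Illustrative filter parameters for the Streaming API `filter` endpoint;
# configure the real values inside twitterStreamClient.py.
TRACK_TERMS = ['4sq', 'foursquare', 'swarmapp']
LANGUAGES = ['en']
```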

### twitterAuth.json

Authentication file for accessing the Twitter Streaming API. It must contain a JSON object with the fields `"consumerKey"`, `"consumerSecret"`, `"token"`, and `"tokenSecret"`.

Sample file:

```json
{
  "consumerKey": "ZIJyJlEMxxxxxxxzrPbqLFSnO",
  "consumerSecret": "Xa1Vq8Ze5i7xxxxxxxxxxxxxxxxxmWekiAGPXEoXinhp5opE2",
  "token": "3264530906-hJ0arh1odxxxxxxxxxxxxxxygzA91WNf0",
  "tokenSecret": "VsmPmWdYZ3R2xxxxxxxxxxxxxxxxxxxxxxWpj847y6ms"
}
```
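
For illustration, a minimal sketch of how these credentials can be consumed, assuming the `tweepy` library (the repository's own client may do this differently):

```python
import json

import tweepy  # assumed here for illustration only

# Load the credentials from twitterAuth.json and build an OAuth handler.
with open('twitterAuth.json') as f:
    creds = json.load(f)

auth = tweepy.OAuthHandler(creds['consumerKey'], creds['consumerSecret'])
auth.set_access_token(creds['token'], creds['tokenSecret'])
api = tweepy.API(auth)
```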

### Foursquare API key

...is specified directly in the source file `twitter2foursquare.py`. Replace it with your own key before use: the key currently in the file is inherited from the original code, and any key published in a public repository should be treated as compromised.

### LDA driver script

The LDA (Latent Dirichlet Allocation) library is called by a driver script executed from the root directory. The example code, `myDriver.py`, demonstrates the basic functionality of the driver. The driver can be separated into two parts: the LDA model and visualization.

#### LDA model

LDA model and corpus learning or loading is controlled by several parameters passed when constructing the driver object, `LdaDriver` from the package `twitterLda.lda_driver`. All available parameters are listed below; a construction sketch follows the list.

- project_name: Name of the project. All corpora and models are stored in the directory `data/ldaProjects/{project_name}`.
- corpus_type: Type of corpus to use (must be one of the strings `twokenize`, `gensim`, or `tweet`).
- num_topics: Number of topics assumed to be present in the corpus.
- num_passes: Number of passes to run the LDA algorithm.
- alpha: Type of alpha prior used to learn the LDA model (must be one of `symmetric` or `auto`).
- docIterFunc: Function returning a generator that yields one document of the corpus per iteration. This function will be called multiple times, each time producing a fresh generator starting at the first document. Several "IterFunc"s are available in the package `twitterLda.fileReader`.
- make_corpus: (`True`, `False`) Whether to extract a new corpus from the documents. If set to `False`, a previously generated corpus will be loaded.
- make_lda: (`True`, `False`) Whether to learn a new LDA model. If set to `False`, a previously learnt model will be loaded.
- make_venues: (`True`, `False`) ___Only for shout information generated by `twitter2foursquare.py` or output from the Foursquare API.___ Whether to generate the database and index of venues present in the dataset. Used for visualizing venue data.
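
A construction sketch based on the parameter list above. The numeric values mirror the example in the old README; `project_name` and the `docIterFunc` choice are illustrative (pick an iterator function from `twitterLda.fileReader` that matches your data):

```python
from twitterLda.lda_driver import LdaDriver
from twitterLda import fileReader  # provides the "IterFunc" generators

driver_settings = {
    'project_name': 'my_project',    # illustrative; output goes to data/ldaProjects/my_project
    'corpus_type': 'twokenize',      # 'twokenize', 'gensim', or 'tweet'
    'num_topics': 30,
    'num_passes': 20,
    'alpha': 'symmetric',            # 'symmetric' or 'auto'
    'docIterFunc': fileReader.someIterFunc,  # hypothetical name; see twitterLda.fileReader
    'make_corpus': True,             # extract a fresh corpus from the documents
    'make_lda': True,                # learn a new model; False loads a saved one
    'make_venues': False,            # True only for Foursquare venue data
}
driver = LdaDriver(**driver_settings)
```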

#### Visualization

##### pyLDAvis

A library for visualizing an arbitrary LDA model in the browser.
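
With a driver object in hand, this is a single call (method name taken from the original `lda_driver.py` interface):

```python
# Open the learnt LDA model in a pyLDAvis browser view.
driver.vis_ldavis()
```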

##### Venue visualization

Available visualization techniques:

- Multidimensional scaling (MDS)
- Distance matrix (Hellinger Distance)
  - matplotlib
  - text
- Week-day distribution of Hellinger Distance (?)
- Top terms for each topic
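
The venue-comparison calls, as shown in the driver example in the old README:

```python
# Compare venues using the learnt model (from the original driver interface).
venue_names = [ven.name for ven in driver.vens]
driver.vis_heatmap(driver.dist_matrix, venue_names)  # Hellinger distance heatmap
driver.vis_MDS(driver.dist_matrix, venue_names)      # multidimensional scaling plot
```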

Look at the example driver and `twitterLda/lda_driver.py` for more information.

readme_old.md

+115

# Twitter and Foursquare data collection and processing modules

This is a collection of Python code used in a research paper on topic modelling of texts ("shouts") from a location-based social network (Foursquare). Data is gathered initially from the Twitter Streaming API as tweets, which are then cross-referenced through the Foursquare/Swarm API to retrieve the original shout text. Shouts are then filtered, processed, tokenized, and analyzed with the LDA (Latent Dirichlet Allocation) algorithm. The resulting LDA model can be visualized using several techniques. The code assumes input data of shouts/check-ins, each associated with a user and a "venue"; a venue is a place with a name, coordinates, and a venue_id. The visualization techniques are primarily used to compare venues to each other.

## Hyper parameters

- Tokenization library to use (Twokenize, Gensim, or NLTK Tweet Tokenizer) => (twokenize, gensim, tweet)
- no. of topics
- no. of passes to run LDA
- Type of alpha (auto, symmetric)

## Visualizations

- Multidimensional scaling (MDS)
- Distance matrix (Hellinger Distance)
  - matplotlib
  - text
- Week-day distribution of Hellinger Distance (?)
- Top terms for each topic

## Modules

Basically a flow of information from top to bottom.

___Most files contain hardcoded directory names. Users should make sure that the names are correct before use.___

#### scrape.py

- Collects tweets from the Twitter Streaming API that
  - are in English
  - contain the terms "4sq", "foursquare", or "swarmapp" (hardcoded)
- Usage:
  - edit the consumer key, consumer secret, token, and token secret in the source file
  - optionally edit the output file name
  - run the script as main

#### filter.py

- Keeps only tweets that
  - are in English
  - have geo data
  - are in California
- Also splits all tweets into files as chunks of 500 tweets each, to be used in the next step
- Usage:
  - edit the file names and directories in the file (hardcoded)
  - run the script as main

#### twitter2foursquare.py

- Reads tweets and cross-references them with Foursquare API shout data
- 500 requests are made per hour, as per the Foursquare API limit
- Usage:
  - edit the access token in the file
  - edit the file names and directories in the file (hardcoded)
  - run the script as main

#### transform_shouts.py

- Selects only the important fields from the shout data
- Usage:
  - edit the file names and run the script

#### sqlite_models.py

- Contains the definition of the sqlite database model used to store venue information
- Also includes the code to create the database
- Usage:
  - edit the file names to read the transformed shouts file, then run the script once to insert the data into the database

#### sqlite_quries.py

- Contains scripts to access the database and gather information
- Also includes code to generate an *essential* collection of text files containing the data from each venue
  - generates one file per venue in the /data/ven folder
- Usage:
  - edit the file names and run the script once after each round of data collection to generate the files from the database

#### lda_driver.py

- Provides a class "LdaDriver" used to train an LDA model on the collected shout data
- Also houses the visualization part
- Usage:
  - create an LdaDriver object with the desired parameters
    - corpus_type: tokenizer to use (twokenize, gensim, tweet)
    - num_topics: number of topics
    - num_passes: number of passes
    - alpha: type of alpha to use in learning (symmetric, auto)
    - make_corpus: whether to create a corpus for the tokenizer (True, False) (use this when using a new tokenizer or a new dataset)
    - make_lda: whether to train the LDA model (True, False) (use this when trying a new set of parameters; can be set to False to only visualize)
  - example:

    ```python
    driver_settings = {'corpus_type': 'twokenize',
                       'num_topics': 30,
                       'num_passes': 20,
                       'alpha': 'symmetric',
                       'make_corpus': False,
                       'make_lda': True}
    driver = LdaDriver(**driver_settings)
    ```

  - call methods of LdaDriver for different visualizations
    - .vis_heatmap
    - .vis_MDS
    - .vis_ldavis
    - .print_dist_matrix
  - also functions from the namespace
    - process_temporal_shouts_by_weekday(venue_id)
  - example:

    ```python
    driver.vis_heatmap(driver.dist_matrix, [ven.name for ven in driver.vens])
    driver.vis_MDS(driver.dist_matrix, [ven.name for ven in driver.vens])
    driver.vis_ldavis()
    ```

- Read the code comments for more information
