6.S191 Deep Learning Project: Classifying Activities from Information of Various Source Types
Snapshot of Reviews Collected From Yelp
Snapshot of Website Information Collected from Google
Specifically, I removed all stop words, numbers, punctuation, and non-English words (misspellings were not accounted for). I then tokenized the remaining text into words and stored the tokens in an array.
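A minimal sketch of a cleaning step like the one described above, not the exact code used in the project. The two word lists are tiny hypothetical placeholders (a real run would load full stop-word and English dictionaries), and the function name `preprocess` is an assumption made for illustration.

```python
import re

# Hypothetical, deliberately tiny word lists for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "of", "to", "in", "it", "we"}
ENGLISH_WORDS = {"escape", "room", "fun", "puzzle", "great", "staff", "friendly"}

def preprocess(text, stop_words=STOP_WORDS, english_words=ENGLISH_WORDS):
    """Return a list of lowercase word tokens with noise removed."""
    # Keeping only runs of letters drops numbers and punctuation in one pass.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words and anything not in the English word list.
    # Misspelled words are not corrected, so they are dropped as well.
    return [t for t in tokens if t not in stop_words and t in english_words]

tokens = preprocess("The escape room was GREAT, 10/10! Staff muy friendly.")
```

Here `tokens` holds only the words that survive every filter, in their original order.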
The model correctly matched 51% of the Google website entries to the Yelp reviews associated with the same activity. Given the limited data set, I was happy with the results (the model took only 10 seconds to train). The biggest limitation of this model is its vocabulary: many of the words in non-training samples do not appear in it. Additionally, with limited data, it is especially hard to make predictions on data formatted differently from the training data. In this case, I simply concatenated all the information provided by Google for each website. Ideally, I would attempt this again by training on a wider variety of information and pre-classifying similar activities together. The training set contained 3 escape room activities, so it is no wonder that the model performed poorly on most of them. Also, the descriptions of the lawn on Boston and Cambridge Center Roof Garden are difficult to distinguish, even by hand, once the names are taken out.
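One way to put a number on the vocabulary limitation is to measure the fraction of test tokens that never appeared in training. The sketch below is illustrative only: `oov_rate` is a hypothetical helper name, and the example tokens are made up rather than taken from the project data.

```python
def oov_rate(tokens, vocabulary):
    """Fraction of tokens that are missing from the training vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in vocabulary)
    return missing / len(tokens)

# Made-up example: vocabulary from review text, tokens from a website blurb.
train_vocab = {"escape", "room", "fun"}
site_tokens = ["escape", "room", "tickets", "booking"]
rate = oov_rate(site_tokens, train_vocab)
```

A high rate on the website text would suggest that many of its words carry no signal for a model trained only on reviews.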
The neural network trained on BBC News achieved better performance when classifying activities within the same topics it was trained on, but we did not get a chance to test it on other forms of information. We speculate that much of the improvement comes from the more extensive training data. In the future, we hope to merge these two approaches to achieve both higher resolution in activity description and a larger breadth of activity classification by changing how the data is organized for classification (see below).
In a model attempting to classify activities from people's lives, it will be important to have a time and location stamp to strengthen the grouping of activities that belong together. I propose first collecting a substantial database of journals, along with information relating to the activities of the people who wrote them. I would then group the information by location and time. Next, I would train one model simply to weed out dissimilar data, and then a separate model to recognize similar types of data. Importantly, the approach for cleaning the data and the approach for training the final model will need to be different, and both require more thought.
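The grouping-by-location-and-time step could be sketched as below. This is only one possible reading of the proposal: the record fields (`lat`, `lon`, `time`, `text`), the grid size, the 3-hour window, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

def group_by_place_and_time(records, cell_deg=0.01, window_hours=3):
    """Bucket records that fall in the same grid cell and time window."""
    groups = defaultdict(list)
    for rec in records:
        ts = rec["time"]
        key = (
            round(rec["lat"] / cell_deg),  # latitude grid cell
            round(rec["lon"] / cell_deg),  # longitude grid cell
            ts.date(),                     # calendar day
            ts.hour // window_hours,       # time window within the day
        )
        groups[key].append(rec["text"])
    return dict(groups)

# Hypothetical records: two near the same spot and time, one elsewhere.
records = [
    {"lat": 42.3601, "lon": -71.0589, "time": datetime(2018, 1, 20, 12, 10), "text": "journal entry"},
    {"lat": 42.3603, "lon": -71.0591, "time": datetime(2018, 1, 20, 13, 40), "text": "review"},
    {"lat": 42.3736, "lon": -71.1097, "time": datetime(2018, 1, 20, 12, 30), "text": "website blurb"},
]
groups = group_by_place_and_time(records)
```

Fixed buckets are simple but split records that sit just across a cell or window boundary, so a real pipeline might prefer a distance-based clustering step instead.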