6.S191 Deep Learning Project: Classifying Activities from information of various source types.

jacob-hansen/Multimodal-Activity-Classification

Multimodal-Activity-Classification

1. Initial Data Is Collected or Scraped From Yelp, BBC, or Google Search

Snapshot of Reviews Collected From Yelp

Snapshot of Website Information Collected from Google

2. Text Processing: Tokenized Text

Specifically, I removed all stop words, numbers, punctuation, and non-English words (misspelled words were not corrected, so they were dropped along with the non-English ones). I then tokenized the remaining text by word and stored the tokens in an array.
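
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `tokenize` and the tiny `STOP_WORDS` and `ENGLISH_WORDS` sets are placeholders I am assuming for the example, and a real run would use a full stop-word list and English dictionary.

```python
import re

# Tiny illustrative lists (assumptions for this sketch only).
STOP_WORDS = {"the", "a", "an", "and", "is", "was", "we", "of", "to", "in", "it"}
ENGLISH_WORDS = {"escape", "room", "fun", "puzzles", "great", "staff", "friendly"}

def tokenize(text, stop_words=STOP_WORDS, vocabulary=ENGLISH_WORDS):
    """Lowercase the text, then drop numbers, punctuation, stop words,
    and any word that is not in the English vocabulary."""
    # Keeping only runs of letters discards digits and punctuation.
    words = re.findall(r"[a-z]+", text.lower())
    # Misspelled words are not corrected; they fail the vocabulary check.
    return [w for w in words if w not in stop_words and w in vocabulary]

tokens = tokenize("The escape room was GREAT fun, 10/10! Staff: muy friendly.")
print(tokens)  # ['escape', 'room', 'great', 'fun', 'staff', 'friendly']
```

Filtering against a fixed vocabulary is simple, but it also throws away misspellings and proper nouns such as venue names.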


3. Trained a Gensim Word2Vec Model and Compared Outputs for Associated and Disassociated Content

The model correctly matched 51% of the Google website descriptions to the Yelp reviews for the same activity. Given the limited data set, I was happy with the results (the model took only 10 seconds to train). The biggest limitation of this model is its vocabulary: many of the words in non-training samples are not in it. Additionally, with limited data, it is especially hard to make predictions on data formatted differently from the training data. In this case, I simply concatenated all the information provided by Google for each website.

Ideally, I would attempt this again by training on a wider variety of information and pre-classifying similar activities. The training set contained 3 escape room activities, so it is no wonder that the model performed poorly on most of them. Also, the descriptions of the lawn on Boston and the Cambridge Center roof garden are difficult to distinguish (even by hand, once the names were taken out).
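
The README does not say exactly how the website text was compared with the reviews, so the following is only one plausible sketch: average the word vectors of each document, skip out-of-vocabulary words, and pick the review set with the highest cosine similarity. The helper names and the two-dimensional toy vectors are assumptions for illustration; in practice the vectors would come from the trained Word2Vec model.

```python
import math

def average_vector(tokens, vectors):
    """Mean of the vectors of in-vocabulary tokens, or None if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_match(query_tokens, candidates, vectors):
    """Name of the candidate whose averaged vector is closest to the query."""
    query = average_vector(query_tokens, vectors)
    if query is None:
        return None  # every query word was out of vocabulary
    best_name, best_score = None, -2.0
    for name, tokens in candidates.items():
        candidate = average_vector(tokens, vectors)
        if candidate is None:
            continue
        score = cosine(query, candidate)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Toy 2-D vectors standing in for a trained model's word vectors.
vectors = {
    "puzzle": [1.0, 0.0], "locked": [0.9, 0.1], "clues": [0.8, 0.2],
    "garden": [0.0, 1.0], "flowers": [0.1, 0.9], "roof": [0.2, 0.8],
}
reviews = {
    "escape room": ["puzzle", "locked", "clues"],
    "roof garden": ["garden", "flowers"],
}
website = ["clues", "locked", "booking"]  # "booking" is out of vocabulary
print(best_match(website, reviews, vectors))  # escape room
```

The `None` return for a query made up entirely of unknown words shows the vocabulary limitation directly: with no known words, there is nothing to compare.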

The neural network trained on BBC News achieved better performance when classifying activities within the same topics it was trained on, but we did not get a chance to test it on other forms of information. We speculate that much of the improved performance comes from the more extensive training data. In the future, we hope to merge these two approaches to achieve higher resolution in activity description and a larger breadth of activity classification by changing how the data is organized for classification (see below).

In a model attempting to classify activities from people's lives, it will be important to get a time and location stamp to help strengthen the grouping of activities that belong together. I propose first collecting a substantial database of journals and information relating to the activities of the people who journaled. I would then group the information by location and time, train one model solely to weed out dissimilar data, and train a separate model to recognize similar types of data. Importantly, the approach for cleaning the data and the approach for training the final model will need to be different, and both require more thought.
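
As a rough illustration of the proposed first step, grouping by location and time, the sketch below buckets records into a coarse latitude/longitude grid cell and a fixed time window. Everything here is hypothetical: the record fields (`lat`, `lon`, `time`, `text`), the cell size, and the window length are assumptions chosen for the example, not part of the project.

```python
import math
from collections import defaultdict
from datetime import datetime

def group_key(record, cell_deg=0.01, window_hours=3):
    """Bucket a record into a (lat cell, lon cell, date, time window) key."""
    lat_cell = math.floor(record["lat"] / cell_deg)
    lon_cell = math.floor(record["lon"] / cell_deg)
    ts = record["time"]
    return (lat_cell, lon_cell, ts.date(), ts.hour // window_hours)

def group_records(records, cell_deg=0.01, window_hours=3):
    """Group records that fall in the same grid cell and time window."""
    groups = defaultdict(list)
    for record in records:
        groups[group_key(record, cell_deg, window_hours)].append(record)
    return dict(groups)

records = [
    {"text": "journal entry about an escape room",
     "lat": 42.3601, "lon": -71.0589, "time": datetime(2022, 3, 5, 14, 10)},
    {"text": "photo caption from the same outing",
     "lat": 42.3605, "lon": -71.0585, "time": datetime(2022, 3, 5, 13, 40)},
    {"text": "note from a roof garden visit",
     "lat": 42.3736, "lon": -71.1097, "time": datetime(2022, 3, 5, 14, 30)},
]
groups = group_records(records)
for key, members in groups.items():
    print(key, [m["text"] for m in members])
```

Fixed buckets are easy to reason about, but two records on either side of a cell or window boundary end up in different groups; a distance-based clustering step would avoid that at the cost of more tuning.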
