The aim of this project was to correctly predict the emotion of each sentence in this dataset: https://www.kaggle.com/datasets/praveengovi/emotions-dataset-for-nlp.
Train, validation and test datasets with both sentences and emotion labels (sadness, anger, love, surprise, fear, joy) were provided.
In this repository there are the following notebooks:
- Preprocessing_EDA_LSTM.ipynb is the notebook I used to preprocess the data, perform exploratory data analysis, and train a model with an LSTM layer. 100-dimensional GloVe embedding vectors developed by Stanford University were used to encode the words in the datasets.
- EDA_LSTM_50_encodings.ipynb and EDA_LSTM_200_encodings.ipynb are similar to the previous one, but use 50- and 200-dimensional embedding vectors respectively.
- LSTM_Conv1d.ipynb where I combined an LSTM layer with a Conv1D layer, using 100-dimensional embedding vectors.
- LSTM_LSTM.ipynb, which is a test with two stacked LSTM layers.
- Bidirectional_LSTM_Conv1d.ipynb where I used a bidirectional LSTM with a Conv1D layer stacked on top of it.
- Pretrained_BERT.ipynb in which I applied BERT.
- Pretrained_Bert_stopword_lemmatizer.ipynb in which I applied stopword removal and lemmatization followed by BERT. It was the model with the highest accuracy.
- Sentiment_prediction.ipynb is a notebook that predicts the class of a single sentence given by the user as input.
In all the notebooks used for training, I stacked a fully connected neural network on top of these layers, with a 6-neuron output layer and a variable number of hidden layers, neurons and dropout.
I used the Google Colab GPU to train all the models except the BERT with stopwords and lemmatizer and LSTM_Conv1d.ipynb, for which I used my local GPU.
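As an illustration, here is a minimal Keras sketch of one of these LSTM models (an embedding layer, an LSTM layer and the fully connected head described above); the layer sizes, dropout rate and loss are assumptions, not the exact values used in the notebooks:

```python
from tensorflow.keras import layers, models

# Illustrative sizes (assumptions, not the exact values used in the notebooks)
vocab_size, embedding_dim, max_len = 15000, 100, 43

model = models.Sequential([
    layers.Input(shape=(max_len,)),
    # Embedding layer, later initialized with the GloVe weight matrix
    layers.Embedding(vocab_size, embedding_dim),
    layers.LSTM(64),
    # Fully connected head: hidden layers with dropout and a 6-neuron output layer
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(6, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()
```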
The embeddings were downloaded from https://nlp.stanford.edu/projects/glove/, transformed into a dictionary and saved, so that they can be loaded much faster afterwards. The same was done with the 50- and 200-dimensional embeddings.
Unfortunately, it wasn't possible to store the GloVe embeddings in the data folder since they take up more than 100 MB, but they can be downloaded from the website.
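A minimal sketch of this conversion, assuming the standard glove.6B.100d.txt file and a pickle file as the saved format (the actual file names and format used in the notebooks may differ):

```python
import pickle
import numpy as np

# Parse the raw GloVe text file into a {word: vector} dictionary
embeddings = {}
with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        word, *values = line.split()
        embeddings[word] = np.asarray(values, dtype='float32')

# Save the dictionary so later runs can load it quickly
with open('glove_100d.pkl', 'wb') as f:
    pickle.dump(embeddings, f)

# Fast reload
with open('glove_100d.pkl', 'rb') as f:
    embeddings = pickle.load(f)
```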
The datasets were then loaded, and the words were tokenized and padded to a fixed maximum length.
Furthermore, labels were encoded based on the train data with scikit-learn's `LabelEncoder()`, and a matrix of GloVe weights was built from the words present in the train dataset.
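A sketch of these preprocessing steps, assuming the sentences and labels are held in train_sentences and train_labels (illustrative names) and using the Keras Tokenizer() mentioned further below:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = 43          # assumed fixed maximum length
embedding_dim = 100

# Tokenize the training sentences and pad them to a fixed length
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_sentences)
X_train = pad_sequences(tokenizer.texts_to_sequences(train_sentences), maxlen=max_len)

# Encode the string labels (sadness, anger, ...) as integers
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_labels)

# Build a GloVe weight matrix for the words seen in the train dataset;
# 'embeddings' is the dictionary created in the previous sketch
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, embedding_dim))
for word, idx in tokenizer.word_index.items():
    vector = embeddings.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector
```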
For the BERT models, the words were tokenized with the `AutoTokenizer` class from the `transformers` library, using the `from_pretrained()` method with `bert-base-cased` as argument. This is because the model expects 2 input features (`input_ids` and `attention_mask`) that can be obtained with this tokenizer.
The tokenizer input consisted of words that were lemmatized (with the `WordNetLemmatizer()` class) and from which stopwords were removed. Both the lemmatizer and the stopword list were downloaded from the NLTK library.
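A sketch of this pipeline, assuming the NLTK resources have already been downloaded and a fixed sequence length of 43 (the exact cleaning steps in the notebooks may differ):

```python
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from transformers import AutoTokenizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

def preprocess(sentence, max_len=43):
    # Lemmatize every word and drop the stopwords
    words = [lemmatizer.lemmatize(w) for w in sentence.split() if w.lower() not in stop_words]
    # Tokenize for BERT: returns the "input_ids" and "attention_mask" features
    return bert_tokenizer(' '.join(words), padding='max_length',
                          truncation=True, max_length=max_len, return_tensors='tf')

encoded = preprocess("I am feeling really happy today")
print(encoded['input_ids'].shape, encoded['attention_mask'].shape)
```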
For the LSTM and the other tested variations, the `Tokenizer()` class from `keras.preprocessing.text` was used instead.
For this sentiment analysis problem, 2 types of graphs were plotted.
The first one is a word cloud, generated with the `wordcloud` library.
It is clear that 'feel' and 'feeling' are the most common words in all three datasets. This is because they are the main verbs used to describe the majority of sentiments among the 6 classes.
Below is the code used to plot the 3 graphs:
```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

def plot_cloud(wordcloud, intt, dataset):
    axes[intt].set_title('Word Cloud ' + dataset + ' dataset', size=19, y=1.04)
    axes[intt].imshow(wordcloud)
    axes[intt].axis("off")  # no axis details

# b_train, b_val and b_test hold the text of each dataset as a single string
fig, axes = plt.subplots(3, 1, figsize=(25, 41), sharey=True)
# collocations=False keeps bigrams out, so the same word is not shown twice
wordcloud = WordCloud(width=600, height=600, background_color='White', max_words=1000,
                      repeat=False, min_font_size=5, collocations=False).generate(b_train)
plot_cloud(wordcloud, 0, 'train')
wordcloud = WordCloud(width=600, height=600, background_color='White', max_words=1000,
                      repeat=False, min_font_size=5, collocations=False).generate(b_val)
plot_cloud(wordcloud, 1, 'validation')
wordcloud = WordCloud(width=600, height=600, background_color='White', max_words=1000,
                      repeat=False, min_font_size=5, collocations=False).generate(b_test)
plot_cloud(wordcloud, 2, 'test')
```
The second type of plot I coded was a barplot showing the number of sentences for each label. It reveals that the dataset is imbalanced, with the 'sadness' and 'joy' labels dominating the others.
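A minimal sketch of such a barplot, assuming the train labels are available as a list or Series called train_labels (the notebook may plot it differently):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Count how many sentences belong to each emotion label
counts = pd.Series(train_labels).value_counts()
counts.plot(kind='bar', figsize=(8, 5), title='Number of sentences per label')
plt.xlabel('Emotion')
plt.ylabel('Number of sentences')
plt.show()
```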
The table below summarizes, for each model, the tokenizer, the number of total and trainable parameters, and the accuracy on the validation dataset.
Model | Tokenizer | # total params | # trainable params | Validation accuracy (%) |
---|---|---|---|---|
LSTM 100 encodings | Tokenizer() | 1,478,086 | 59,586 | 84.55 |
LSTM 200 encodings | Tokenizer() | 3,041,058 | 204,058 | 83.80 |
LSTM 50 encodings | Tokenizer() | 753,836 | 44,586 | 80.90 |
LSTM conv1D 100 encodings | Tokenizer() | 1,473,970 | 55,470 | 83.90 |
LSTM-LSTM 100 encodings | Tokenizer() | 1,498,046 | 79,546 | 84.00 |
BERT | AutoTokenizer.from_pretrained('bert-base-cased') | 108,420,460 | 108,420,460 | 85.55 |
BERT stopword-lemmatizer | AutoTokenizer.from_pretrained('bert-base-cased') | 108,420,460 | 108,420,460 | 93.75 |
Bidirectional LSTM-Conv1D | Tokenizer() | 1,500,257 | 81,757 | 81.60 |
The highest accuracy was obtained with BERT combined with stopword removal and lemmatization.
The model is a pretrained model provided by Hugging Face, which I fetched with the TensorFlow method `TFBertModel.from_pretrained('bert-base-cased')`. The inputs of the BERT model were the 2 features obtained with the already mentioned tokenizer (`AutoTokenizer.from_pretrained('bert-base-cased')`).
Below, I report the details of this model, the optimizer and the training epochs.
BERT with stopword and lemmatizer - Model
Layer (type) | Output Shape | Param # | Connected to |
---|---|---|---|
input_ids (InputLayer) | [(None, 43)] | 0 | |
attention_mask (InputLayer) | [(None, 43)] | 0 | |
tf_bert_model_1 (TFBertModel) | TFBaseModelOutput | 108310272 | input_ids[0][0], attention_mask[0][0] |
global_max_pooling1d_1 (GlobalMaxPooling1D) | (None, 768) | 0 | tf_bert_model_1[1][0] |
dense_3 (Dense) | (None, 138) | 106122 | global_max_pooling1d_1[0][0] |
dropout_75 (Dropout) | (None, 138) | 0 | dense_3[0][0] |
dense_4 (Dense) | (None, 28) | 3892 | dropout_75[0][0] |
dense_5 (Dense) | (None, 6) | 174 | dense_4[0][0] |
BERT with stopword and lemmatizer - Optimizer
```
{'name': 'Adam', 'clipnorm': 1.0, 'learning_rate': 5e-05, 'decay': 0.01, 'beta_1': 0.9, 'beta_2': 0.999, 'epsilon': 1e-08, 'amsgrad': False}
```
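Based on the layer summary and optimizer configuration above, here is a sketch of how this model could be assembled and compiled in TensorFlow/Keras; the dropout rate and the loss function are assumptions (integer labels from LabelEncoder are assumed), and the legacy 'decay' setting from the config is only noted in a comment:

```python
import tensorflow as tf
from transformers import TFBertModel

max_len = 43  # sequence length from the layer summary above

# The two inputs expected by BERT
input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name='input_ids')
attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name='attention_mask')

# Pretrained BERT backbone from Hugging Face
bert = TFBertModel.from_pretrained('bert-base-cased')
sequence_output = bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state

# Pool over the sequence dimension, then apply the fully connected head from the summary
x = tf.keras.layers.GlobalMaxPooling1D()(sequence_output)  # (None, 768)
x = tf.keras.layers.Dense(138, activation='relu')(x)
x = tf.keras.layers.Dropout(0.5)(x)                        # dropout rate is an assumption
x = tf.keras.layers.Dense(28, activation='relu')(x)
outputs = tf.keras.layers.Dense(6, activation='softmax')(x)

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=outputs)

# Optimizer settings taken from the configuration above
# ('decay': 0.01 corresponds to the legacy learning-rate decay argument and is omitted here)
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5, clipnorm=1.0,
                                     beta_1=0.9, beta_2=0.999, epsilon=1e-8)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```

The Dense layer sizes (138, 28 and 6 neurons) match the parameter counts in the layer summary.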
BERT with stopword and lemmatizer - Training and validation accuracy
```
Epoch 1/3
1334/1334 [==============================] - 959s 698ms/step - loss: 0.3806 - accuracy: 0.8639 - val_loss: 0.1853 - val_accuracy: 0.9315
Epoch 2/3
1334/1334 [==============================] - 924s 693ms/step - loss: 0.1362 - accuracy: 0.9416 - val_loss: 0.1656 - val_accuracy: 0.9305
Epoch 3/3
1334/1334 [==============================] - 933s 700ms/step - loss: 0.1113 - accuracy: 0.9482 - val_loss: 0.1550 - val_accuracy: 0.9375
```
The model reached a 93.75% validation accuracy and a 94.84% accuracy on the train dataset. On the test dataset, the model reached an accuracy of 92.95%, with a loss of 0.1699.
In the Sentiment_prediction.ipynb notebook, I applied the trained model, in a more compact form, to a new sentence defined by the user. When the script is run, it automatically applies the preprocessing steps and the evaluation, and outputs the predicted class of the sentence.
Here is an example of an input sentence along with the prediction given by the model:
```
In [2]: y = input()
I am astonished by what you've accomplished   # input text
Out [3]: 'surprise'   # class prediction
```
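A sketch of what such a prediction step might look like, reusing the preprocessing function, the trained model and the fitted LabelEncoder from the sketches above (these names are assumptions, not necessarily the ones used in the notebook):

```python
import numpy as np

def predict_emotion(sentence):
    # Apply the same stopword removal, lemmatization and BERT tokenization used in training
    encoded = preprocess(sentence)
    # Run the trained model and pick the most probable class
    probs = model.predict({'input_ids': encoded['input_ids'],
                           'attention_mask': encoded['attention_mask']})
    return label_encoder.inverse_transform([np.argmax(probs)])[0]

print(predict_emotion("I am astonished by what you've accomplished"))  # e.g. 'surprise'
```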