Course Project submission for the course CS6910 Fundamentals of Deep Learning.
Check this link for the task description: Problem Statement link
Check this link for the report : Report link
Team Members : Vamsi Sai Krishna Malineni (OE20S302), Mohammed Safi Ur Rahman Khan (CS21M035)
The purposes of this course project are:
- Building a transliteration system using Recurrent Neural Networks.
- Comparing different cells such as vanilla RNN, LSTM and GRU.
- Implementing an attention mechanism and understanding how it overcomes the limitations of vanilla seq2seq models.
- Fine-tuning the GPT2 transformer model to generate lyrics based on a given prompt.
- Install the required libraries using the following command :
pip install -r requirements.txt
The code is presented in the following notebooks:
- RNN.ipynb : training a Seq2Seq model without attention
- RNN_with_Attention.ipynb : training a Seq2Seq model with attention, visualizing the connectivity, and plotting the attention maps
- Lyrics_Generation.ipynb : fine-tuning the GPT2 transformer model for lyrics generation

Along with the Jupyter notebooks, we have also provided Python code (.py) files. These contain the code to directly train and test the models in a non-interactive way.
To see the outputs, the various explanations, and how the code was developed, please check the Jupyter notebooks. To train and test the models from the command line, run the (.py) files by following the instructions given in the sections below.
If you are running the Jupyter notebooks on Colab, the libraries from the requirements.txt file are preinstalled, with the exception of the following: wandb, transformers, and datasets. You can install them using the following commands:
!pip install wandb
!pip install transformers
!pip install datasets
- The dataset for the RNN part of the assignment can be found at : RNN Dataset Link
- The dataset for the Transformers part of the assignment can be found at : Transformer Dataset Link
As mentioned earlier, there are two files: a Jupyter notebook and a Python code file.
The Jupyter notebook still has its outputs intact, so it can be used for reference.
The Python file has all the functions and code used in the Jupyter notebook (along with some additional code that can be used to run it from the command line).
The Python file can be run from the terminal by passing the various command line arguments. Please make sure that the three dataset files are present in the same directory as this Python file.
There are two modes of running this file:
1. Running the hyperparameter sweeps using wandb
python rnn_without_attention.py --sweep yes
The code will now run in sweep mode, enable wandb integration, and log all the data to wandb. Make sure you have wandb installed if you want to run in this mode. Also, change the entity name and project name in the code before running in this mode.
2. Running in normal mode
python rnn_without_attention.py --sweep XXX --cell XXX --embedSize XXX --dropout XXX --numLayers XXX --hiddenLayerSize XXX --numEpochs XXX --batchSize XXX --optimizer XXX
Replace XXX above with the appropriate values for the parameters you want to train the model with.
For example:
python rnn_without_attention.py --sweep no --cell LSTM --embedSize 64 --dropout 0.2 --numLayers 1 --hiddenLayerSize 32 --numEpochs 2 --batchSize 32 --optimizer adam
Description of the various command line arguments:
- --sweep : Do you want to sweep or not: enter 'yes' or 'no'. If this is 'yes', the arguments below are not required. Enter the arguments below only if this is 'no'.
- --cell : Cell type: LSTM or RNN or GRU
- --embedSize : Input embedding size: integer value
- --dropout : Dropout: float value
- --numLayers : Number of encoder/decoder layers: integer value
- --hiddenLayerSize : Hidden units in the cell: integer value
- --numEpochs : Number of epochs: integer value
- --batchSize : Batch size: integer value
- --optimizer : Optimizer function: adam or nadam or rmsprop
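For reference, the flags above could be parsed with argparse roughly as in the sketch below. This is only an illustration of the interface; the actual argument handling lives inside rnn_without_attention.py and may differ (for example in default values).

```python
import argparse

# Hypothetical sketch of parsing the command line flags described above.
# Default values here are illustrative, not the ones used in the script.
parser = argparse.ArgumentParser(description="Train a seq2seq transliteration model")
parser.add_argument("--sweep", choices=["yes", "no"], default="no",
                    help="Run a wandb hyperparameter sweep instead of a single training run")
parser.add_argument("--cell", choices=["RNN", "LSTM", "GRU"], default="LSTM")
parser.add_argument("--embedSize", type=int, default=64)
parser.add_argument("--dropout", type=float, default=0.2)
parser.add_argument("--numLayers", type=int, default=1)
parser.add_argument("--hiddenLayerSize", type=int, default=32)
parser.add_argument("--numEpochs", type=int, default=10)
parser.add_argument("--batchSize", type=int, default=32)
parser.add_argument("--optimizer", choices=["adam", "nadam", "rmsprop"], default="adam")
args = parser.parse_args()
```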
The Jupyter notebook can be run in a sequential manner, i.e., one cell at a time. This notebook also has the code for plotting the various images required for the assignment.
The training and validation data in encoded form is obtained using the following code snippet:
encoder_input_data, decoder_input_data, decoder_target_data, encoder_val_input_data, decoder_val_input_data, decoder_val_target_data = one_hot_encoding(input, target, val_input, val_target, input_tokens, target_tokens)
where:
- input : array of words of the input language (i.e., English) present in the training dataset
- target : array of words of the target language (i.e., Telugu) present in the training dataset
- val_input : array of words of the input language (i.e., English) present in the validation dataset
- val_target : array of words of the target language (i.e., Telugu) present in the validation dataset
- input_tokens : list of characters in the input language
- target_tokens : list of characters in the target language
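The exact encoding logic lives in one_hot_encoding; as a rough sketch, character-level one-hot encoding of a single set of words could look like the snippet below (the helper name encode_words and the max_len argument are illustrative, not the actual implementation):

```python
import numpy as np

# Hypothetical sketch of character-level one-hot encoding for one array of words.
# tokens is the character set of the language, max_len the longest word length.
def encode_words(words, tokens, max_len):
    token_index = {ch: i for i, ch in enumerate(tokens)}
    data = np.zeros((len(words), max_len, len(tokens)), dtype="float32")
    for i, word in enumerate(words):
        for t, ch in enumerate(word):
            data[i, t, token_index[ch]] = 1.0  # mark the character present at time step t
    return data
```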
The following function is used to build the RNN model:
build_model(num_encoders, num_decoders, cell, embed_size, dropout, hidden_layer_size)
where:
- num_encoders : number of encoder layers
- num_decoders : number of decoder layers
- cell : type of the RNN cell (SimpleRNN, LSTM, or GRU)
- embed_size : size of the embedding
- dropout : dropout fraction (in decimals)
- hidden_layer_size : dimensionality of the output space
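For intuition, a heavily stripped-down version of such a model (a single LSTM layer on each side, no dropout or embedding options) might look like the sketch below; the actual build_model supports stacked layers and all three cell types:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Minimal sketch of a one-layer character-level seq2seq model (LSTM only).
# num_encoder_tokens / num_decoder_tokens are the input / target character vocabulary sizes.
def build_simple_seq2seq(num_encoder_tokens, num_decoder_tokens, hidden_layer_size):
    # Encoder: reads the one-hot encoded source characters and keeps its final states.
    encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
    _, state_h, state_c = layers.LSTM(hidden_layer_size, return_state=True)(encoder_inputs)

    # Decoder: reads the target sequence (teacher forcing), initialised with the encoder states.
    decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))
    decoder_outputs = layers.LSTM(hidden_layer_size, return_sequences=True)(
        decoder_inputs, initial_state=[state_h, state_c])
    decoder_outputs = layers.Dense(num_decoder_tokens, activation="softmax")(decoder_outputs)

    return keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
```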
Use the following function to perform hyperparameter sweeps in wandb:
sweeper(project_name,entity_name)
where:
- project_name : enter the wandb project name
- entity_name : enter the wandb entity name
The various hyperparameters used are :
hyperparameters = {
"cell": {"values": ["RNN","GRU","LSTM"]},
"embed_size": {"values": [16,32,64,256]},
"hidden_layer_size": {"values": [16,32,64,256]},
"num_layers": {"values": [1,2,3]},
"dropout": {"values": [0.2,0.3,0.4]},
"epochs": {"values": [5,10,15,20]},
"batch_size": {"values": [32,64]},
"optimizer": {"values": ["adam", "rmsprop", "nadam"]}
}
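The sweeper function presumably registers a sweep with this search space and launches an agent; a minimal sketch is given below (the sweep method, metric name, and the train() placeholder are assumptions, not the actual configuration in the code):

```python
import wandb

def train():
    # Placeholder for the per-run training function; the real one builds the model
    # from wandb.config and fits it on the training data.
    wandb.init()

def sweeper(project_name, entity_name):
    sweep_config = {
        "method": "bayes",  # assumption; could also be "grid" or "random"
        "metric": {"name": "val_accuracy", "goal": "maximize"},  # assumed metric name
        "parameters": {
            "cell": {"values": ["RNN", "GRU", "LSTM"]},
            "embed_size": {"values": [16, 32, 64, 256]},
            "hidden_layer_size": {"values": [16, 32, 64, 256]},
            "num_layers": {"values": [1, 2, 3]},
            "dropout": {"values": [0.2, 0.3, 0.4]},
            "epochs": {"values": [5, 10, 15, 20]},
            "batch_size": {"values": [32, 64]},
            "optimizer": {"values": ["adam", "rmsprop", "nadam"]},
        },
    }
    sweep_id = wandb.sweep(sweep_config, project=project_name, entity=entity_name)
    wandb.agent(sweep_id, function=train)
```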
The test data in encoded form is obtained using the following code snippet:
encoder_test_input_data, decoder_test_input_data, decoder_test_target_data = one_hot_encoding_test(test_input, test_target, input_tokens, target_tokens)
where:
- test_input : array of words of the input language (i.e., English) present in the test dataset
- test_target : array of words of the target language (i.e., Telugu) present in the test dataset
- input_tokens : list of characters in the input language
- target_tokens : list of characters in the target language
The best-performing model from the hyperparameter sweeps is fed with the test data. Use the following code snippet to find the test accuracy:
testing()
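The testing routine presumably decodes each test word and compares it with the reference transliteration. A sketch of such an exact-match (word-level) accuracy computation, assuming a decode_sequence-style inference function (name illustrative), is:

```python
# Hypothetical sketch of word-level (exact match) test accuracy.
# decode_sequence() stands for the inference routine defined in the code, which maps
# one encoded input word to a predicted target-language string.
def word_level_accuracy(encoder_test_input_data, test_target, decode_sequence):
    correct = 0
    for i, reference in enumerate(test_target):
        prediction = decode_sequence(encoder_test_input_data[i : i + 1])
        if prediction.strip() == reference.strip():
            correct += 1
    return correct / len(test_target)
```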
As mentioned earlier, there are two files: a Jupyter notebook and a Python code file.
The Jupyter notebook still has its outputs intact, so it can be used for reference.
The Python file has all the functions and code used in the Jupyter notebook (along with some additional code that can be used to run it from the command line).
The Python file can be run from the terminal by passing the various command line arguments. Please make sure that the three dataset files are present in the same directory as this Python file.
There are two modes of running this file:
1. Running the hyperparameter sweeps using wandb
python rnn_with_attention.py --sweep yes
The code will now run in sweep mode, enable wandb integration, and log all the data to wandb. Make sure you have wandb installed if you want to run in this mode. Also, change the entity name and project name in the code before running in this mode.
2. Running in normal mode
python rnn_with_attention.py --sweep XXX --cell XXX --embedSize XXX --dropout XXX --numLayers XXX --hiddenLayerSize XXX --numEpochs XXX --batchSize XXX --optimizer XXX
Replace XXX above with the appropriate values for the parameters you want to train the model with.
For example:
python rnn_with_attention.py --sweep no --cell LSTM --embedSize 64 --dropout 0.2 --numLayers 1 --hiddenLayerSize 32 --numEpochs 2 --batchSize 32 --optimizer adam
Description of the various command line arguments:
- --sweep : Do you want to sweep or not: enter 'yes' or 'no'. If this is 'yes', the arguments below are not required. Enter the arguments below only if this is 'no'.
- --cell : Cell type: LSTM or RNN or GRU
- --embedSize : Input embedding size: integer value
- --dropout : Dropout: float value
- --numLayers : Number of encoder/decoder layers: integer value
- --hiddenLayerSize : Hidden units in the cell: integer value
- --numEpochs : Number of epochs: integer value
- --batchSize : Batch size: integer value
- --optimizer : Optimizer function: adam or nadam or rmsprop
The Jupyter notebook can be run in a sequential manner, i.e., one cell at a time. This notebook also has the code for plotting the various images required for the assignment.
The training and validation data in encoded form is obtained using the following code snippet:
encoder_input_data, decoder_input_data, decoder_target_data, encoder_val_input_data, decoder_val_input_data, decoder_val_target_data = one_hot_encoding(input, target, val_input, val_target, input_tokens, target_tokens)
where:
- input : array of words of the input language (i.e., English) present in the training dataset
- target : array of words of the target language (i.e., Telugu) present in the training dataset
- val_input : array of words of the input language (i.e., English) present in the validation dataset
- val_target : array of words of the target language (i.e., Telugu) present in the validation dataset
- input_tokens : list of characters in the input language
- target_tokens : list of characters in the target language
The following function is used to build the RNN model:
build_model(num_encoders, num_decoders, cell, embed_size, dropout, hidden_layer_size)
where:
- num_encoders : number of encoder layers
- num_decoders : number of decoder layers
- cell : type of the RNN cell (SimpleRNN, LSTM, or GRU)
- embed_size : size of the embedding
- dropout : dropout fraction (in decimals)
- hidden_layer_size : dimensionality of the output space
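For intuition, the sketch below shows one minimal way of wiring attention into a one-layer encoder-decoder, using the Keras AdditiveAttention layer; the actual build_model in this file may implement attention differently (stacked layers, other cell types, a custom attention layer):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Minimal sketch of a one-layer seq2seq model with additive (Bahdanau-style) attention.
def build_simple_attention_seq2seq(num_encoder_tokens, num_decoder_tokens, hidden_layer_size):
    # Encoder: keep the full sequence of hidden states so the decoder can attend over them.
    encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
    encoder_outputs, state_h, state_c = layers.LSTM(
        hidden_layer_size, return_sequences=True, return_state=True)(encoder_inputs)

    # Decoder: initialised with the encoder's final states (teacher forcing on the inputs).
    decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))
    decoder_outputs = layers.LSTM(hidden_layer_size, return_sequences=True)(
        decoder_inputs, initial_state=[state_h, state_c])

    # Attention: each decoder time step attends over all encoder time steps.
    context = layers.AdditiveAttention()([decoder_outputs, encoder_outputs])
    decoder_combined = layers.Concatenate()([decoder_outputs, context])
    outputs = layers.Dense(num_decoder_tokens, activation="softmax")(decoder_combined)

    return keras.Model([encoder_inputs, decoder_inputs], outputs)
```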
Use the following function to perform hyperparameter sweeps in wandb:
sweeper(project_name,entity_name)
where:
- project_name : enter the wandb project name
- entity_name : enter the wandb entity name
The various hyperparameters used are :
hyperparameters = {
"cell":{ "values":["RNN","GRU","LSTM"]},
"embed_size":{ "values":[16,32,64,256]},
"hidden_layer_size":{"values":[16,32,64,256]},
"num_layers":{"values":[1,2,3]},
"dropout":{"values":[0.2,0.3,0.4]},
"epochs":{"values":[5,10,15,20]},
"batch_size":{"values":[32,64]},
"optimizer":{"values":["adam", "rmsprop", "nadam"]}
}
The test data in encoded form is obtained using the following code snippet:
encoder_test_input_data, decoder_test_input_data, decoder_test_target_data = one_hot_encoding_test(test_input, test_target, input_tokens, target_tokens)
where:
- test_input : array of words of the input language (i.e., English) present in the test dataset
- test_target : array of words of the target language (i.e., Telugu) present in the test dataset
- input_tokens : list of characters in the input language
- target_tokens : list of characters in the target language
The best-performing model from the hyperparameter sweeps is fed with the test data. Use the following code snippet to find the test accuracy:
testing()
The attention maps for the model can be built by running the following section of the ipynb notebook:
Plotting attention maps
The connectivity can be visualized using either of the following functions :
visualize_gif(index)
visualize(index)
where:
- visualize_gif(index) : displays a GIF of the connectivity visualization for the word at the given index
- visualize(index) : displays an image of the connectivity visualization for the word at the given index
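For reference, an attention map for a single word is essentially a heatmap of the attention weights over input characters versus output characters. A rough sketch of such a plot is shown below (the names attention_weights, input_word, and output_word are illustrative; the notebook's plotting code may differ):

```python
import matplotlib.pyplot as plt

# Hypothetical sketch of plotting one attention map as a heatmap.
# attention_weights is assumed to have shape (len(output_word), len(input_word)).
def plot_attention_map(attention_weights, input_word, output_word):
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.imshow(attention_weights, cmap="viridis")
    ax.set_xticks(range(len(input_word)))
    ax.set_xticklabels(list(input_word))
    ax.set_yticks(range(len(output_word)))
    ax.set_yticklabels(list(output_word))
    ax.set_xlabel("Input (English) characters")
    ax.set_ylabel("Output (Telugu) characters")
    plt.show()
```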
The resources used are :
- https://towardsdatascience.com/visualising-lstm-activations-in-keras-b50206da96ff
- https://machinelearningmastery.com/define-encoder-decoder-sequence-sequence-model-neural-machine-translation-keras/
- https://keras.io/examples/nlp/lstm_seq2seq/
The dataset used for fine tuning GPT2 model can be found here : https://www.kaggle.com/neisse/scrapped-lyrics-from-6-genres
This dataset contains lyrics from 79 musical genres (the data is scraped from the website vagalume.com.br).
The notebook is split into the following segments:
- Library imports : imports the required libraries for the task
- Data Preparation : builds the train, test, and validation datasets
- Model Training : trains the model by running the run_clm.py file
- Lyrics Generation : generates lyrics by running the run_generation.py file. The lyrics generation is based on the given prompt: "I love deep learning".
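For orientation, typical invocations of these Hugging Face example scripts look like the commands below. The file names (train.txt, valid.txt), the output directory, and the flag values are placeholders and may differ from what the notebook actually uses:
python run_clm.py --model_name_or_path gpt2 --train_file train.txt --validation_file valid.txt --do_train --do_eval --num_train_epochs 3 --per_device_train_batch_size 2 --output_dir gpt2-lyrics --overwrite_output_dir
python run_generation.py --model_type gpt2 --model_name_or_path gpt2-lyrics --prompt "I love deep learning" --length 200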
The resources used for this question are as follows: