This repo implements a Transformer-based model for Visual Question Answering (VQA): given an input image and a question about that image, the model generates an appropriate answer.
The input questions are passed through an embedding layer and a Transformer encoder, as shown below.
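The following is a minimal sketch of that question-encoding path (embedding plus one Transformer encoder block), assuming a TensorFlow/Keras implementation; the layer names, vocabulary size, and dimensions are illustrative, not the repo's exact code.

```python
import tensorflow as tf

d_model = 512        # assumed model dimension
num_heads = 8        # assumed number of attention heads
vocab_size = 10000   # assumed question vocabulary size
ques_seq_length = 25  # assumed maximum question length

class EncoderBlock(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff=2048):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, x):
        # Self-attention over the question tokens, then a position-wise FFN,
        # each wrapped in a residual connection and layer normalisation.
        attn = self.mha(query=x, value=x, key=x)
        x = self.norm1(x + attn)
        return self.norm2(x + self.ffn(x))

questions = tf.keras.Input(shape=(ques_seq_length,), dtype=tf.int32)
embedded = tf.keras.layers.Embedding(vocab_size, d_model)(questions)
encoded = EncoderBlock(d_model, num_heads)(embedded)  # (batch, ques_seq_length, d_model)
```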
The encoder output, of shape (batch_size, ques_seq_length, d_model), is average-pooled along the sequence (temporal) dimension. The decoder takes this pooled question vector together with the image features (extracted by VGG16 without its top classification layer), applies a Bahdanau attention mechanism over the image features, and passes the result through GRU layers to generate the answer as an output sequence.
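The two decoder inputs could be prepared as in the sketch below: average-pool the encoder output over the sequence dimension, and extract VGG16 convolutional features whose spatial grid is flattened into a sequence of image regions for attention. The input resolution and reshape sizes are assumptions.

```python
import tensorflow as tf

# (batch, ques_seq_length, d_model) -> (batch, d_model)
question_vector = tf.keras.layers.GlobalAveragePooling1D()(encoded)

# VGG16 without its classification head; the 7x7x512 feature map becomes
# a sequence of 49 region vectors for the Bahdanau attention mechanism.
vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
image = tf.keras.Input(shape=(224, 224, 3))
feature_map = vgg(image)                                   # (batch, 7, 7, 512)
image_features = tf.keras.layers.Reshape((49, 512))(feature_map)
```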
The decoder follows the Bahdanau-attention seq2seq decoder used in many text-generation tasks, as shown below:
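A minimal sketch of such a decoder step, in the usual Bahdanau-attention style, is given below; the class and argument names are hypothetical and the exact wiring in this repo may differ.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query: decoder hidden state (batch, units); values: image features
        # (batch, num_regions, depth). Score each region, return a weighted sum.
        query_with_time = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(query_with_time)))
        weights = tf.nn.softmax(score, axis=1)
        context = tf.reduce_sum(weights * values, axis=1)
        return context, weights

class DecoderStep(tf.keras.layers.Layer):
    def __init__(self, answer_vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(answer_vocab_size, embedding_dim)
        self.attention = BahdanauAttention(units)
        self.gru = tf.keras.layers.GRU(units, return_state=True)
        self.fc = tf.keras.layers.Dense(answer_vocab_size)

    def call(self, token, hidden, image_features):
        # Attend over image regions, concatenate the context vector with the
        # embedded previous answer token, and advance the GRU by one step.
        context, _ = self.attention(hidden, image_features)
        x = self.embedding(token)                                  # (batch, 1, emb)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state = self.gru(x, initial_state=hidden)
        return self.fc(output), state
```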
The resulting model achieves accuracy comparable to recent implementations while remaining more lightweight.
To train the model:
$ python3 main.py
The last sections of main.py contain the inference code, which can either be run selectively or imported into another Python file.
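As a rough illustration of what such inference looks like, the hypothetical greedy-decoding loop below uses the components sketched above; main.py's actual function and variable names may differ, and feeding the pooled question vector as the initial GRU state is an assumption.

```python
import tensorflow as tf

def answer_question(image_features, question_vector, decoder_step,
                    start_token, end_token, max_len=10):
    # Greedily decode one answer token at a time until <end> or max_len.
    hidden = question_vector  # assumed: pooled question vector initialises the GRU state
    token = tf.fill([tf.shape(image_features)[0], 1], start_token)
    answer = []
    for _ in range(max_len):
        logits, hidden = decoder_step(token, hidden, image_features)
        token = tf.argmax(logits, axis=-1, output_type=tf.int32)[:, tf.newaxis]
        if int(token[0, 0]) == end_token:
            break
        answer.append(int(token[0, 0]))
    return answer
```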