# VQA_Transformer

This repo implements a Transformer-based model for Visual Question Answering: given an input image and a question about that image, the model generates an appropriate answer.

## Architecture

The input question is passed through an embedding layer and a Transformer encoder, as shown below.

*(figure: Transformer encoder architecture)*

The encoder output, of shape (batch_size, ques_seq_length, d_model), is average-pooled along the temporal dimension. The decoder takes this pooled question vector together with the image features (extracted by VGG16 with its top classification layer removed), passes them through a Bahdanau attention mechanism followed by GRU layers, and produces an output sequence representing the answer to the question.
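A minimal sketch of what this encoder side could look like. It assumes TensorFlow/Keras; the layer choices, dimensions, and variable names are illustrative and not taken from the repo's code.

```python
# Sketch of the encoder side (assumes TensorFlow/Keras; a single simplified
# Transformer block is shown; all sizes are illustrative, not the repo's).
import tensorflow as tf

d_model = 512
vocab_size = 10000
ques_seq_length = 25

# Question branch: embedding -> Transformer encoder block -> average pool over time.
question = tf.keras.Input(shape=(ques_seq_length,), dtype=tf.int32)
x = tf.keras.layers.Embedding(vocab_size, d_model)(question)
attn = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=d_model)(x, x)
x = tf.keras.layers.LayerNormalization()(x + attn)              # (batch, seq, d_model)
ff = tf.keras.layers.Dense(d_model, activation="relu")(x)
x = tf.keras.layers.LayerNormalization()(x + ff)
question_vector = tf.keras.layers.GlobalAveragePooling1D()(x)   # (batch, d_model)

# Image branch: VGG16 without its classification head; the spatial grid is
# flattened into a sequence of locations for the attention mechanism.
image = tf.keras.Input(shape=(224, 224, 3))
vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
feat = vgg(image)                                               # (batch, 7, 7, 512)
image_features = tf.keras.layers.Reshape((49, 512))(feat)       # (batch, 49, 512)
```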

The decoder is based on the Bahdanau-attention seq2seq decoder used in many text generation tasks, as shown below:

*(figure: Bahdanau attention mechanism)*
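A hedged sketch of such a decoder, again assuming TensorFlow/Keras; the class names, signatures, and sizes are illustrative rather than the repo's actual code.

```python
# Sketch of a Bahdanau-attention GRU decoder step (TensorFlow/Keras assumed).
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query: (batch, hidden) decoder state; values: (batch, locations, features)
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(query)[:, tf.newaxis, :]))
        weights = tf.nn.softmax(score, axis=1)             # attention over image locations
        context = tf.reduce_sum(weights * values, axis=1)  # (batch, features)
        return context, weights

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True)
        self.fc = tf.keras.layers.Dense(vocab_size)
        self.attention = BahdanauAttention(units)

    def call(self, token, hidden, features):
        # Attend over image features conditioned on the previous decoder state,
        # then feed [context; embedded token] into the GRU.
        context, _ = self.attention(hidden, features)
        x = self.embedding(token)                           # (batch, 1, embedding_dim)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state = self.gru(x)
        return self.fc(tf.squeeze(output, 1)), state        # logits, new state
```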

The resulting model achieves accuracy comparable to recent implementations while remaining considerably more lightweight.

## Usage

To train the model:

```bash
$ python3 main.py
```

The last sections of main.py contain the inference code, which can either be run selectively or imported into another Python file.
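A hypothetical inference loop is sketched below. The `encoder`, `decoder`, and `tokenizer` objects, the `<start>`/`<end>` tokens, and the greedy decoding strategy are assumptions for illustration and may not match the actual code in main.py.

```python
# Hypothetical greedy-decoding inference sketch; names are assumptions,
# not guaranteed to match main.py.
import tensorflow as tf

def answer_question(image_tensor, question_tokens, encoder, decoder, tokenizer,
                    max_answer_len=10):
    # Encode the question and extract image features (hypothetical encoder API).
    question_vector, image_features = encoder(image_tensor, question_tokens)
    hidden = question_vector                      # seed the decoder state with the question
    token = tf.expand_dims([tokenizer.word_index["<start>"]], 1)
    answer = []
    for _ in range(max_answer_len):
        logits, hidden = decoder(token, hidden, image_features)
        next_id = int(tf.argmax(logits, axis=-1)[0])
        word = tokenizer.index_word.get(next_id, "<unk>")
        if word == "<end>":
            break
        answer.append(word)
        token = tf.expand_dims([next_id], 1)      # feed the prediction back in
    return " ".join(answer)
```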

## Working

Input Image:
*(image: COCO_train2014_000000027511)*

Input Question and Output Answer:
*(screenshot: input question and predicted answer)*