This repo implements a Transformer-based model for Visual Question Answering (VQA): given an input image and a question about that image, the model generates an appropriate answer.
The input questions are passed through an embedding layer and a Transformer encoder, as shown below.
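The following is a minimal sketch of that question-encoding path (embedding plus one Transformer encoder block), assuming a TensorFlow/Keras implementation; the layer names, vocabulary size, and dimensions are illustrative, not the repo's exact code.

```python
import tensorflow as tf

d_model = 512        # assumed model dimension
num_heads = 8        # assumed number of attention heads
vocab_size = 10000   # assumed question vocabulary size
ques_seq_length = 25  # assumed maximum question length

class EncoderBlock(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff=2048):
        super().__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(num_heads=num_heads, key_dim=d_model)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()

    def call(self, x):
        # Self-attention over the question tokens, then a position-wise FFN,
        # each wrapped in a residual connection and layer normalisation.
        attn = self.mha(query=x, value=x, key=x)
        x = self.norm1(x + attn)
        return self.norm2(x + self.ffn(x))

questions = tf.keras.Input(shape=(ques_seq_length,), dtype=tf.int32)
embedded = tf.keras.layers.Embedding(vocab_size, d_model)(questions)
encoded = EncoderBlock(d_model, num_heads)(embedded)  # (batch, ques_seq_length, d_model)
```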
The encoder output, of shape (batch_size, ques_seq_length, d_model), is average-pooled along the sequence (temporal) dimension. The decoder takes this pooled question vector together with the image features (extracted by VGG16 without its top classification layer), applies a Bahdanau attention mechanism over the image features, and passes the result through GRU layers to generate the answer as an output sequence.
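The two decoder inputs could be prepared as in the sketch below: average-pool the encoder output over the sequence dimension, and extract VGG16 convolutional features whose spatial grid is flattened into a sequence of image regions for attention. The input resolution and reshape sizes are assumptions.

```python
import tensorflow as tf

# (batch, ques_seq_length, d_model) -> (batch, d_model)
question_vector = tf.keras.layers.GlobalAveragePooling1D()(encoded)

# VGG16 without its classification head; the 7x7x512 feature map becomes
# a sequence of 49 region vectors for the Bahdanau attention mechanism.
vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
image = tf.keras.Input(shape=(224, 224, 3))
feature_map = vgg(image)                                   # (batch, 7, 7, 512)
image_features = tf.keras.layers.Reshape((49, 512))(feature_map)
```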
The decoder follows the Bahdanau-attention seq2seq decoder used in many text-generation tasks, as shown below:
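A minimal sketch of such a decoder step, in the usual Bahdanau-attention style, is given below; the class and argument names are hypothetical and the exact wiring in this repo may differ.

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query: decoder hidden state (batch, units); values: image features
        # (batch, num_regions, depth). Score each region, return a weighted sum.
        query_with_time = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(query_with_time)))
        weights = tf.nn.softmax(score, axis=1)
        context = tf.reduce_sum(weights * values, axis=1)
        return context, weights

class DecoderStep(tf.keras.layers.Layer):
    def __init__(self, answer_vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(answer_vocab_size, embedding_dim)
        self.attention = BahdanauAttention(units)
        self.gru = tf.keras.layers.GRU(units, return_state=True)
        self.fc = tf.keras.layers.Dense(answer_vocab_size)

    def call(self, token, hidden, image_features):
        # Attend over image regions, concatenate the context vector with the
        # embedded previous answer token, and advance the GRU by one step.
        context, _ = self.attention(hidden, image_features)
        x = self.embedding(token)                                  # (batch, 1, emb)
        x = tf.concat([tf.expand_dims(context, 1), x], axis=-1)
        output, state = self.gru(x, initial_state=hidden)
        return self.fc(output), state
```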
The resulting model achieves accuracy comparable to recent implementations while remaining more lightweight.
To train the model:
$ python3 main.py
The last sections of main.py contain the inference code, which can either be run selectively or imported into another Python file.
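As a rough illustration of what such inference looks like, the hypothetical greedy-decoding loop below uses the components sketched above; main.py's actual function and variable names may differ, and feeding the pooled question vector as the initial GRU state is an assumption.

```python
import tensorflow as tf

def answer_question(image_features, question_vector, decoder_step,
                    start_token, end_token, max_len=10):
    # Greedily decode one answer token at a time until <end> or max_len.
    hidden = question_vector  # assumed: pooled question vector initialises the GRU state
    token = tf.fill([tf.shape(image_features)[0], 1], start_token)
    answer = []
    for _ in range(max_len):
        logits, hidden = decoder_step(token, hidden, image_features)
        token = tf.argmax(logits, axis=-1, output_type=tf.int32)[:, tf.newaxis]
        if int(token[0, 0]) == end_token:
            break
        answer.append(int(token[0, 0]))
    return answer
```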