NLX-GPT

Official Code for NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks
arXiv | video

Gradio web-demo for VQA-X
Gradio web-demo for ACT-X

[NEW] Our new work Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks won an honorable mention award at ICCVW! Check it out and check our new NLE datasets: VQA-ParaX and ImageNetX!

Requirements

PyTorch 1.8 or higher
CLIP (install with pip install git+https://github.com/openai/CLIP.git)
transformers (install with pip install transformers)
accelerate for distributed training (install with pip install git+https://github.com/huggingface/accelerate)
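
To confirm that the installed PyTorch meets the 1.8 minimum before running anything, a small check like the following may help. This is a minimal sketch, not part of the repository: the helper name meets_minimum is hypothetical, and it only compares the major and minor version numbers.

```python
import re

def meets_minimum(version: str, minimum: tuple = (1, 8)) -> bool:
    """Return True if a version string such as '1.13.1+cu117' is >= minimum.

    Hypothetical helper (not from the NLX-GPT repo). Only the leading
    'major.minor' part of the string is compared.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    # Compare as integer tuples so that '1.10' correctly ranks above '1.8'.
    return (int(match.group(1)), int(match.group(2))) >= minimum
```

With PyTorch installed, it could be called as meets_minimum(torch.__version__).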

Images Download

We conduct experiments on four V/VL NLE datasets: VQA-X, ACT-X, e-SNLI-VE and VCR. Our code does not use pre-cached visual features; the features are extracted directly during code execution. Please download the images from the following links into a folder named images in your directory:

VQA-X: COCO train2014 and val2014 images
ACT-X: MPI images. Rename to mpi
e-SNLI-VE: Flickr30K images. Rename to flickr30k
VCR: VCR images. Rename to vcr
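
Once the downloads are done, the layout can be sanity-checked with a short script. The sketch below is an assumption about the layout rather than something taken from the repository: it assumes the COCO images sit in subfolders named train2014 and val2014 inside images, next to the renamed mpi, flickr30k and vcr folders. The function name missing_image_folders is hypothetical.

```python
from pathlib import Path

# Assumed subfolder names inside images/. The mpi, flickr30k and vcr names come
# from the renaming instructions; train2014 and val2014 are an assumption
# based on the standard COCO archive names.
EXPECTED_FOLDERS = ("train2014", "val2014", "mpi", "flickr30k", "vcr")

def missing_image_folders(images_root="images", expected=EXPECTED_FOLDERS):
    """Return the expected subfolder names that do not exist under images_root."""
    root = Path(images_root)
    return [name for name in expected if not (root / name).is_dir()]
```

An empty list from missing_image_folders("images") would mean every expected folder is present; if you only use some of the datasets, pass just the folders you need as expected.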

Annotations Download

We structure the annotations for the NLE datasets. You can download the structured annotations from here: VQA-X, ACT-X, e-SNLI-VE, VCR. Place them in the nle_data/dataset_name/ directory, where dataset_name is one of {VQA-X, ACT-X, eSNLI-VE, VCR}. The pretraining annotations are here. Please also see this issue for clarification on which pretrain annotations to use. If you would rather preprocess the data yourself than download the annotations directly, the code can be found in utils/nle_preprocess.ipynb.
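
The nle_data/dataset_name/ directories can be created ahead of time so the downloaded files have somewhere to go. The following is a minimal sketch under that reading of the layout; make_annotation_dirs is a hypothetical helper, not part of the repository, and it only creates empty folders.

```python
from pathlib import Path

# Directory names as listed above (note eSNLI-VE, without the leading hyphen).
DATASETS = ("VQA-X", "ACT-X", "eSNLI-VE", "VCR")

def make_annotation_dirs(project_root="."):
    """Create nle_data/<dataset_name>/ under project_root and return the paths."""
    created = []
    for name in DATASETS:
        target = Path(project_root) / "nle_data" / name
        # exist_ok=True makes the call safe to repeat on an existing layout.
        target.mkdir(parents=True, exist_ok=True)
        created.append(target)
    return created
```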

To evaluate on NLG metrics, you also need cococaption and the annotations in the format it expects. We use the cococaption python3 toolkit here. Please download it and place the cococaption folder in your directory. The annotations in the correct format can be downloaded here. Please place them in the annotations folder. If you would rather convert the natural language explanation data from the source into the cococaption format yourself than download it directly, the code can be found in utils/preprocess_for_cococaption_eval.ipynb.

You will also need BertScore if you evaluate with it. You can install it with pip install bert_score==0.3.7

Code

One GPU is enough for finetuning on NLE. However, if you wish to do distributed training, please set it up first using accelerate. Note that you can still use accelerate even if you have only one GPU. In your environment's command line, type:

accelerate config

and answer the questions.