Official Code for NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks
arXiv | video
Gradio web-demo for VQA-X
Gradio web-demo for ACT-X
[NEW] Our new work Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks won an honorable mention award at ICCVW! Check it out and check our new NLE datasets: VQA-ParaX and ImageNetX!
- PyTorch 1.8 or higher
- CLIP (install with pip install git+https://github.com/openai/CLIP.git)
- transformers (install with pip install transformers)
- accelerate for distributed training (install with pip install git+https://github.com/huggingface/accelerate)
We conduct experiments on 4 different V/VL NLE Datasets: VQA-X, ACT-X, e-SNLI-VE and VCR. Please download the images into a folder in your directory named images using the following links (our code does not use pre-cached visual features; the features are extracted directly during code execution):
- VQA-X: COCO train2014 and val2014 images
- ACT-X: MPI images. Rename to mpi
- e-SNLI-VE: Flickr30K images. Rename to flickr30k
- VCR: VCR images. Rename to vcr
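Before launching training, it can help to confirm that the image folders are where the code will look for them. The following is a minimal sketch, not part of the repository: the helper name missing_image_dirs is hypothetical, and it assumes that each renamed folder from the list above sits directly inside images/.

```python
from pathlib import Path

# Subfolder names are taken from the list above. Placing them directly
# under images/ is an assumption of this sketch, not something the
# repository code is confirmed to require.
EXPECTED_IMAGE_DIRS = {
    "VQA-X": ["train2014", "val2014"],
    "ACT-X": ["mpi"],
    "e-SNLI-VE": ["flickr30k"],
    "VCR": ["vcr"],
}


def missing_image_dirs(root, datasets=None):
    """Return {dataset: [missing subfolders]} for folders not found under root."""
    root = Path(root)
    selected = datasets if datasets is not None else list(EXPECTED_IMAGE_DIRS)
    missing = {}
    for name in selected:
        absent = [d for d in EXPECTED_IMAGE_DIRS[name] if not (root / d).is_dir()]
        if absent:
            missing[name] = absent
    return missing


if __name__ == "__main__":
    # Report every dataset whose image folders are not in place yet.
    for dataset, dirs in missing_image_dirs("images").items():
        print(f"{dataset}: missing {', '.join(dirs)}")
```

If you only plan to train on one dataset, pass its name in the datasets argument, for example missing_image_dirs("images", ["ACT-X"]).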
We structure the annotations for the NLE datasets. You can download the structured annotations from here: VQA-X, ACT-X, e-SNLI-VE, VCR. Place them in the nle_data/dataset_name/ directory, where dataset_name can be {VQA-X, ACT-X, eSNLI-VE, VCR}. The pretraining annotations are here. Please also see this issue for clarification on which pretraining annotations to use. If you want to preprocess the data yourself rather than downloading the annotations directly, the code can be found in utils/nle_preprocess.ipynb.
You also need cococaption and the annotations in the correct format in order to perform evaluation on NLG metrics.
We use the cococaption python3 toolkit here. Please download it and place the cococaption folder in your directory. The annotations in the correct format can be downloaded here. Please place them in the annotations folder. If you want to manually convert the natural language explanations data from the source to the format that cococaption expects for evaluation, rather than downloading it directly, the code can be found in utils/preprocess_for_cococaption_eval.ipynb.
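For orientation, COCO caption evaluation generally works with a ground-truth file containing images and annotations lists, and a results file that is a list of image_id/caption pairs. The sketch below shows that general shape only. It is not the repository's conversion notebook; the function names and the input dictionaries are hypothetical, and the toolkit version you download may expect additional top-level keys.

```python
import json


def to_coco_ground_truth(references):
    """Build a COCO-caption-style ground-truth dict.

    `references` maps a numeric sample id to a list of reference
    explanations. This input shape is hypothetical.
    """
    images, annotations = [], []
    ann_id = 0
    for image_id, explanations in references.items():
        images.append({"id": image_id})
        for text in explanations:
            # Each reference explanation becomes one caption annotation.
            annotations.append({"image_id": image_id, "caption": text, "id": ann_id})
            ann_id += 1
    return {"images": images, "annotations": annotations}


def to_coco_results(predictions):
    """Build a COCO-caption-style results list from {sample id: generated text}."""
    return [{"image_id": i, "caption": text} for i, text in predictions.items()]


if __name__ == "__main__":
    gt = to_coco_ground_truth({1: ["the sky is blue", "it is daytime"]})
    res = to_coco_results({1: "the sky is clear"})
    print(json.dumps(gt, indent=2))
    print(json.dumps(res, indent=2))
```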
You will also need BertScore if you evaluate using it. You may install with pip install bert_score==0.3.7
One GPU is enough for finetuning on NLE. However, if you wish to do distributed training, please set it up first using accelerate. Note that you can still use accelerate even if you have only one GPU. In your environment command line, type:
accelerate config
and answer the questions.
For VQA-X, please run from the command line with:
accelerate launch vqaX.py
Note: To finetune from the pretrained captioning model, please set the finetune_pretrained flag to True.
For ACT-X, please run from the command line with:
accelerate launch actX.py
Note: To finetune from the pretrained captioning model, please set the finetune_pretrained flag to True.
For e-SNLI-VE, please run from the command line with:
accelerate launch esnlive.py
For e-SNLI-VE with concepts, please run from the command line with:
accelerate launch esnlive_concepts.py
For VCR, please run from the command line with:
accelerate launch vcr.py
This will give you the unfiltered scores. After that, we use BERTScore to filter the incorrect answers and get the filtered scores (see paper Appendix for more details). Since BERTScore takes time to calculate, it is not ideal to run it and filter scores after every epoch. Therefore, we perform this operation once on the epoch with the best unfiltered scores. Please run:
python vcr_filter.py
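The filtering idea can be sketched as follows: score each predicted answer against its reference answer and keep the explanation only when the score clears a threshold. This is an illustrative sketch and not the contents of vcr_filter.py. The similarity function here is a standard-library stand-in rather than BERTScore, and the record keys, the function names, and the threshold value are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in text similarity in [0, 1]. This is NOT BERTScore."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def filter_by_answer(records, score_fn=similarity, threshold=0.9):
    """Keep records whose predicted answer is close enough to the reference.

    Each record is a dict with hypothetical keys "pred_answer",
    "gt_answer" and "explanation". The threshold is an illustrative value.
    """
    return [
        r for r in records
        if score_fn(r["pred_answer"], r["gt_answer"]) >= threshold
    ]


if __name__ == "__main__":
    records = [
        {"pred_answer": "he is hungry", "gt_answer": "he is hungry",
         "explanation": "he is looking at the food"},
        {"pred_answer": "she wants to leave the room", "gt_answer": "he is hungry",
         "explanation": "she is walking to the door"},
    ]
    for kept in filter_by_answer(records):
        print(kept["explanation"])
```

To approximate the real procedure, score_fn could be swapped for a function that wraps a BERTScore computation, which is why the scorer is passed in as a parameter.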
All models can be downloaded from the links below:
- Pretrained Model on Image Captioning: link
- VQA-X (w/o pretraining): link
- VQA-X (w/ pretraining): link
- ACT-X (w/o pretraining): link
- ACT-X (w/ pretraining): link
- Concept Head + Wordmap (used in e-SNLI-VE w/ concepts): link
- e-SNLI-VE (w/o concepts): link
- e-SNLI-VE (w/ concepts): link
- VCR: link
Note: Place the concept model and its wordmap in a folder: pretrained_model/
The output results (generated text) on the test dataset can be downloaded from the links below. The file name suffixes mean the following:
- _filtered: the file contains only the explanations for which the predicted answer is correct.
- _unfiltered: all the explanations are included, regardless of whether the predicted answer is correct or not.
- _full: the full output prediction (including the answer + explanation).
- _exp: the explanation part only.
All evaluation is performed on _exp.
See section 4 of the paper for more details.
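The relationship between the four suffixes can be illustrated with a short sketch. It assumes that a full prediction has the form "answer because explanation" and that correctness is decided by exact match against a single reference answer. Both are simplifying assumptions of this example, and the function names are hypothetical rather than taken from the repository.

```python
def split_full_output(full, separator=" because "):
    """Split a full prediction into (answer, explanation).

    The " because " separator is an assumption of this sketch.
    """
    answer, found, explanation = full.partition(separator)
    if not found:
        # No separator: treat the whole string as the answer.
        return full.strip(), ""
    return answer.strip(), explanation.strip()


def select_explanations(full_outputs, gt_answers, filtered):
    """Return {sample id: explanation}, i.e. the _exp view of the outputs.

    With filtered=True, only samples whose predicted answer matches the
    reference answer are kept (the _filtered case). With filtered=False,
    every sample is kept (the _unfiltered case).
    """
    selected = {}
    for key, full in full_outputs.items():
        answer, explanation = split_full_output(full)
        if filtered and answer != gt_answers[key]:
            continue
        selected[key] = explanation
    return selected


if __name__ == "__main__":
    full = {1: "yes because the sky is blue", 2: "no because it is dark"}
    gt = {1: "yes", 2: "yes"}
    print(select_explanations(full, gt, filtered=True))
    print(select_explanations(full, gt, filtered=False))
```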
- VQA-X (w/o pretraining): link
- VQA-X (w/ pretraining): link
- ACT-X (w/o pretraining): link
- ACT-X (w/ pretraining): link
- e-SNLI-VE (w/o concepts): link
- e-SNLI-VE (w/ concepts): link
- VCR: link
Please note that for VCR, the qualitative results shown on page 4 of the appendix may not correspond exactly to the results and pretrained model in the links above. We trained several models and randomly picked one for presenting the qualitative results.
Please see the explain_predict and retrieval_attack folders.