ViLBERT

ViLBERT_beta has been deprecated. Please see vilbert-multi-task, which includes the implementation for 12-in-1: Multi-Task Vision and Language Representation Learning.

Code and pre-trained models for ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks.

*Note: This codebase is still in beta release, intended to replicate the paper's performance.*

Repository Setup

  1. Create a fresh conda environment and install all dependencies:
conda create -n vilbert python=3.6
conda activate vilbert
git clone https://github.com/jiasenlu/vilbert_beta
cd vilbert_beta
pip install -r requirements.txt
  2. Install PyTorch:
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
  3. Install apex, following https://github.com/NVIDIA/apex.

  4. Compile the refer tools:

cd tools/refer
make
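
Once the steps above finish, a quick sanity check (a minimal sketch, assuming the vilbert conda environment is active) confirms that PyTorch sees a GPU and that apex imports cleanly:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
python -c "from apex import amp; print('apex OK')"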

Data Setup

Check the README.md under data/ for details on downloading and preparing the datasets, and vlbert_tasks.yml for per-task configuration.
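
For orientation, a task entry in vlbert_tasks.yml looks roughly like the sketch below. Only features_h5path1 and val_annotations_jsonpath are named in this README; the TASK3 key, the other field names, and the paths are illustrative assumptions, so treat the actual file as authoritative.

TASK3:                                                           # assumed key; --task 3 is image retrieval
  name: RetrievalFlickr30k                                       # hypothetical field name
  features_h5path1: data/flickr30k/flickr30k_features.h5         # image features (HDF5); placeholder path
  val_annotations_jsonpath: data/flickr30k/test_annotations.json # point at the test split for evaluation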

Pre-trained model for Evaluation

| Model | Objective | Link |
| --- | --- | --- |
| ViLBERT 2-Layer | Conceptual Caption | Google Drive |
| ViLBERT 4-Layer | Conceptual Caption | Google Drive |
| ViLBERT 6-Layer | Conceptual Caption | Google Drive |
| ViLBERT 8-Layer | Conceptual Caption | Google Drive |
| ViLBERT 6-Layer | VQA | Google Drive |
| ViLBERT 6-Layer | VCR | Google Drive |
| ViLBERT 6-Layer | RefCOCO+ | Google Drive |
| ViLBERT 6-Layer | Image Retrieval | Google Drive |

Evaluation

Zero-Shot Image Retrieval

We can use the pre-trained ViLBERT model directly for zero-shot image retrieval on Flickr30k.

1: Download the pretrained model with objective Conceptual Caption and put it under save/.

2: Update features_h5path1 and val_annotations_jsonpath in vlbert_tasks.yml to point to the Flickr30k test-set image features and JSON file (the default is the training features).

3: Use the following command to evaluate the pre-trained 6-layer ViLBERT model (only single-GPU evaluation is currently supported):

python eval_retrieval.py --bert_model bert-base-uncased --from_pretrained save/bert_base_6_layer_6_connect/pytorch_model_9.bin --config_file config/bert_base_6layer_6conect.json --task 3 --split test --batch_size 1 --zero_shot
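
The --from_pretrained path must match where you placed the download. A minimal sketch of placing and sanity-checking the checkpoint (the download location under ~/Downloads is an assumption; keep the filename the command expects):

mkdir -p save/bert_base_6_layer_6_connect
mv ~/Downloads/pytorch_model_9.bin save/bert_base_6_layer_6_connect/
# the checkpoint should load as a plain state dict on CPU
python -c "import torch; sd = torch.load('save/bert_base_6_layer_6_connect/pytorch_model_9.bin', map_location='cpu'); print(len(sd), 'entries')"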

Image Retrieval

1: Download the pretrained model with objective Image Retrieval and put it under save/.

2: Update features_h5path1 and val_annotations_jsonpath in vlbert_tasks.yml to point to the Flickr30k test-set image features and JSON file (the default is the training features).

3: Use the following command to evaluate the 6-layer ViLBERT model fine-tuned for image retrieval (only single-GPU evaluation is currently supported):

python eval_retrieval.py --bert_model bert-base-uncased --from_pretrained save/RetrievalFlickr30k_bert_base_6layer_6conect-pretrained/pytorch_model_19.bin --config_file config/bert_base_6layer_6conect.json --task 3 --split test --batch_size 1

VQA

1: Download the pretrained model with objective VQA and put it under save/.

2: To test on the held-out validation split, use the following command:

python eval_tasks.py --bert_model bert-base-uncased --from_pretrained save/VQA_bert_base_6layer_6conect-pretrained/pytorch_model_19.bin --config_file config/bert_base_6layer_6conect.json --task 0 --split minval

VCR

1: Download the pretrained model with objective VCR and put it under save/.

2: To test on VCR Q->A, run:

python eval_tasks.py --bert_model bert-base-uncased --from_pretrained save/VCR_Q-A-VCR_QA-R_bert_base_6layer_6conect-pretrained/pytorch_model_19.bin --config_file config/bert_base_6layer_6conect.json --task 1 --split val

3: To test on VCR QA->R, run:

python eval_tasks.py --bert_model bert-base-uncased --from_pretrained save/VCR_Q-A-VCR_QA-R_bert_base_6layer_6conect-pretrained/pytorch_model_19.bin --config_file config/bert_base_6layer_6conect.json --task 2 --split val

RefCOCO+

1: Download the pretrained model with objective RefCOCO+ and put it under save/.

2: We use the pre-computed detections/masks from MAttNet for the fully-automatic comprehension task; check the MAttNet repository for more details.

3: To test on the RefCOCO+ val set, use the following command:

python eval_tasks.py --bert_model bert-base-uncased --from_pretrained save/refcoco+_bert_base_6layer_6conect-pretrained/pytorch_model_19.bin --config_file config/bert_base_6layer_6conect.json --task 4

Visiolinguistic Pre-training

Once you have extracted all the image features, train a 6-layer ViLBERT model on Conceptual Captions with:

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 train_concap.py --from_pretrained bert-base-uncased --bert_model bert-base-uncased --config_file config/bert_base_6layer_6conect.json --learning_rate 1e-4 --train_batch_size 512 --save_name pretrained
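
The command above launches one node with 8 GPUs. torch.distributed.launch also accepts the standard --master_addr/--master_port rendezvous flags for multi-node runs; a hedged two-node sketch follows (the master IP is a placeholder, and multi-node training is not documented by this README):

# on node 0 (master; 10.0.0.1 is a placeholder IP)
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr 10.0.0.1 --master_port 29500 train_concap.py --from_pretrained bert-base-uncased --bert_model bert-base-uncased --config_file config/bert_base_6layer_6conect.json --learning_rate 1e-4 --train_batch_size 512 --save_name pretrained
# on node 1, run the same command with --node_rank=1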

Train ViLBERT for Downstream Tasks

VQA

To fine-tune a 6-layer ViLBERT model for VQA on 8 GPUs, run the following command. --tasks 0 selects the VQA task; check vlbert_tasks.yml for more VQA settings.

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 train_tasks.py --bert_model bert-base-uncased --from_pretrained save/bert_base_6_layer_6_connect_freeze_0/pytorch_model_8.bin  --config_file config/bert_base_6layer_6conect.json  --learning_rate 4e-5 --num_workers 16 --tasks 0 --save_name pretrained

VCR

Similarly, to fine-tune a 6-layer ViLBERT model for the VCR tasks, run the following command. Here we jointly train the Q->A and QA->R tasks, so the task is specified as --tasks 1-2.

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 train_tasks.py --bert_model bert-base-uncased --from_pretrained save/bert_base_6_layer_6_connect_freeze_0/pytorch_model_8.bin  --config_file config/bert_base_6layer_6conect.json  --learning_rate 2e-5 --num_workers 16 --tasks 1-2 --save_name pretrained

Image Retrieval

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 train_tasks.py --bert_model bert-base-uncased --from_pretrained save/bert_base_6_layer_6_connect_freeze_0/pytorch_model_8.bin  --config_file config/bert_base_6layer_6conect.json  --learning_rate 4e-5 --num_workers 9 --tasks 3 --save_name pretrained

Refer Expression

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 train_tasks.py --bert_model bert-base-uncased --from_pretrained save/bert_base_6_layer_6_connect_freeze_0/pytorch_model_8.bin  --config_file config/bert_base_6layer_6conect.json  --learning_rate 4e-5 --num_workers 16 --tasks 4 --save_name pretrained
  • For single-GPU training, use a smaller batch size and simply remove -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 from the commands above; a single-GPU example follows.
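
As an example of the bullet above, the VQA fine-tuning command reduces to the following on one GPU (a sketch; the per-task batch size is assumed to live in vlbert_tasks.yml, so lower it there rather than via a command-line flag):

python train_tasks.py --bert_model bert-base-uncased --from_pretrained save/bert_base_6_layer_6_connect_freeze_0/pytorch_model_8.bin --config_file config/bert_base_6layer_6conect.json --learning_rate 4e-5 --num_workers 16 --tasks 0 --save_name pretrained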

References

If you find this code useful for your research, please cite our paper:

@article{lu2019vilbert,
  title={ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks},
  author={Lu, Jiasen and Batra, Dhruv and Parikh, Devi and Lee, Stefan},
  journal={arXiv preprint arXiv:1908.02265},
  year={2019}
}

