Commit: init_push
jiasenlu committed Aug 17, 2019
0 parents commit 86caf86
Showing 74 changed files with 11,999 additions and 0 deletions.
107 changes: 107 additions & 0 deletions README.md
@@ -0,0 +1,107 @@
# ViLBERT <img src="fig/vilbert_trim.png" width="45">

Code and pre-trained models for **ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks**.


## Repository Setup

1. Create a fresh conda environment, and install all dependencies.

```bash
conda create -n vilbert python=3.6
conda activate vilbert
git clone https://github.com/jiasenlu/ViLBert
cd ViLBert
pip install -r requirements.txt
```

2. Install PyTorch:
```bash
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
```

3. Install apex by following the instructions at https://github.com/NVIDIA/apex (see the sketch after this list).

4. Install this codebase as a package in this environment.
```bash
python setup.py develop
```
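For step 3, a minimal sketch of the apex install, assuming the standard build-from-source instructions in the NVIDIA apex README (the exact pip flags may have changed since, so check the linked README first):

```bash
# Clone and build NVIDIA apex with the C++ and CUDA extensions
# (flags follow the apex README of the time; adjust if they have changed)
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
cd ..
```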

## Data Setup

Check `README.md` under `data` for more details.

## Visiolinguistic Pre-training

To train the model:

```
To be added
```

For internal use: copy the pre-trained checkpoint from Skynet

```bash
cp -a /srv/share3/jlu347/vilbert/save/* <your_directory>
```

## Benchmark Vision-and-Language Tasks

| Task | Sub-Task | Model | LR | Results (split) |
| :-----------------------: | :---------------: | :---------: | :--: | :-----------------------------------------------------: |
| **VQA** | - | **ViLBERT** | 4e-5 | **70.55** (test-dev) |
| - | - | DFAF | - | 70.22 (test-dev) |
| **VCR** | Q->A | **ViLBERT** | 2e-5 | **73.3** (test) |
| - | Q->A | R2C | - | 63.8 (test) |
| **VCR** | QA->R | **ViLBERT** | 2e-5 | **74.6** (test) |
| - | QA->R | R2C | - | 67.3 (test) |
| **VCR** | Q->AR | **ViLBERT** | 2e-5 | **54.8** (test) |
| - | Q->AR | R2C | - | 44.0 (test) |
| **Ref Expression** | RefCOCO+ | **ViLBERT** | 4e-5 | **72.34** (val) - **78.52** (testA) - **62.61** (testB) |
| - | RefCOCO+ | MAttNet | - | 65.33 (val) - 71.62 (testA) - 56.02 (testB) |
| **Ref Expression** | RefCOCO | **ViLBERT** | 4e-5 | - |
| - | RefCOCO | MAttNet | - | - |
| **Ref Expression**        | RefCOCOg          | **ViLBERT** | 4e-5 | -                                                        |
| -                         | RefCOCOg          | MAttNet     | -    | -                                                        |
| **Image Caption Ranking** | Image Retrieval | **ViLBERT** | 2e-5 | **58.20** (R1) - **84.90** (R5) - **91.52** (R10) |
| - | Image Retrieval | SCAN | - | 48.60 (R1) - 77.70 (R5) - 85.20 (R10) |
| **Image Caption Ranking** | Caption Retrieval | **ViLBERT** | 2e-5 | - |
| - | Caption Retrieval | SCAN | - | - |


## Tasks
### VQA

To fine-tune a 6-layer ViLBERT model for VQA on 8 GPUs, run the command below. `--tasks 1` selects the VQA task; check `vlbert_tasks.yml` for more VQA settings.

```bash
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 train_tasks.py --bert_model bert-base-uncased --from_pretrained save/bert_base_6_layer_6_connect_freeze_0/pytorch_model_8.bin --config_file config/bert_base_6layer_6conect.json --learning_rate 4e-5 --num_workers 16 --tasks 1 --save_name pretrained
```
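The same command scales to fewer GPUs by changing `--nproc_per_node`. A hedged example for a single node with 4 GPUs (the per-task batch size in `vlbert_tasks.yml` may also need adjusting, since the effective batch size shrinks with the GPU count):

```bash
# Identical flags to the 8-GPU command above; only --nproc_per_node is reduced to 4
python -m torch.distributed.launch --nproc_per_node=4 --nnodes=1 --node_rank=0 train_tasks.py \
  --bert_model bert-base-uncased \
  --from_pretrained save/bert_base_6_layer_6_connect_freeze_0/pytorch_model_8.bin \
  --config_file config/bert_base_6layer_6conect.json \
  --learning_rate 4e-5 --num_workers 16 --tasks 1 --save_name pretrained
```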

### VCR

Similarly, to fine-tune a 6-layer ViLBERT model on the VCR task, run the following command. Here we jointly train the `Q->A` and `QA->R` tasks, so the tasks are specified as `--tasks 6-7`.

```bash
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 train_tasks.py --bert_model bert-base-uncased --from_pretrained save/bert_base_6_layer_6_connect_freeze_0/pytorch_model_8.bin --config_file config/bert_base_6layer_6conect.json --learning_rate 2e-5 --num_workers 16 --tasks 6-7 --save_name pretrained
```

### Referring Expressions
```bash
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 train_tasks.py --bert_model bert-base-uncased --from_pretrained save/bert_base_6_layer_6_connect_freeze_0/pytorch_model_8.bin --config_file config/bert_base_6layer_6conect.json --learning_rate 4e-5 --num_workers 16 --tasks 11 --save_name pretrained
```

### Image Retrieval
```bash
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --node_rank=0 train_tasks.py --bert_model bert-base-uncased --from_pretrained save/bert_base_6_layer_6_connect_freeze_0/pytorch_model_8.bin --config_file config/bert_base_6layer_6conect.json --learning_rate 4e-5 --num_workers 9 --tasks 11 --save_name pretrained
```

### Add your own tasks
```
```
## Why does ViLBERT look like <img src="fig/vilbert_trim.png" width="45">?

<p align="center">
<img src="fig/vilbert.png" width="400" >
</p>
1 change: 1 addition & 0 deletions config/bert-base-uncased_weight_name.json
@@ -0,0 +1 @@
["embeddings.word_embeddings.weight", "embeddings.position_embeddings.weight", "embeddings.token_type_embeddings.weight", "embeddings.LayerNorm.weight", "embeddings.LayerNorm.bias", "encoder.layer.0.attention.self.query.weight", "encoder.layer.0.attention.self.query.bias", "encoder.layer.0.attention.self.key.weight", "encoder.layer.0.attention.self.key.bias", "encoder.layer.0.attention.self.value.weight", "encoder.layer.0.attention.self.value.bias", "encoder.layer.0.attention.output.dense.weight", "encoder.layer.0.attention.output.dense.bias", "encoder.layer.0.attention.output.LayerNorm.weight", "encoder.layer.0.attention.output.LayerNorm.bias", "encoder.layer.0.intermediate.dense.weight", "encoder.layer.0.intermediate.dense.bias", "encoder.layer.0.output.dense.weight", "encoder.layer.0.output.dense.bias", "encoder.layer.0.output.LayerNorm.weight", "encoder.layer.0.output.LayerNorm.bias", "encoder.layer.1.attention.self.query.weight", "encoder.layer.1.attention.self.query.bias", "encoder.layer.1.attention.self.key.weight", "encoder.layer.1.attention.self.key.bias", "encoder.layer.1.attention.self.value.weight", "encoder.layer.1.attention.self.value.bias", "encoder.layer.1.attention.output.dense.weight", "encoder.layer.1.attention.output.dense.bias", "encoder.layer.1.attention.output.LayerNorm.weight", "encoder.layer.1.attention.output.LayerNorm.bias", "encoder.layer.1.intermediate.dense.weight", "encoder.layer.1.intermediate.dense.bias", "encoder.layer.1.output.dense.weight", "encoder.layer.1.output.dense.bias", "encoder.layer.1.output.LayerNorm.weight", "encoder.layer.1.output.LayerNorm.bias", "encoder.layer.2.attention.self.query.weight", "encoder.layer.2.attention.self.query.bias", "encoder.layer.2.attention.self.key.weight", "encoder.layer.2.attention.self.key.bias", "encoder.layer.2.attention.self.value.weight", "encoder.layer.2.attention.self.value.bias", "encoder.layer.2.attention.output.dense.weight", "encoder.layer.2.attention.output.dense.bias", "encoder.layer.2.attention.output.LayerNorm.weight", "encoder.layer.2.attention.output.LayerNorm.bias", "encoder.layer.2.intermediate.dense.weight", "encoder.layer.2.intermediate.dense.bias", "encoder.layer.2.output.dense.weight", "encoder.layer.2.output.dense.bias", "encoder.layer.2.output.LayerNorm.weight", "encoder.layer.2.output.LayerNorm.bias", "encoder.layer.3.attention.self.query.weight", "encoder.layer.3.attention.self.query.bias", "encoder.layer.3.attention.self.key.weight", "encoder.layer.3.attention.self.key.bias", "encoder.layer.3.attention.self.value.weight", "encoder.layer.3.attention.self.value.bias", "encoder.layer.3.attention.output.dense.weight", "encoder.layer.3.attention.output.dense.bias", "encoder.layer.3.attention.output.LayerNorm.weight", "encoder.layer.3.attention.output.LayerNorm.bias", "encoder.layer.3.intermediate.dense.weight", "encoder.layer.3.intermediate.dense.bias", "encoder.layer.3.output.dense.weight", "encoder.layer.3.output.dense.bias", "encoder.layer.3.output.LayerNorm.weight", "encoder.layer.3.output.LayerNorm.bias", "encoder.layer.4.attention.self.query.weight", "encoder.layer.4.attention.self.query.bias", "encoder.layer.4.attention.self.key.weight", "encoder.layer.4.attention.self.key.bias", "encoder.layer.4.attention.self.value.weight", "encoder.layer.4.attention.self.value.bias", "encoder.layer.4.attention.output.dense.weight", "encoder.layer.4.attention.output.dense.bias", "encoder.layer.4.attention.output.LayerNorm.weight", "encoder.layer.4.attention.output.LayerNorm.bias", 
"encoder.layer.4.intermediate.dense.weight", "encoder.layer.4.intermediate.dense.bias", "encoder.layer.4.output.dense.weight", "encoder.layer.4.output.dense.bias", "encoder.layer.4.output.LayerNorm.weight", "encoder.layer.4.output.LayerNorm.bias", "encoder.layer.5.attention.self.query.weight", "encoder.layer.5.attention.self.query.bias", "encoder.layer.5.attention.self.key.weight", "encoder.layer.5.attention.self.key.bias", "encoder.layer.5.attention.self.value.weight", "encoder.layer.5.attention.self.value.bias", "encoder.layer.5.attention.output.dense.weight", "encoder.layer.5.attention.output.dense.bias", "encoder.layer.5.attention.output.LayerNorm.weight", "encoder.layer.5.attention.output.LayerNorm.bias", "encoder.layer.5.intermediate.dense.weight", "encoder.layer.5.intermediate.dense.bias", "encoder.layer.5.output.dense.weight", "encoder.layer.5.output.dense.bias", "encoder.layer.5.output.LayerNorm.weight", "encoder.layer.5.output.LayerNorm.bias", "encoder.layer.6.attention.self.query.weight", "encoder.layer.6.attention.self.query.bias", "encoder.layer.6.attention.self.key.weight", "encoder.layer.6.attention.self.key.bias", "encoder.layer.6.attention.self.value.weight", "encoder.layer.6.attention.self.value.bias", "encoder.layer.6.attention.output.dense.weight", "encoder.layer.6.attention.output.dense.bias", "encoder.layer.6.attention.output.LayerNorm.weight", "encoder.layer.6.attention.output.LayerNorm.bias", "encoder.layer.6.intermediate.dense.weight", "encoder.layer.6.intermediate.dense.bias", "encoder.layer.6.output.dense.weight", "encoder.layer.6.output.dense.bias", "encoder.layer.6.output.LayerNorm.weight", "encoder.layer.6.output.LayerNorm.bias", "encoder.layer.7.attention.self.query.weight", "encoder.layer.7.attention.self.query.bias", "encoder.layer.7.attention.self.key.weight", "encoder.layer.7.attention.self.key.bias", "encoder.layer.7.attention.self.value.weight", "encoder.layer.7.attention.self.value.bias", "encoder.layer.7.attention.output.dense.weight", "encoder.layer.7.attention.output.dense.bias", "encoder.layer.7.attention.output.LayerNorm.weight", "encoder.layer.7.attention.output.LayerNorm.bias", "encoder.layer.7.intermediate.dense.weight", "encoder.layer.7.intermediate.dense.bias", "encoder.layer.7.output.dense.weight", "encoder.layer.7.output.dense.bias", "encoder.layer.7.output.LayerNorm.weight", "encoder.layer.7.output.LayerNorm.bias", "encoder.layer.8.attention.self.query.weight", "encoder.layer.8.attention.self.query.bias", "encoder.layer.8.attention.self.key.weight", "encoder.layer.8.attention.self.key.bias", "encoder.layer.8.attention.self.value.weight", "encoder.layer.8.attention.self.value.bias", "encoder.layer.8.attention.output.dense.weight", "encoder.layer.8.attention.output.dense.bias", "encoder.layer.8.attention.output.LayerNorm.weight", "encoder.layer.8.attention.output.LayerNorm.bias", "encoder.layer.8.intermediate.dense.weight", "encoder.layer.8.intermediate.dense.bias", "encoder.layer.8.output.dense.weight", "encoder.layer.8.output.dense.bias", "encoder.layer.8.output.LayerNorm.weight", "encoder.layer.8.output.LayerNorm.bias", "encoder.layer.9.attention.self.query.weight", "encoder.layer.9.attention.self.query.bias", "encoder.layer.9.attention.self.key.weight", "encoder.layer.9.attention.self.key.bias", "encoder.layer.9.attention.self.value.weight", "encoder.layer.9.attention.self.value.bias", "encoder.layer.9.attention.output.dense.weight", "encoder.layer.9.attention.output.dense.bias", "encoder.layer.9.attention.output.LayerNorm.weight", 
"encoder.layer.9.attention.output.LayerNorm.bias", "encoder.layer.9.intermediate.dense.weight", "encoder.layer.9.intermediate.dense.bias", "encoder.layer.9.output.dense.weight", "encoder.layer.9.output.dense.bias", "encoder.layer.9.output.LayerNorm.weight", "encoder.layer.9.output.LayerNorm.bias", "encoder.layer.10.attention.self.query.weight", "encoder.layer.10.attention.self.query.bias", "encoder.layer.10.attention.self.key.weight", "encoder.layer.10.attention.self.key.bias", "encoder.layer.10.attention.self.value.weight", "encoder.layer.10.attention.self.value.bias", "encoder.layer.10.attention.output.dense.weight", "encoder.layer.10.attention.output.dense.bias", "encoder.layer.10.attention.output.LayerNorm.weight", "encoder.layer.10.attention.output.LayerNorm.bias", "encoder.layer.10.intermediate.dense.weight", "encoder.layer.10.intermediate.dense.bias", "encoder.layer.10.output.dense.weight", "encoder.layer.10.output.dense.bias", "encoder.layer.10.output.LayerNorm.weight", "encoder.layer.10.output.LayerNorm.bias", "encoder.layer.11.attention.self.query.weight", "encoder.layer.11.attention.self.query.bias", "encoder.layer.11.attention.self.key.weight", "encoder.layer.11.attention.self.key.bias", "encoder.layer.11.attention.self.value.weight", "encoder.layer.11.attention.self.value.bias", "encoder.layer.11.attention.output.dense.weight", "encoder.layer.11.attention.output.dense.bias", "encoder.layer.11.attention.output.LayerNorm.weight", "encoder.layer.11.attention.output.LayerNorm.bias", "encoder.layer.11.intermediate.dense.weight", "encoder.layer.11.intermediate.dense.bias", "encoder.layer.11.output.dense.weight", "encoder.layer.11.output.dense.bias", "encoder.layer.11.output.LayerNorm.weight", "encoder.layer.11.output.LayerNorm.bias"]
1 change: 1 addition & 0 deletions config/bert-large-uncased_weight_name.json

Large diffs are not rendered by default.

30 changes: 30 additions & 0 deletions config/bert_base_2layer_2conect.json
@@ -0,0 +1,30 @@
{
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 30522,
"v_feature_size": 2048,
"v_target_size": 1601,
"v_hidden_size": 1024,
"v_num_hidden_layers":2,
"v_num_attention_heads":8,
"v_intermediate_size":1024,
"bi_hidden_size":1024,
"bi_num_attention_heads":8,
"bi_intermediate_size": 1024,
"bi_attention_type":1,
"v_attention_probs_dropout_prob":0.1,
"v_hidden_act":"gelu",
"v_hidden_dropout_prob":0.1,
"v_initializer_range":0.02,
"v_biattention_id":[0, 1],
"t_biattention_id":[10, 11],
"pooling_method": "mul"
}
30 changes: 30 additions & 0 deletions config/bert_base_4layer_4conect.json
@@ -0,0 +1,30 @@
{
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 30522,
"v_feature_size": 2048,
"v_target_size": 1601,
"v_hidden_size": 1024,
"v_num_hidden_layers":4,
"v_num_attention_heads":8,
"v_intermediate_size":1024,
"bi_hidden_size":1024,
"bi_num_attention_heads":8,
"bi_intermediate_size": 1024,
"bi_attention_type":1,
"v_attention_probs_dropout_prob":0.1,
"v_hidden_act":"gelu",
"v_hidden_dropout_prob":0.1,
"v_initializer_range":0.02,
"v_biattention_id":[0, 1, 2, 3],
"t_biattention_id":[8, 9, 10, 11],
"pooling_method": "mul"
}
30 changes: 30 additions & 0 deletions config/bert_base_6layer_6conect.json
@@ -0,0 +1,30 @@
{
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 30522,
"v_feature_size": 2048,
"v_target_size": 1601,
"v_hidden_size": 1024,
"v_num_hidden_layers":6,
"v_num_attention_heads":8,
"v_intermediate_size":1024,
"bi_hidden_size":1024,
"bi_num_attention_heads":8,
"bi_intermediate_size": 1024,
"bi_attention_type":1,
"v_attention_probs_dropout_prob":0.1,
"v_hidden_act":"gelu",
"v_hidden_dropout_prob":0.1,
"v_initializer_range":0.02,
"v_biattention_id":[0, 1, 2, 3, 4, 5],
"t_biattention_id":[6, 7, 8, 9, 10, 11],
"pooling_method": "mul"
}
30 changes: 30 additions & 0 deletions config/bert_base_8layer_8conect.json
@@ -0,0 +1,30 @@
{
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 30522,
"v_feature_size": 2048,
"v_target_size": 1601,
"v_hidden_size": 1024,
"v_num_hidden_layers":8,
"v_num_attention_heads":8,
"v_intermediate_size":1024,
"bi_hidden_size":1024,
"bi_num_attention_heads":8,
"bi_intermediate_size": 1024,
"bi_attention_type":1,
"v_attention_probs_dropout_prob":0.1,
"v_hidden_act":"gelu",
"v_hidden_dropout_prob":0.1,
"v_initializer_range":0.02,
"v_biattention_id":[0, 1, 2, 3, 4, 5, 6, 7],
"t_biattention_id":[4, 5, 6, 7, 8, 9, 10, 11],
"pooling_method": "mul"
}
13 changes: 13 additions & 0 deletions config/bert_base_baseline.json
@@ -0,0 +1,13 @@
{
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 512,
"num_attention_heads": 12,
"num_hidden_layers": 12,
"type_vocab_size": 2,
"vocab_size": 30522
}
30 changes: 30 additions & 0 deletions config/bert_large_2layer_2conect.json
@@ -0,0 +1,30 @@
{
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 4096,
"max_position_embeddings": 512,
"num_attention_heads": 16,
"num_hidden_layers": 24,
"type_vocab_size": 2,
"vocab_size": 30522,
"v_feature_size": 2048,
"v_target_size": 1601,
"v_hidden_size":1024,
"v_num_hidden_layers":2,
"v_num_attention_heads":8,
"v_intermediate_size":1024,
"bi_hidden_size":1024,
"bi_num_attention_heads":8,
"bi_intermediate_size": 1024,
"bi_attention_type":1,
"v_attention_probs_dropout_prob":0.1,
"v_hidden_act":"gelu",
"v_hidden_dropout_prob":0.1,
"v_initializer_range":0.02,
"v_biattention_id":[0, 1],
"t_biattention_id":[22, 23],
"pooling_method": "mul"
}
30 changes: 30 additions & 0 deletions config/bert_large_4layer_4conect.json
@@ -0,0 +1,30 @@
{
"attention_probs_dropout_prob": 0.1,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 4096,
"max_position_embeddings": 512,
"num_attention_heads": 16,
"num_hidden_layers": 24,
"type_vocab_size": 2,
"vocab_size": 30522,
"v_feature_size": 2048,
"v_target_size": 1601,
"v_hidden_size":1024,
"v_num_hidden_layers":4,
"v_num_attention_heads":8,
"v_intermediate_size":1024,
"bi_hidden_size":1024,
"bi_num_attention_heads":8,
"bi_intermediate_size": 1024,
"bi_attention_type":1,
"v_attention_probs_dropout_prob":0.1,
"v_hidden_act":"gelu",
"v_hidden_dropout_prob":0.1,
"v_initializer_range":0.02,
"v_biattention_id":[0, 1, 2, 3],
"t_biattention_id":[20, 21, 22, 23],
"pooling_method": "mul"
}