
PMA-Net: Prototypical Memory Attention Network
(ICCV 2023)

This repository contains the reference code for the paper With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning.

Please cite with the following BibTeX:

@inproceedings{barraco2023little,
  title={{With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning}},
  author={Barraco, Manuele and Sarto, Sara and Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  year={2023}
}


Environment Setup

Clone the repository and create the pma-net conda environment using the environment.yml file:

conda env create -f environment.yml
conda activate pma-net

Note: Python 3.9 is required to run our code.

Data Preparation

Checkpoints

XE and SCST checkpoints are available at the following links:

Model          Checkpoint
PMA-Net XE     pma-net_xe.tar
PMA-Net SCST   pma-net_scst.tar

Download and extract the archives, then place them in a folder of your choice. This path, referred to as {CHECKPOINT_FOLDER}, will be passed as an argument later.
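
If you prefer to script this step, the following is a minimal sketch using Python's standard tarfile module; the archive names match the table above, and the target folder name is only an example.

import tarfile
from pathlib import Path

CHECKPOINT_FOLDER = Path("checkpoints")  # example location; any folder works
CHECKPOINT_FOLDER.mkdir(parents=True, exist_ok=True)

for archive in ("pma-net_xe.tar", "pma-net_scst.tar"):
    with tarfile.open(archive) as tar:
        tar.extractall(CHECKPOINT_FOLDER)  # extracted checkpoints go to {CHECKPOINT_FOLDER}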

Dataset

To run the code, annotations for the COCO dataset are needed. Please download the zip file containing the annotations (annotations.zip), extract it, and place its contents under the datasets/annotations folder.

To train and test our model, download the tar files containing the COCO image features already extracted with CLIP ViT-L/14, available at the following links:

Split                     Features archive
COCO Training (chunk 0)   coco_training_CLIP-ViT-L14_cached_0.tar
COCO Training (chunk 1)   coco_training_CLIP-ViT-L14_cached_1.tar
COCO Training (chunk 2)   coco_training_CLIP-ViT-L14_cached_2.tar
COCO Training (chunk 3)   coco_training_CLIP-ViT-L14_cached_3.tar
COCO Training (chunk 4)   coco_training_CLIP-ViT-L14_cached_4.tar
COCO Training (chunk 5)   coco_training_CLIP-ViT-L14_cached_5.tar
COCO Training for SCST    coco_training_dict_CLIP-ViT-L14_cached.tar
COCO Validation           coco_validation_dict_CLIP-ViT-L14_cached.tar
COCO Test                 coco_test_dict_CLIP-ViT-L14_cached.tar

Once the files have been downloaded and extracted into a single folder, set the corresponding paths in configs/datasets/datasets.json.

These paths will be set as arguments later.
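
For reference, the cached features come from CLIP ViT-L/14. The snippet below is a minimal sketch of how such features can be extracted with the openai-clip package for a single image; the exact tensor layout and caching format expected by our data loaders are defined in this repository, so treat the snippet only as an illustration (the image path is hypothetical).

import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)  # CLIP ViT-L/14 backbone

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image
with torch.no_grad():
    features = model.encode_image(image)  # pooled 768-d embedding; the cached files may store grid features instead
print(features.shape)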

Evaluation

To evaluate our best model, use

torchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --generation_max_length 30 --generation_num_beams 5 --per_device_eval_batch_size {EVAL_BATCH_SIZE} --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25 --resume_from_checkpoint {CHECKPOINT_FOLDER}

Training Procedure

To train our best model with the parameters used in our experiments, use

torchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_train --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --train_datasets coco_CLIP-ViT-L14_cached --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --eval_steps 1000 --save_steps 1000 --max_steps -1 --generation_max_length 30 --generation_num_beams 5 --per_device_train_batch_size {TRAIN_BATCH_SIZE} --per_device_eval_batch_size {EVAL_BATCH_SIZE} --custom_lr_scheduler CustomScheduler --steps_min 15000 --start_decreasing_steps 10000 --learning_rate 2.5e-4 --warmup_steps 1000 --lr_min 1e-5 --gradient_accumulation_steps 8 --deepspeed configs/deepspeed/config_lamb_zero2.json --encoder --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25
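
The CustomScheduler flags above (--warmup_steps, --start_decreasing_steps, --steps_min, --lr_min) are defined by this repository; as a rough guide only, one plausible reading of how they shape the learning rate is sketched below. The function name and the exact schedule shape are assumptions, not the repository's actual implementation.

def lr_at_step(step, base_lr=2.5e-4, lr_min=1e-5,
               warmup_steps=1000, start_decreasing_steps=10000, steps_min=15000):
    # Assumed shape: linear warmup to base_lr, constant plateau, then a linear
    # decay that reaches lr_min at steps_min and stays there.
    if step < warmup_steps:                # linear warmup
        return base_lr * step / warmup_steps
    if step < start_decreasing_steps:      # plateau at base_lr
        return base_lr
    if step < steps_min:                   # linear decay towards lr_min
        frac = (step - start_decreasing_steps) / (steps_min - start_decreasing_steps)
        return base_lr + frac * (lr_min - base_lr)
    return lr_min                          # floor at lr_min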

After XE pre-training, for the SCST step use:

torchrun --nproc_per_node {N_GPUS} --master_port {MASTER_PORT} main.py --do_train --do_eval --do_predict --predict_with_generate --output_dir {OUTPUT_DIR} --train_datasets coco_training_dict_CLIP-ViT-L14_cached --validation_datasets coco_validation_dict_CLIP-ViT-L14_cached --test_datasets coco_test_dict_CLIP-ViT-L14_cached --evaluation_strategy steps --eval_steps 1000 --save_steps 1000 --max_steps -1 --generation_max_length 30 --generation_num_beams 5 --per_device_train_batch_size {TRAIN_BATCH_SIZE} --per_device_eval_batch_size {EVAL_BATCH_SIZE} --steps_min 15000 --learning_rate 5e-6 --gradient_accumulation_steps 8 --deepspeed configs/deepspeed/config_adam_zero2.json --encoder --kmeans_memory --add_memory_slots_selfattn --n_memory_slots 1024 --deque_iters 1500 --window 0.25 --scst --resume_from_checkpoint {CHECKPOINT_FOLDER}
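
For context, the --scst flag switches to self-critical sequence training (Rennie et al., 2017), which optimizes a sequence-level reward (typically CIDEr) using the model's own greedy or beam-searched output as baseline. The function below is a generic sketch of that objective, not this repository's implementation; all names are illustrative.

import torch

def scst_loss(sample_logprobs, sample_rewards, baseline_rewards):
    # sample_logprobs:  (batch,) summed log-probabilities of sampled captions
    # sample_rewards:   (batch,) reward (e.g. CIDEr) of the sampled captions
    # baseline_rewards: (batch,) reward of the greedy / beam-searched captions
    advantage = sample_rewards - baseline_rewards          # self-critical baseline
    return -(advantage.detach() * sample_logprobs).mean()  # REINFORCE-style estimate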

Custom Arguments

The complete list of custom arguments supported by our code:

Argument                      Description
--encoder                     Add a BERT encoder.
--n_layer                     Number of layers.
--n_embd                      Embedding dimension.
--n_head                      Number of attention heads.
--custom_checkpoint_keeper    How many checkpoints to keep on disk, default is 5.
--scst                        Use the SCST training phase.
--train_datasets              Training datasets, default is coco_training.
--validation_datasets         Validation datasets, default is coco_validation_dict.
--test_datasets               Test datasets, default is coco_test_dict.
--scst_datasets               SCST datasets, default is coco_training_dict.
--custom_lr_scheduler         Which custom scheduler to use (CustomScheduler or TransformerScheduler), default is None.
--lr_multiplier               Learning rate multiplier, default is 1.0.
--steps_min                   Used only with CustomScheduler.
--lr_min                      Used only with CustomScheduler.
--start_decreasing_steps      Used only with CustomScheduler.
--add_memory_slots_selfattn   Add memory slots in the self-attention blocks (see the sketch after this list).
--n_memory_slots              How many memory slots, default is 64.
--freeze_memory               Freeze the memories.
--kmeans_memory               Compute the memories using k-means.
--deque_iters                 Maximum number of iterations of data kept in the deque, default is 10.
--window                      Overlap window of new data, default is None.
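
To give an intuition for what --kmeans_memory, --n_memory_slots, --deque_iters and --add_memory_slots_selfattn control, the sketch below shows the general idea of prototypical memory: activations from past iterations are kept in a deque, summarized into a fixed number of prototypes with k-means, and appended to the keys and values of a self-attention layer. This is a simplified illustration of the mechanism described in the paper, not the code used in this repository; class and method names, shapes, and the naive k-means routine are all assumptions.

from collections import deque

import torch
import torch.nn.functional as F


class PrototypicalMemorySketch(torch.nn.Module):
    # Illustrative only: k-means prototypes of buffered past keys are
    # appended to the current keys/values before attention.

    def __init__(self, n_memory_slots=1024, deque_iters=1500):
        super().__init__()
        self.n_memory_slots = n_memory_slots
        self.past_keys = deque(maxlen=deque_iters)  # rolling buffer of past activations

    @torch.no_grad()
    def build_prototypes(self, n_iters=10):
        # Naive k-means over the buffered activations (placeholder for a faster routine).
        data = torch.cat(list(self.past_keys), dim=0)                  # (N, d_model)
        centroids = data[torch.randperm(len(data))[: self.n_memory_slots]]
        for _ in range(n_iters):
            assign = torch.cdist(data, centroids).argmin(dim=1)        # nearest-centroid assignment
            for j in range(len(centroids)):
                members = data[assign == j]
                if len(members) > 0:
                    centroids[j] = members.mean(dim=0)
        return centroids                                               # (<= n_memory_slots, d_model)

    def forward(self, q, k, v):
        # q, k, v: (batch, seq, d_model); memory slots extend keys and values only.
        self.past_keys.append(k.detach().flatten(0, 1))                # store current keys
        mem = self.build_prototypes().unsqueeze(0).expand(k.size(0), -1, -1)
        k_ext = torch.cat([k, mem], dim=1)                             # (batch, seq + slots, d_model)
        v_ext = torch.cat([v, mem], dim=1)
        attn = F.softmax(q @ k_ext.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        return attn @ v_ext                                            # (batch, seq, d_model)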