Name: Each feature file is named [File ID].npy, where the file ID corresponds to the file ID in Flickr30K Entities.
Proposal generation: We use Selective Search to generate proposals for each image in Flickr30K Entities, and Edge Box to generate proposals for each image in the Referit Game dataset. We keep the top 100 proposals in each image.
Feature extractor: We apply a Faster-RCNN network pre-trained on PASCAL VOC 2012 for Flickr30K Entities, and one pre-trained on ImageNet for Referit Game. In both datasets, the visual features of each image are represented as a 100 x 4096 matrix. Each row is the visual feature (fc7 layer of Faster-RCNN) of one proposal bounding box.
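
As a minimal sketch of how a feature file might be read, the snippet below loads one [File ID].npy file with NumPy and checks that it has the 100 x 4096 shape described above. The function names `load_features` and `proposal_feature` are hypothetical helpers introduced here for illustration; they are not part of the released files. The strict shape check assumes every file holds exactly 100 rows.

```python
import numpy as np

NUM_PROPOSALS = 100   # top 100 proposals kept per image
FEATURE_DIM = 4096    # length of the fc7 feature vector


def load_features(path):
    """Load one [File ID].npy file and verify the documented shape.

    Assumption: the file stores a single 2-D array of shape (100, 4096).
    """
    feats = np.load(path)
    expected = (NUM_PROPOSALS, FEATURE_DIM)
    if feats.shape != expected:
        raise ValueError(f"expected shape {expected}, got {feats.shape}")
    return feats


def proposal_feature(feats, index):
    """Return the feature vector (one row) for the proposal at `index`."""
    return feats[index]
```

For example, with a hypothetical file name, `feats = load_features("12345.npy")` would return the full matrix, and `proposal_feature(feats, 0)` would return the 4096-dimensional feature of the first proposal.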