Image Captioning

The following notebook is an exercise for the Convolutional Neural Networks for Computer Vision course at Afeka College of Engineering.
It uses the Flickr8K dataset for image captioning with a simple encoder-decoder architecture, evaluated with cross-entropy loss and a 4-gram BLEU score.

Submitted By:

  • Tal Goldengoren
  • Guy Kabiri

Table of Contents:

  • Imports
  • Data Exploration
  • Data Processing
  • Data Samples
  • Training
  • Graphs
  • Prediction Results
  • References

Imports

from dataloader import *
from model import *
from train import *
from utils import *

import torch
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

assert torch.cuda.is_available()    #   training requires a CUDA-capable GPU

#   fix the random seeds for reproducibility
np.random.seed(CFG.seed)
torch.manual_seed(CFG.seed)
torch.cuda.manual_seed(CFG.seed)

Data Exploration

The dataset used in this exercise is Flickr8K.
It contains about 8,000 images with 5 different captions each, for a total of about 40,000 captions.
Since each image may be described in different ways by different people, having more than one caption per image helps both in training and in evaluating the correctness of the predictions.

loader = get_loaders(batch_size=1, phase='test')['test']

captions_file = "data/flickr8k/captions.txt"
df = pd.read_csv(captions_file)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40455 entries, 0 to 40454
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   image    40455 non-null  object
 1   caption  40455 non-null  object
dtypes: object(2)
memory usage: 632.2+ KB

Some caption examples

df
image caption
0 1000268201_693b08cb0e.jpg A child in a pink dress is climbing up a set o...
1 1000268201_693b08cb0e.jpg A girl going into a wooden building .
2 1000268201_693b08cb0e.jpg A little girl climbing into a wooden playhouse .
3 1000268201_693b08cb0e.jpg A little girl climbing the stairs to her playh...
4 1000268201_693b08cb0e.jpg A little girl in a pink dress going into a woo...
... ... ...
40450 997722733_0cb5439472.jpg A man in a pink shirt climbs a rock face
40451 997722733_0cb5439472.jpg A man is rock climbing high in the air .
40452 997722733_0cb5439472.jpg A person in a red shirt climbing up a rock fac...
40453 997722733_0cb5439472.jpg A rock climber in a red shirt .
40454 997722733_0cb5439472.jpg A rock climber practices on a rock climbing wa...

40455 rows × 2 columns

As can be seen above, each image has 5 captions, which means that all 5 captions should be taken into account when evaluating the model's performance.
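To illustrate how multiple reference captions enter a 4-gram BLEU score, here is a minimal sketch using NLTK. It reuses three of the complete reference captions of the last image shown above; the predicted caption, the lowercase whitespace tokenization, and the smoothing choice are illustrative assumptions, not necessarily what train.py does.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

#   three of the five reference captions for 997722733_0cb5439472.jpg
references = [
    "a man in a pink shirt climbs a rock face".split(),
    "a man is rock climbing high in the air".split(),
    "a rock climber in a red shirt".split(),
]

#   hypothetical predicted caption for the same image
prediction = "a man in a red shirt is climbing a rock".split()

#   4-gram BLEU with uniform weights; smoothing avoids zero scores on short captions
score = sentence_bleu(references, prediction,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print('4-gram BLEU: {:.4f}'.format(score))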

num_images = len(df.image.unique())
train_img_size, valid_img_size, test_img_size = int(num_images*CFG.train_size), int(num_images*(1-CFG.train_size)/2), int(num_images*(1-CFG.train_size)/2)
train_cap_size, valid_cap_size, test_cap_size = train_img_size*5, valid_img_size*5, test_img_size*5
print('There are {} images in the dataset'.format(num_images))
print('Training set will contain {} images and {} captions'.format(train_img_size, train_cap_size))
print('Validation set will contain {} images and {} captions'.format(valid_img_size, valid_cap_size))
print('Test set will contain {} images and {} captions'.format(test_img_size, test_cap_size))
There are 8091 images in the dataset
Training set will contain 6068 images and 30340 captions
Validation set will contain 1011 images and 5055 captions
Test set will contain 1011 images and 5055 captions

Understand the Data

loader_iter = iter(loader)
_, caps, _ = next(loader_iter)
print(caps)
tensor([[[  1,   1,   1,   1,   1]],

        [[  4,   4,  10,  10, 431]],

        [[ 30, 431, 431,  21, 335]],

        [[  6,   6,  30,   6,   6]],

        [[ 29,  17,   6,  17,  29]],

        [[ 37, 324,  17,  29,  37]],

        [[ 10,  37,  29,   8,  44]],

        [[ 44, 423,   8,  10,   2]],

        [[  5,  44,  10, 423,   0]],

        [[  2,   5,  44,  44,   0]],

        [[  0,   2,   5,   5,   0]],

        [[  0,   0,   2,   2,   0]]])
for _ in range(2):
    batch = next(loader_iter)
    imgs, caps, _ = batch
    print('Images shape: {}'.format(imgs.shape))
    print('Captions shape: {}'.format(caps.shape))
    print()
Images shape: torch.Size([1, 3, 224, 224])
Captions shape: torch.Size([16, 1, 5])

Images shape: torch.Size([1, 3, 224, 224])
Captions shape: torch.Size([21, 1, 5])

The data will be provided to the model as follows:
Images: [B, C, H, W]
Captions: [MS, B, NC]

B=batch size
MS=max sentence length
NC=number of captions per image

While the image shape is straightforward, the caption shape looks a bit odd.
This shape is a consequence of the different sentence lengths across samples.
When working with batches, the samples within each batch must be of equal size, so sentences of different lengths cannot be stored directly; instead, shorter sentences are padded to the length of the longest one.
The first sentence runs along the first column of the matrix, the second sentence along the second column, and so on.
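As a minimal sketch of this padding (not the project's dataloader.py), the snippet below builds a [MS, B, NC] tensor for a single image with three captions of different lengths; the token ids are stand-ins, with 1 = <SOS>, 2 = <EOS> and 0 = <PAD> as in the printout above.

import torch

#   hypothetical tokenized captions for one image (3 captions, different lengths)
captions = [
    [1, 4, 30, 6, 29, 37, 10, 44, 5, 2],
    [1, 4, 431, 6, 17, 324, 37, 423, 44, 5, 2],
    [1, 10, 431, 30, 6, 17, 29, 8, 10, 44, 5, 2],
]

max_len = max(len(c) for c in captions)                     #   MS: length of the longest caption
padded = [c + [0] * (max_len - len(c)) for c in captions]   #   right-pad the shorter ones with <PAD>

#   stack to [NC, MS], then rearrange to the [MS, B, NC] layout shown above (B = 1)
caps = torch.tensor(padded).t().unsqueeze(1)
print(caps.shape)   #   torch.Size([12, 1, 3])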

Data Processing

Image Processing

The images in the dataset vary in shape.
The backbone used in this architecture is a model pre-trained on ImageNet, therefore all the images will be resized to 224x224.
Also, because the model is pre-trained, the images will be normalized with the ImageNet mean and std values.
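A minimal sketch of such a preprocessing pipeline with torchvision (the project's actual transform lives in get_transformer; using a plain resize rather than crop-based augmentation is an assumption):

from torchvision import transforms

#   ImageNet statistics, used because the backbone was pre-trained on ImageNet
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

transform = transforms.Compose([
    transforms.Resize((224, 224)),                      #   bring every image to the backbone input size
    transforms.ToTensor(),                              #   PIL image -> float tensor [C, H, W] in [0, 1]
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),  #   normalize with ImageNet mean and std
])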

Captions Processing

As neural networks understand only numbers, not words, all of the captions need to be transformed into numbers.
This means that each unique word in the dataset should get a unique number to represent it.
For this task, a pre-built vocabulary is used; it contains a large number of words, each mapped to a unique index.
As the dataset may contain words that appear only once in the captions, the model would have a hard time learning such words.
Therefore, only frequent words are taken into account, while the uncommon words are left out; this is controlled by a frequency threshold, which means it is another hyper-parameter that can be tuned.
Moreover, the tokenized vocabulary holds several tokens with a special meaning (a sketch of such a vocabulary follows the list below):

  • <SOS> - Start of sentence
  • <EOS> - End of sentence
  • <PAD> - Padding, used to generate equal-size captions during training
  • <UKN> - Any word below the frequency threshold
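A minimal sketch of such a frequency-thresholded vocabulary (the project's real implementation is in dataloader.py; the whitespace tokenizer, the lowercasing, the index of <UKN> and the default threshold of 5 are assumptions):

from collections import Counter

class Vocabulary:
    def __init__(self, freq_threshold=5):
        #   special tokens get fixed, reserved indices (<PAD>=0, <SOS>=1, <EOS>=2 as in the printout above)
        self.itos = {0: '<PAD>', 1: '<SOS>', 2: '<EOS>', 3: '<UKN>'}
        self.stoi = {tok: idx for idx, tok in self.itos.items()}
        self.freq_threshold = freq_threshold

    def build(self, sentences):
        #   count every word in the corpus and keep only the frequent ones
        counts = Counter(word for s in sentences for word in s.lower().split())
        idx = len(self.itos)
        for word, freq in counts.items():
            if freq >= self.freq_threshold:
                self.stoi[word] = idx
                self.itos[idx] = word
                idx += 1

    def numericalize(self, sentence):
        #   rare or unseen words fall back to <UKN>
        return [self.stoi.get(word, self.stoi['<UKN>']) for word in sentence.lower().split()]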

Data Samples

def get_sentences(dataloader, captions):
    captions = captions.squeeze(1)
    captions = torch.permute(captions, (1, 0))
    num_sentences, num_word = captions.shape
    sentences = []
    for i in range(num_sentences):
        words = [ dataloader.dataset.vocab.itos[int(word)] for word in captions[i] ]    #   convert tokenizes to words
        eos_index = words.index('<EOS>')        #   find index of <EOS>
        words = words[1 : eos_index]            #   remove <SOS> and <EOS>
        sen = ' '.join(words)
        sentences.append(sen)

    return sentences
def show_example(dataloader, rows=4, cols=2):
    num_examples = cols*rows
    global_offset = 14
    font_size = 12
    transform = get_transformer('print')


    fig, axs = plt.subplots(rows, cols, sharey=True, figsize=(10*cols, 10*rows))
    for idx, (_, captions, img_id) in enumerate(dataloader):
        if idx == num_examples:
            break

        img = transform(Image.open('data/flickr8k/images/' + img_id[0]).convert('RGB'))

        # img = img.squeeze(0)
        img = np.transpose(img, (1, 2, 0))
        sentences = get_sentences(dataloader, captions)

        ridx, cidx = idx//cols, idx%cols
        axs[ridx, cidx].imshow(norm(img))
        offset = global_offset
        for sen in sentences:
            axs[ridx, cidx].text(2, offset, sen, fontsize=font_size, color='white', bbox=dict(facecolor='black', alpha=0.5))
            offset += global_offset
        axs[ridx, cidx].axis('off')
    plt.tight_layout()
    plt.show()

show_example(loader)

(figure: sample test images with their five reference captions overlaid)

Training

The training process involved several configurations and trials:
Two backbone models were tested for the encoder, ResNet-152 and InceptionV3.
Different numbers of LSTM layers were tested, from 2 up to 5.
Several learning rates were tried, as well as different numbers of epochs and batch sizes.

For the final configuration, the following was used:

  • Backbone: InceptionV3
  • Embedded Size: 512
  • Hidden Size: 512
  • LSTM Layers: 3
  • Batch Size: 32
  • learning_rate: 1e-4
  • num_epochs: 150
  • drop_rate: 0.5
  • Criterion: CrossEntropyLoss
  • Optimizer: Adam
  • Scheduler: ReduceLROnPlateau w/ factor=0.8, patience=2

The backbone was a pre-trained model and was kept frozen during the training phase.
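Roughly, the configuration above translates into the following setup (a sketch only: the stand-in model and the assumption that the loss ignores the <PAD> index are not taken from train.py):

import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import ReduceLROnPlateau

#   stand-in module so the snippet is self-contained; in the project this is the
#   full encoder-decoder model (sketched in the next section)
model = nn.LSTM(input_size=512, hidden_size=512, num_layers=3, dropout=0.5)

criterion = nn.CrossEntropyLoss(ignore_index=0)     #   0 = <PAD>, padded steps do not contribute to the loss
optimizer = Adam(model.parameters(), lr=1e-4)
scheduler = ReduceLROnPlateau(optimizer, factor=0.8, patience=2)

#   after each epoch, stepping the scheduler on the validation loss multiplies the
#   learning rate by 0.8 once the loss has stopped improving for 2 epochs, e.g.:
#   scheduler.step(valid_loss)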

Training Process

During training, an image first goes through the CNN model in order to extract its features.
After feature extraction, a linear layer maps the features to the vocabulary embedding size, with a dropout layer on top of it for better training.
This embedding is then fed into the decoder, which passes it through a stack of LSTM layers in order to generate a sequence of words.
For the final prediction, a linear layer with the size of the vocabulary maps the LSTM output to scores over the words.
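A minimal sketch of this encoder-decoder (not the project's model.py: ResNet-152 is used here because InceptionV3 needs extra handling of its auxiliary classifier, and the way the image features are prepended to the caption embeddings is an assumption):

import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    #   frozen pre-trained backbone followed by a trainable projection to the embedding size
    def __init__(self, embed_size, drop_rate=0.5):
        super().__init__()
        backbone = models.resnet152(pretrained=True)
        for p in backbone.parameters():
            p.requires_grad = False                                     #   the backbone is not trained
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])  #   drop the ImageNet classifier
        self.fc = nn.Linear(backbone.fc.in_features, embed_size)        #   map features -> embedding size
        self.dropout = nn.Dropout(drop_rate)

    def forward(self, images):
        feats = self.backbone(images).flatten(1)
        return self.dropout(self.fc(feats))

class DecoderRNN(nn.Module):
    #   embeds the caption tokens, runs them through the LSTM stack and maps to vocabulary scores
    def __init__(self, embed_size, hidden_size, vocab_size, num_layers=3, drop_rate=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers, dropout=drop_rate)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        #   features: [B, E], captions: [MS, B] (one caption per image, teacher forcing)
        embeddings = torch.cat((features.unsqueeze(0), self.embed(captions)), dim=0)
        hiddens, _ = self.lstm(embeddings)
        return self.fc(hiddens)                                          #   [MS + 1, B, vocab_size]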

Graphs

1-gram

(plot of the 1-gram BLEU score during training)

2-gram

(plot of the 2-gram BLEU score during training)

3-gram

(plot of the 3-gram BLEU score during training)

4-gram

(plot of the 4-gram BLEU score during training)

Loss

(plot of the loss during training)

Learning Rate

(plot of the learning rate during training)

Prediction Results

# test_path = 'runs/26-11-21_10:36/test'
# test_path = 'runs/26-11-21_16:03/test'
test_path = 'runs/26-11-21_20:20/test'
test_df = pd.read_csv(test_path + '/test.csv')
test_df
Unnamed: 0 image prediction loss 1-gram 2-gram 3-gram 4-gram
0 0 3150659152_2ace03690b.jpg <SOS> a man is standing on a rock overlooking ... 3.138403 0.636364 0.356753 0.112244 0.064841
1 1 2222498879_9e82a100ab.jpg <SOS> a dog is jumping over a hurdle . <EOS> 1.556955 0.625000 0.422577 0.143842 0.087836
2 2 3126752627_dc2d6674da.jpg <SOS> a football player in a red uniform is ru... 1.948640 0.427367 0.181596 0.065234 0.040041
3 3 3257207516_9d2bc0ea04.jpg <SOS> a man in a black shirt and a woman in a ... 3.116272 0.357143 0.230022 0.182766 0.125008
4 4 2289096282_4ef120f189.jpg <SOS> a man and a woman are sitting on a bench... 3.108447 0.411765 0.160422 0.055566 0.033272
... ... ... ... ... ... ... ... ...
1007 1007 1303727066_23d0f6ed43.jpg <SOS> a man in a black shirt and a woman in a ... 3.242519 0.230769 0.096077 0.033755 0.020222
1008 1008 534886684_a6c9f40fa1.jpg <SOS> a man in a black shirt and jeans is stan... 2.602398 0.529412 0.363803 0.095914 0.050105
1009 1009 2431723485_bc6b8e6418.jpg <SOS> a man in a red shirt and a black dog are... 2.363976 0.394458 0.203299 0.061354 0.034292
1010 1010 3373481779_511937e09d.jpg <SOS> a man in a red shirt and white shorts is... 2.990329 0.500000 0.196116 0.068436 0.041316
1011 1011 3265964840_5374ed9c53.jpg <SOS> a man in a red jacket is riding a bike o... 1.752711 0.600000 0.462910 0.320647 0.228942

1012 rows × 8 columns

def get_clean_sentence(sentence):
    stopwords = ['<SOS>', '<EOS>', '.']
    words_list = sentence.split()
    resultwords = [word for word in words_list if word.upper() not in stopwords]
    return ' '.join(resultwords)

def get_two_line_sentence(sentence, max_words=18):
    new_sen = sentence.split()
    return ' '.join(new_sen[ : max_words]) + '\n' + ' '.join(new_sen[ max_words : ])

def get_plot_sentence(sentence, max_words=18):
    clean_sentence = get_clean_sentence(sentence)
    if len(clean_sentence.split()) > max_words:
        return get_two_line_sentence(clean_sentence, max_words), True
    return clean_sentence, False

def show_example(dataloader, df, rows=4, cols=2):
    num_examples = cols*rows
    global_offset = 14
    font_size = 12
    max_words = 18

    transform = get_transformer('print')

    examples_df = df[ : num_examples]
    
    fig, axs = plt.subplots(rows, cols, sharey=True, figsize=(10*cols, 10*rows))
    for i in range(num_examples):
        img_id = examples_df.iloc[i]['image']
        img = transform(Image.open('data/flickr8k/images/' + img_id).convert('RGB'))

        img_index = np.where(np.array(dataloader.dataset.images) == img_id)[0][0]
        captions = dataloader.dataset.__getitem__(img_index)[1]

        img = np.transpose(img, (1, 2, 0))
        sentences = get_sentences(dataloader, captions)

        ridx, cidx = i//cols, i%cols
        axs[ridx, cidx].imshow(norm(img))
        offset = global_offset
        for sen in sentences:
            sen, two_lines = get_plot_sentence(sen, max_words)
            if two_lines:
                offset += global_offset//1.5
            axs[ridx, cidx].text(2, offset, sen, fontsize=font_size, color='white', bbox=dict(facecolor='black', alpha=0.5))
            offset += global_offset            
        
        df_img = df[df['image']==img_id]    #   look the image up in the dataframe passed to the function
        pred = df_img['prediction'].item()
        pred, two_lines = get_plot_sentence(pred, max_words)
        if two_lines:
            offset += global_offset//1.5

        axs[ridx, cidx].text(2, offset, pred, fontsize=font_size, color='black', bbox=dict(facecolor='white', alpha=0.5))

        filter_col = [col for col in df_img if col.endswith('-gram')]
        offset = img.shape[0] - ((len(filter_col) + 1) *  global_offset)    #   start near the bottom of the image

        loss = df_img['loss'].item()
        title = 'loss: {:.5f}'.format(loss)
        axs[ridx, cidx].text(2, offset, title, fontsize=font_size, color='black', bbox=dict(facecolor='white', alpha=0.5))
        offset += global_offset
        
        for col in filter_col:
            score = df_img[col].item()
            title = '{}: {:.5f}'.format(col, score)
            axs[ridx, cidx].text(2, offset, title, fontsize=font_size, color='black', bbox=dict(facecolor='white', alpha=0.5))
            offset += global_offset

        axs[ridx, cidx].axis('off')
    plt.tight_layout()
    plt.show()

Good Predictions

test_df = test_df.sort_values(by=['1-gram', 'loss'], ascending=False)
show_example(loader, test_df.drop_duplicates(subset=['prediction']))

(figure: test images with the highest 1-gram scores, showing the reference captions, the model's prediction, the loss and the BLEU scores)

Bad Predictions

test_df = test_df.sort_values(by=['1-gram', 'loss'], ascending=True)
show_example(loader, test_df.drop_duplicates(subset=['prediction']))

(figure: test images with the lowest 1-gram scores, shown in the same format)

References
