Image Captioning using Attention Mechanism

Subham Sarkar
Published in The Startup · 14 min read · Mar 4, 2020


Introduction

Caption generation is a challenging artificial intelligence problem where a textual description must be generated for a given photograph. It requires both methods from computer vision, to understand the content of the image, and a language model from natural language processing, to turn that understanding into words in the right order.

A “classic” image captioning system would encode the image using a pre-trained Convolutional Neural Network (the ENCODER), which would produce a hidden state h.

Then it would decode this hidden state with an LSTM (the DECODER) and recursively generate each word of the caption.

A classic image captioning model

Deep learning methods have demonstrated state-of-the-art results on caption generation problems. What is most impressive about these methods is a single end-to-end model can be defined to predict a caption, given a photo, instead of requiring sophisticated data preparation or a pipeline of specifically designed models.

Problem with ‘Classic’ Image Captioning Model

The problem with this method is that, when the model is trying to generate the next word of the caption, that word usually describes only a part of the image. The model is unable to capture the essence of the entire input image: using the whole image representation h to condition the generation of each word cannot efficiently produce different words for different parts of the image. This is exactly where an Attention mechanism is helpful.

Concept of Attention Mechanism:

With an Attention mechanism, the image is first divided into n parts, and we use a Convolutional Neural Network (CNN) to compute representations h1, …, hn of each part. When the RNN is generating a new word, the attention mechanism focuses on the relevant part of the image, so the decoder only uses specific parts of the image.

Image Captioning using Attention Mechanism

We can recognize the structure of the “classic” image captioning model, but with a new attention layer. What happens when we want to predict the next word of the caption? If we have already predicted i words, the hidden state of the LSTM is hi. We select the “relevant” part of the image by using hi as the context. The output of the attention model, zi, is the representation of the image filtered so that only the relevant parts remain; it is used as an input to the LSTM, which then predicts a new word and returns a new hidden state hi+1.
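To make this concrete, here is a minimal NumPy sketch (function and variable names are mine, purely illustrative, not the model's actual code): given the current hidden state hi and the region representations h1, …, hn, we score each region against hi, turn the scores into weights with a softmax, and take the weighted sum of regions as zi.

```python
import numpy as np

def attention_context(h_i, regions):
    """Toy soft attention: h_i is (d,), regions is (n, d).
    Returns the context z_i as a weighted sum of region features."""
    scores = regions @ h_i                      # (n,) dot-product similarity
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()
    z_i = weights @ regions                     # (d,) attention-weighted context
    return z_i, weights

# Usage: 196 image regions with 512-dim features, 512-dim decoder state.
regions = np.random.randn(196, 512)
h_i = np.random.randn(512)
z_i, w = attention_context(h_i, regions)
print(z_i.shape, w.shape)  # (512,) (196,)
```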

Types of Attention Mechanism:

Attention can be broadly divided into two types:

  1. Global Attention(Luong’s Attention): Attention is placed on all source positions.
  2. Local Attention(Bahdanau Attention): Attention is placed only on a few source positions.
Global vs Local Attention mechanisms

Both attention-based models differ from the normal encoder-decoder architecture only in the decoding phase. They differ from each other in how they compute the context vector c(t).

A Few Explanations:

  1. Global Attention

Global attention takes all encoder hidden states into consideration when deriving the context vector c(t). To calculate c(t), we first compute a(t), a variable-length alignment vector. The alignment vector is derived by computing a similarity measure between h(t) and h_bar(s), where h(t) is the current target (decoder) hidden state and h_bar(s) are the source (encoder) hidden states. Similar states in the encoder and decoder are assumed to refer to the same meaning.

2. Local Attention

Because Global attention attends to all source-side words for every target word, it is computationally very expensive and becomes impractical when translating long sentences. To overcome this deficiency, local attention chooses to focus only on a small subset of the encoder's hidden states per target word.

Score for Local Attention

Let’s discuss how the Attention Mechanism works

For images, we typically use representations from one of the fully connected layers. But suppose, as shown in the figure below, that a man is throwing a frisbee.

So, when I say the word ‘man’, we need to focus only on the man in the image, and when I say the word ‘throwing’, we have to focus on his hand. Similarly, when we say ‘frisbee’, we have to focus only on the frisbee. This means ‘man’, ‘throwing’ and ‘frisbee’ come from different pixels in the image. But the fully connected VGG-16 representation we used does not contain any location information.

But every location in a convolutional layer's output corresponds to some location in the image, as shown below.

VGG-16

Now, for example, the output of the 5th convolution block of VGGNet (before pooling) is a 14×14×512 feature map.

This feature map has 14×14 = 196 pixel locations, each corresponding to a certain portion of the image.

We can therefore treat these 196 locations (each with a 512-dimensional representation) as the set of candidate locations to attend over.

The model will then learn an attention distribution over these locations (which in turn correspond to actual regions of the image).

As shown in the figure above, the 5th convolution block is thus represented by 196 location vectors, which the decoder can attend over at each time step.
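As a hedged sketch of this reshaping (assuming a Keras VGG16 and the last conv layer of block 5, which yields 14×14×512 for 224×224 inputs), the feature map can be flattened into 196 location vectors of dimension 512:

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG16

# Assumption: we tap block5_conv3 (14x14x512 for 224x224 inputs).
vgg = VGG16(include_top=False, weights='imagenet', input_shape=(224, 224, 3))
feature_extractor = tf.keras.Model(vgg.input, vgg.get_layer('block5_conv3').output)

images = tf.random.uniform((8, 224, 224, 3))                     # dummy batch
features = feature_extractor(images)                             # (8, 14, 14, 512)
locations = tf.reshape(features, (features.shape[0], -1, 512))   # (8, 196, 512)
print(locations.shape)
```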

Let’s discuss the EQUATIONS:

Let’s discuss the equations for Local (Bahdanau) Attention and for Global (Luong) Attention with the “general” score:
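The equation figures from the original post are not reproduced here, so below is a reconstruction of the standard formulas from the cited papers, in my own notation: h_t is the target (decoder) hidden state, h̄_s are the source (encoder) states, and s_{t-1} is the previous decoder state in the additive case.

```latex
% Global (Luong) attention with the "general" score
\mathrm{score}(h_t, \bar{h}_s) = h_t^{\top} W_a \bar{h}_s
\qquad
a_t(s) = \frac{\exp\big(\mathrm{score}(h_t, \bar{h}_s)\big)}
              {\sum_{s'} \exp\big(\mathrm{score}(h_t, \bar{h}_{s'})\big)}
\qquad
c_t = \sum_{s} a_t(s)\, \bar{h}_s

% Additive (Bahdanau-style) score; the same softmax and weighted sum follow
e_{t,s} = v_a^{\top} \tanh\!\big(W_a\, s_{t-1} + U_a\, \bar{h}_s\big)
```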

Then why does it work so well?

  1. It works because it is a better modelling technique.
  2. It is a more informed model.
  3. We are essentially asking the model to approach the problem in a better (more natural) way.
  4. Given enough data, it should be able to learn these attention weights just as humans do.
  5. And in practice these models do indeed work better than the vanilla Encoder-Decoder models.

Few examples :

In the figure below, we can see, for each word of the caption, which part of the image (in white) is used to generate it.

Attention Mechanism(Source)

For more examples, we can look at the “relevant” part of each image used to generate the underlined words.

Attention Mechanism(Source)

Data Acquisition

There are many open-source datasets available for this problem, like Flickr 8k (containing 8k images), Flickr 30k (containing 30k images), MS COCO (containing 180k images), etc.

But for the purpose of this case study, I have used the Flickr 8k dataset, which you can download from here. Also, training a model with a large number of images may not be feasible on a system that is not a very high-end PC/laptop.

This dataset contains 8000 images, each with 5 captions (as we have already seen in the Introduction, an image can have multiple captions, all relevant simultaneously).

These images are split as follows:

  • Training Set — 6000 images
  • Dev Set — 1000 images
  • Test Set — 1000 images

Let me walk you through the CODE:

Utility Functions:

  1. To load the file/document.
  2. To clean the data, i.e. removing punctuation, single characters and numeric values from the text.
Utility function to load and clean data
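The embedded gist is not reproduced in text, so here is a minimal sketch of what such utilities typically look like (the function names are my own, not necessarily those in the repository):

```python
import string

def load_doc(filename):
    """Read the whole file into a single string."""
    with open(filename, 'r') as f:
        return f.read()

def clean_caption(caption):
    """Lowercase, strip punctuation, and drop single characters and numeric tokens."""
    table = str.maketrans('', '', string.punctuation)
    words = caption.lower().split()
    words = [w.translate(table) for w in words]
    words = [w for w in words if len(w) > 1 and w.isalpha()]
    return ' '.join(words)

print(clean_caption("A child in a pink dress is climbing up a set of stairs ."))
```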

Let’s view the data that has been loaded:

Loaded Data

Let’s create a DataFrame out of this raw text data:

Making dataframe out of raw text
DataFrame created
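As a hedged sketch of this step: assuming the standard Flickr 8k caption file (Flickr8k.token.txt), where each line has the form `image_name#caption_index<TAB>caption`, a DataFrame can be built roughly like this (this is not the post's exact code):

```python
import pandas as pd

def captions_to_dataframe(token_file):
    """Parse 'image.jpg#idx<TAB>caption' lines into a DataFrame."""
    rows = []
    with open(token_file, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_id, caption = line.split('\t', 1)
            filename, index = image_id.split('#')
            rows.append({'filename': filename, 'index': int(index), 'caption': caption})
    return pd.DataFrame(rows)

df = captions_to_dataframe('Flickr8k.token.txt')
print(df.head())
```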

Exploratory Data Analysis

  • Let’s explore the data to gain some insight so that we can approach the problem better.
EDA

Plotting the images and their respective captions for better visualisation

Code
Plotted Results

Let’s view the word counts to find the frequency of words in our dataset

Code
Plotted Results

Preprocessing of Images and Captions

  • Here we set the path for each image so that we can later load all the images using these paths.
Preprocessing the images
  • Preprocessing the captions (adding ‘<start>’ and ‘<end>’ tags to every caption), so that our model understands where each caption starts and ends.
Preprocessing the captions
  • Now let’s reshape every image to 224x224x3, since we will be using the VGG-16 model (transfer learning).
Reshaping of images
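A hedged sketch of this resizing step with Keras utilities (the image paths are placeholders, and this may differ from the repository's exact code):

```python
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.applications.vgg16 import preprocess_input

def load_image(path, target_size=(224, 224)):
    """Load an image, resize it to 224x224x3 and apply VGG-16 preprocessing."""
    img = load_img(path, target_size=target_size)
    arr = img_to_array(img)        # (224, 224, 3)
    arr = preprocess_input(arr)    # zero-center channels as VGG-16 expects
    return arr

# Placeholder file names, for illustration only.
batch = np.stack([load_image(p) for p in ['img1.jpg', 'img2.jpg']])
print(batch.shape)  # (2, 224, 224, 3)
```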

Defining the pre-trained Image Model (VGG-16):

  • The following creates an instance of the VGG16 model using the Keras API. This automatically downloads the required files if you don’t have them already.
Defining the VGG-16 model
  • The VGG16 model was pre-trained on the ImageNet data-set for classifying images. The VGG16 model contains a convolutional part and a fully-connected (or dense) part which is used for the image classification.
  • If include_top=True then the whole VGG16 model is downloaded which is about 528 MB. If include_top=False then only the convolutional part of the VGG16 model is downloaded which is just 57 MB.
  • We will use some of the fully-connected layers in this pre-trained model, so we have to download the full model, but if you have a slow internet connection, then you can try and modify the code below to use the smaller pre-trained model without the classification layers.
modelvgg.summary()
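Since the gist itself is not shown here, below is a minimal sketch of one way to define the model and expose an intermediate layer as the image embedding. The choice of the 'fc2' layer (the second 4096-dimensional fully-connected layer of Keras' VGG16) is an assumption; the layer tapped in the original code may differ.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model

# Full model (include_top=True) so the fully-connected layers are available.
modelvgg = VGG16(include_top=True, weights='imagenet')
modelvgg.summary()

# Re-wire the model to output the 'fc2' layer (4096-d), used as a fixed image embedding.
modelvgg = Model(inputs=modelvgg.input, outputs=modelvgg.get_layer('fc2').output)
```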

Now, let’s prepare our images and create our image dataset, i.e. reshape every image to 224x224x3 before feeding it to the VGG-16 model.

For the captions, let’s perform tokenization and create the vocabulary.

  • Here we will tokenise the captions and create a vocabulary of the words present in our data corpus.
  • Then we will create vector notations for each word in our vocabulary.
  • N.B. Words that do not appear in the vocabulary will be assigned the <unk> token.
  • Let’s create the sequence representation of each caption in our corpus.
  • Now that we have turned the captions into sequences, they are of different lengths, so we need to pad the sequences to the maximum caption length (a sketch follows below).
To find max and min length of the captions
To pad the sequences to its max length
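A hedged sketch of the tokenisation and padding with Keras (the example captions, filter string and settings are assumptions, not necessarily the post's exact choices):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = ['<start> a dog runs on the grass <end>',
            '<start> a man is throwing a frisbee <end>']

# Custom filters keep the <start>/<end> markers; oov_token maps unseen words to <unk>.
tokenizer = Tokenizer(oov_token='<unk>',
                      filters='!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')
tokenizer.fit_on_texts(captions)

sequences = tokenizer.texts_to_sequences(captions)
max_length = max(len(s) for s in sequences)
padded = pad_sequences(sequences, maxlen=max_length, padding='post')
print(padded.shape, max_length)
```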

Train-Test Split

  • Splitting the dataset (images and captions) in an 80:20 ratio, i.e. [train : test].
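A minimal sketch of the split with scikit-learn; the arrays here are random placeholders standing in for the prepared image features and padded caption sequences:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholders: 100 image feature vectors and their padded caption sequences.
image_features = np.random.randn(100, 4096)
caption_seqs = np.random.randint(0, 5000, size=(100, 20))

img_train, img_test, cap_train, cap_test = train_test_split(
    image_features, caption_seqs, test_size=0.2, random_state=42)
print(img_train.shape, img_test.shape)  # (80, 4096) (20, 4096)
```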

Defining ENCODER (VGG-16) Model

Implementing Attention Mechanism and GRU DECODER

1. Global Attention (Luong’s Attention)

The entire step-by-step process of applying Attention in Luong’s paper is as follows:

  1. Producing the Encoder Hidden States — the encoder produces a hidden state for each element in the input sequence.
  2. Decoder RNN — the previous decoder hidden state and decoder output are passed through the Decoder RNN to generate a new hidden state for that time step.
  3. Calculating Alignment Scores — alignment scores are calculated from the new decoder hidden state and the encoder hidden states.
  4. Softmaxing the Alignment Scores — the alignment scores for all encoder hidden states are collected into a single vector and subsequently softmaxed.
  5. Calculating the Context Vector — the encoder hidden states are weighted by their respective alignment scores and summed to form the context vector.
  6. Producing the Final Output — the context vector is concatenated with the decoder hidden state generated in step 2 and passed through a fully connected layer to produce a new output.
  7. The process (steps 2–6) repeats itself for each decoder time step until an <end> token is produced or the output exceeds the specified maximum length.
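A minimal TensorFlow sketch of a Luong-style attention layer with the “general” score (score = h_tᵀ·W_a·h̄_s). The class and variable names are my own and not necessarily those used in the repository; it assumes the encoder features and decoder state have compatible dimensions.

```python
import tensorflow as tf

class LuongAttention(tf.keras.layers.Layer):
    """Global (Luong) attention with the 'general' score."""
    def __init__(self, units):
        super().__init__()
        self.W_a = tf.keras.layers.Dense(units)

    def call(self, decoder_hidden, encoder_outputs):
        # decoder_hidden: (batch, units); encoder_outputs: (batch, n_locations, units)
        query = tf.expand_dims(decoder_hidden, 1)                                # (batch, 1, units)
        scores = tf.matmul(query, self.W_a(encoder_outputs), transpose_b=True)   # (batch, 1, n)
        weights = tf.nn.softmax(scores, axis=-1)                                 # alignment vector a_t
        context = tf.matmul(weights, encoder_outputs)                            # (batch, 1, units)
        return tf.squeeze(context, 1), tf.squeeze(weights, 1)
```

In the full model, the returned context vector would then be concatenated with the GRU hidden state and passed through a fully connected layer, as described in step 6.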

2. Local Attention (Bahdanau Attention)

The entire step-by-step process of applying Attention in Bahdanau’s paper is as follows:

  1. Producing the Encoder Hidden States — the encoder produces a hidden state for each element in the input sequence.
  2. Calculating Alignment Scores — alignment scores are calculated between the previous decoder hidden state and each of the encoder’s hidden states. (Note: the last encoder hidden state can be used as the first decoder hidden state.)
  3. Softmaxing the Alignment Scores — the alignment scores for all encoder hidden states are collected into a single vector and subsequently softmaxed.
  4. Calculating the Context Vector — the encoder hidden states are weighted by their respective alignment scores and summed to form the context vector.
  5. Decoding the Output — the context vector is concatenated with the previous decoder output and fed into the Decoder RNN for that time step, along with the previous decoder hidden state, to produce a new output.
  6. The process (steps 2–5) repeats itself for each decoder time step until an <end> token is produced or the output exceeds the specified maximum length.
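For comparison, here is a sketch of Bahdanau-style additive attention (score = v_aᵀ·tanh(W1·h̄_s + W2·s_{t-1})), in the style of the TensorFlow image-captioning tutorial listed in the references; again, the names are mine:

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: score = v_a^T tanh(W1 h_s + W2 s_{t-1})."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, decoder_hidden, encoder_outputs):
        # decoder_hidden: (batch, dec_units); encoder_outputs: (batch, n_locations, enc_units)
        query = tf.expand_dims(decoder_hidden, 1)                                    # (batch, 1, dec_units)
        scores = self.V(tf.nn.tanh(self.W1(encoder_outputs) + self.W2(query)))       # (batch, n, 1)
        weights = tf.nn.softmax(scores, axis=1)                                      # alignment over locations
        context = tf.reduce_sum(weights * encoder_outputs, axis=1)                   # (batch, enc_units)
        return context, weights
```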

Selecting Optimiser, defining Loss Function and Setting checkpoints

Let’s also set up the TensorBoard Summary Writer

Tensorboard SummaryWriter
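A hedged sketch of this setup. The directory names are placeholders, and the tiny `encoder`/`decoder` models below only stand in for the real CNN encoder and attention GRU decoder so the snippet is self-contained:

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    """Masked loss: padded positions (token id 0) do not contribute."""
    mask = tf.cast(tf.math.not_equal(real, 0), tf.float32)
    loss_ = loss_object(real, pred) * mask
    return tf.reduce_mean(loss_)

# Placeholder models standing in for the real encoder/decoder defined earlier.
encoder = tf.keras.Sequential([tf.keras.layers.Dense(256)])
decoder = tf.keras.Sequential([tf.keras.layers.Dense(256)])

# Checkpointing and TensorBoard logging.
ckpt = tf.train.Checkpoint(encoder=encoder, decoder=decoder, optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, './checkpoints', max_to_keep=5)
summary_writer = tf.summary.create_file_writer('./logs')
```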

Training Step:

  • The ENCODER output, the hidden state (initialised to 0) and the DECODER input (the <start> token) are passed to the DECODER.
  • The DECODER returns the predictions and the DECODER hidden state.
  • The DECODER hidden state is then passed back into the model, and the predictions are used to calculate the loss. While training, we use the Teacher Forcing technique to decide the next input of the DECODER.
  • Teacher Forcing is the technique where the target word is passed as the next input to the DECODER. It helps the model learn the correct sequence, or the correct statistical properties of the sequence, quickly.
  • The final step is to calculate the gradients, backpropagate them and apply them with the optimizer.
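A condensed sketch of such a training step. It assumes the `encoder`, an attention GRU `decoder` exposing a `reset_state` method and returning (predictions, hidden, attention_weights), the `tokenizer`, `optimizer` and `loss_function` from the earlier snippets; it is not the repository's exact code.

```python
import tensorflow as tf

@tf.function
def train_step(img_tensor, target):
    loss = 0.0
    # Decoder hidden state starts at zero; the first input is the <start> token.
    hidden = decoder.reset_state(batch_size=target.shape[0])
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)

    with tf.GradientTape() as tape:
        features = encoder(img_tensor)
        for i in range(1, target.shape[1]):
            predictions, hidden, _ = decoder(dec_input, features, hidden)
            loss += loss_function(target[:, i], predictions)
            # Teacher forcing: feed the ground-truth word as the next input.
            dec_input = tf.expand_dims(target[:, i], 1)

    trainable_variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, trainable_variables)
    optimizer.apply_gradients(zip(gradients, trainable_variables))
    return loss / int(target.shape[1])
```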

Testing Step

  • It is similar to the training step, except that we do not update the gradients, and we feed the predicted output (rather than the target word) as the decoder input at the next time step.
  • The test step is needed to find out whether the model is overfitting.

Let’s start the training now

Tensorboard Logs:

Let’s plot the train and test losses to check overfitting

  • The graph below shows overfitting. This might be due to the limited training data; training the model on larger datasets like MS-COCO or Flickr 30k could help solve this.

Evaluating the Captioning Model:

  • The evaluate function is similar to the training loop, except that we do not use Teacher Forcing here. The input to the DECODER at each time step is its previous prediction, along with the hidden state and the ENCODER output.

A few key points to remember while making predictions:

1. Stop predicting when the model predicts the <end> token.

2. Store the attention weights for every time step.

Given below are two methods to evaluate the captions

  1. Greedy Approach
  2. Beam Search

Greedy Approach

  • This is also called Maximum Likelihood Estimation (MLE), i.e. we select the word that is most likely according to the model for the given input. It is also called Greedy Search, as we greedily select the word with the maximum probability.
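A minimal sketch of greedy decoding, again assuming the `encoder`, `decoder` (with `reset_state`) and `tokenizer` from the earlier snippets:

```python
import tensorflow as tf

def greedy_caption(img_tensor, max_length):
    """Pick the single most probable word at every step until <end>."""
    hidden = decoder.reset_state(batch_size=1)
    features = encoder(img_tensor)
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []
    for _ in range(max_length):
        predictions, hidden, _ = decoder(dec_input, features, hidden)
        predicted_id = int(tf.argmax(predictions[0]))
        word = tokenizer.index_word[predicted_id]
        if word == '<end>':
            break
        result.append(word)
        # The prediction itself becomes the next decoder input (no teacher forcing).
        dec_input = tf.expand_dims([predicted_id], 0)
    return ' '.join(result)
```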

Beam Search

  • Here we take the top k predictions, feed them back into the model, and then sort the resulting sequences using the probabilities returned by the model, so the list always contains the top k candidate sequences. In the end, we take the one with the highest probability and follow it until we encounter <end> or reach the maximum caption length.
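A simplified beam-search sketch over the same decoder (log-probabilities are summed; the beam width k, the assumption that the decoder returns logits, and the helper names are all mine):

```python
import numpy as np
import tensorflow as tf

def beam_search_caption(img_tensor, max_length, k=3):
    """Keep the k highest-scoring partial captions at every step."""
    features = encoder(img_tensor)
    start_id = tokenizer.word_index['<start>']
    end_id = tokenizer.word_index['<end>']
    hidden = decoder.reset_state(batch_size=1)
    # Each beam: (token ids so far, cumulative log-probability, decoder hidden state)
    beams = [([start_id], 0.0, hidden)]
    for _ in range(max_length):
        candidates = []
        for tokens, score, hidden in beams:
            if tokens[-1] == end_id:          # finished captions are carried over unchanged
                candidates.append((tokens, score, hidden))
                continue
            dec_input = tf.expand_dims([tokens[-1]], 0)
            predictions, new_hidden, _ = decoder(dec_input, features, hidden)
            log_probs = tf.nn.log_softmax(predictions[0]).numpy()
            for idx in np.argsort(log_probs)[-k:]:   # expand with the k best next words
                candidates.append((tokens + [int(idx)], score + log_probs[idx], new_hidden))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    best = beams[0][0]
    words = [tokenizer.index_word[i] for i in best[1:] if i != end_id]
    return ' '.join(words)
```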

Helper function to visualise the attention points that predict the words.

Metric used: BLEU Score

  • We use the BLEU metric to evaluate the captions generated for the test set. BLEU essentially measures the fraction of n-grams in the predicted sentence that appear in the ground truth.
  • BLEU is a well-acknowledged metric for measuring the similarity of one hypothesis sentence to multiple reference sentences. Given a single hypothesis sentence and multiple reference sentences, it returns a value between 0 and 1; a value close to 1 means the two are very similar.
  • To know more, please click here.
Code
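A minimal sketch of computing BLEU with NLTK (the reference and candidate sentences here are made-up examples, and the smoothing choice is an assumption):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    'a child is climbing the stairs'.split(),
    'a girl in a pink dress goes up a staircase'.split(),
]
candidate = 'a child climbing up the stairs'.split()

# Smoothing avoids zero scores when some higher-order n-grams have no overlap.
smooth = SmoothingFunction().method1
score = sentence_bleu(references, candidate, smoothing_function=smooth)
print(f'BLEU: {score:.3f}')
```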

Check the outputs to better understand the BLEU Score:

Results Obtained:

Conclusion:

  1. So, all in all, I must say that my naive first-cut model, without any rigorous hyper-parameter tuning, does a decent job of generating captions for images.
  2. We must understand that the images used for testing must be semantically related to those used for training the model. For example, if we train our model on images of cats, dogs, etc., we must not test it on images of airplanes, waterfalls, etc. In such cases the distributions of the train and test sets are very different, and no Machine Learning model in the world will give good performance.
  3. Beam Search generated better results than Greedy Search.

Future Work:

Of course, this is just a first-cut solution, and a lot of modifications can be made to improve it, like:

  • Using a larger dataset.
  • Changing the model architecture (adding Batch Normalization layers, Dropout, etc.).
  • Doing more hyper-parameter tuning (learning rate, batch size, number of layers, number of units, dropout rate, batch normalisation, etc.).
  • Keeping up the research on this topic and optimising the solution even further.
  • Serving this model as an API using Flask and deploying it on AWS.

Where can you find my code?

Github link : https://github.com/SubhamIO/Image-Captioning-using-Attention-Mechanism-Local-Attention-and-Global-Attention-

You can also connect with me on LinkedIn: https://www.linkedin.com/in/subham-sarkar-4224aa147/

Thanks for reading !

  • If you found my blog useful, please do clap, share and follow me.

References:

  1. https://www.appliedaicourse.com/lecture/11/applied-machine-learning-online-course/4150/attention-models-in-deep-learning/8/module-8-neural-networks-computer-vision-and-deep-learning
  2. Deep Learning lectures by Prof. Mitesh M. Khapra (IIT Madras): https://www.youtube.com/watch?v=yInilk6x-OY&list=PLyqSpQzTE6M9gCgajvQbc68Hk_JKGBAYT&index=115
  3. Neural Machine Translation (research paper): https://arxiv.org/pdf/1409.0473.pdf
  4. Local Attention: https://arxiv.org/pdf/1502.03044.pdf
  5. Global Attention: https://arxiv.org/pdf/1508.04025.pdf
  6. TensorFlow tutorial: https://www.tensorflow.org/tutorials/text/image_captioning
  7. https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
  8. https://towardsdatascience.com/intuitive-understanding-of-attention-mechanism-in-deep-learning-6c9482aecf4f
  9. CS231n: video by Andrej Karpathy on Image Captioning: https://www.youtube.com/watch?v=NfnWJUyUJYU&list=PLkt2uSq6rBVctENoVBg1TpCC7OQi31AlC
