In our implementation we follow the original MD-SEM paper.
As in the original paper, the model consists of three main parts: an encoder, a Temporal Excitation Module (TEM), and a decoder. For the encoder, we use the Xception network pre-trained on ImageNet. The TEM uses a Long Short-Term Memory (LSTM) network to generate scaling vectors that re-weight the feature maps differently at each of T timesteps (T = 3 in our implementation). For the decoder, we use a simple module consisting of three convolution-plus-upsampling blocks.
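The core TEM operation can be illustrated with a minimal NumPy sketch. The shapes and names below are illustrative assumptions, not taken from the MD-SEM code: we assume the LSTM emits one scaling vector per timestep with one weight per feature channel, and that the same encoder features are re-weighted at every timestep.

```python
import numpy as np

# Hypothetical shapes: T timesteps, C feature channels, H x W spatial size.
T, C, H, W = 3, 4, 2, 2
features = np.ones((C, H, W))                    # encoder output, shared across timesteps
scales = np.arange(T * C).reshape(T, C) / (T * C)  # stand-in for the LSTM's scaling vectors

# Re-weight the same feature maps differently at each timestep:
# broadcast each (C,) scaling vector over the (C, H, W) feature stack.
reweighted = scales[:, :, None, None] * features[None, :, :, :]
print(reweighted.shape)  # (3, 4, 2, 2): one re-weighted feature stack per timestep
```

Each of the T re-weighted stacks is then passed to the decoder to produce one saliency map per timestep.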
The model is trained on SALICON-MD, which is generated from the SALICON dataset; we use 10,000 training and 5,000 validation images. In our experiments we trained the model with the Kullback-Leibler divergence loss. As the optimizer, we use Adam with an initial learning rate of 0.0001, reduced by a factor of ten every three epochs. Our implementation achieves reasonable results after 5 epochs.
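The loss and schedule above can be sketched as follows. This is a minimal NumPy version for clarity, with an assumed epsilon for numerical stability; the actual training code operates on batched tensors.

```python
import numpy as np

def kl_divergence(pred, target, eps=1e-7):
    """KL divergence between predicted and ground-truth saliency maps,
    each normalized to sum to 1 (treated as probability distributions)."""
    pred = pred / (pred.sum() + eps)
    target = target / (target.sum() + eps)
    return float(np.sum(target * np.log(target / (pred + eps) + eps)))

def lr_at_epoch(epoch, base_lr=1e-4):
    """Adam learning rate: 1e-4, reduced by a factor of ten every 3 epochs."""
    return base_lr * (0.1 ** (epoch // 3))
```

With this schedule, epochs 0-2 train at 1e-4 and epochs 3-5 at 1e-5, so a 5-epoch run uses two learning-rate levels.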
| Dataset | AUC | CC | KLDiv | NSS | SIM |
|---|---|---|---|---|---|
| SALICON-MD | 0.90 | 0.75 | 0.38 | 1.45 | 0.68 |
| CodeCharts1k | 0.84 | 0.56 | 0.97 | 1.84 | 0.48 |
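For reference, the CC and NSS columns follow their standard definitions; a hedged NumPy sketch (the evaluation details below are our own simplification, not the exact pipeline used to produce the table):

```python
import numpy as np

def cc(pred, gt, eps=1e-7):
    """Pearson correlation coefficient between predicted and ground-truth
    saliency maps: normalize both to zero mean / unit std, then average
    the elementwise product."""
    p = (pred - pred.mean()) / (pred.std() + eps)
    g = (gt - gt.mean()) / (gt.std() + eps)
    return float((p * g).mean())

def nss(pred, fixations, eps=1e-7):
    """Normalized Scanpath Saliency: mean of the normalized prediction at
    human fixation locations (fixations is a binary fixation map)."""
    p = (pred - pred.mean()) / (pred.std() + eps)
    return float(p[fixations > 0].mean())
```

Higher is better for AUC, CC, NSS, and SIM; lower is better for KLDiv.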
Pre-trained versions of the model are available here. Use the provided notebooks to try out the model, making sure to change the model location path. If you want to train it yourself, download the SALICON dataset and generate SALICON-MD using the TempSal code.