XAI Methods - Saliency

Published 21 Feb 2022 · 10 min read

What is Saliency and why is the name confusing?

Saliency [1] is one of the first attribution methods designed to visualize input attribution for convolutional networks. Because the word "saliency" is also used for the whole family of approaches that display input attribution as a saliency map, this method is also known as Vanilla Gradient.

The idea behind the Saliency method starts from class visualization: finding an $L_2$-regularized image $I$ that maximizes the score $S_c$ for a given class $c$. It can be written formally as:

$$\arg\max_{I} S_{c}(I) - \lambda \|I\|_{2}^{2}$$

where $\lambda$ is a regularisation parameter. To find $I$, we can use back-propagation. Unlike in the standard training process, we back-propagate with respect to the input image rather than the network weights, which stay fixed. This optimization produces images that visualize a particular class as the model sees it (see Fig. 1).
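
The optimization described here can be sketched as plain gradient ascent on the input. This is a minimal NumPy sketch, not the authors' implementation: the toy score function `score_and_grad`, its random weights `W` and `v`, and the values of `lam`, `lr` and `steps` are all assumptions made for illustration. With a real network, the gradient with respect to the input would come from back-propagation in a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a trained network: S_c(I) = v . tanh(W I).
# The weights are random and stay fixed; only the input I is optimized.
W = rng.normal(scale=0.5, size=(8, 16))
v = rng.normal(size=8)


def score_and_grad(I):
    """Return the class score S_c(I) and its gradient with respect to I."""
    h = np.tanh(W @ I)
    score = v @ h
    grad = W.T @ (v * (1.0 - h ** 2))  # chain rule through tanh
    return score, grad


def objective(I, lam):
    """L2-regularized objective: S_c(I) - lam * ||I||_2^2."""
    score, _ = score_and_grad(I)
    return score - lam * np.sum(I ** 2)


def class_visualization(lam=0.1, lr=0.01, steps=500):
    """Gradient ascent on the input, starting from an all-zero image."""
    I = np.zeros(16)
    for _ in range(steps):
        _, grad = score_and_grad(I)
        I += lr * (grad - 2.0 * lam * I)  # gradient of the full objective
    return I


I_star = class_visualization()
# The regularized objective should be higher at I_star than at the start.
print(objective(np.zeros(16), 0.1), objective(I_star, 0.1))
```

The regularization term keeps the pixel values from growing without bound; without it, the ascent would keep pushing the input towards ever larger magnitudes wherever that still raises the score.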

Figure 1: The class model visualizations for several classes, source [1].

From class visualization to Saliency

This idea can be extended: with minor modifications, we can query for the spatial support of class $c$ in a given image $I_0$. To do this, we rank the pixels of $I_0$ by their importance for the score $S_c(I_0)$. The authors assume that $S_c(I)$ can be approximated by a linear function in the neighborhood of $I_0$:

$$S_{c}(I) \approx w^\intercal I + b$$

For an input image $I_0 \in \mathbb{R}^{m \times n}$ and a class $c$, we can compute a saliency map $A \in \mathbb{R}^{m \times n}$ (where $m$ and $n$ are the height and width of the input in pixels). All we have to do is compute the derivative $w = \partial S_c / \partial I$ at $I_0$ and rearrange the elements of the returned vector.
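
To make the linear approximation concrete, the sketch below computes $w$ as the gradient of the score at $I_0$ for a grey-scale image and reshapes its magnitude into an $m \times n$ map. It is an illustrative NumPy sketch under stated assumptions: `score` and `score_grad` describe a hypothetical toy model with random weights, not a trained network, and in practice $w$ would be obtained with a single back-propagation pass.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 5  # height and width of the toy grey-scale image

# Hypothetical toy model: S_c(I) = v . tanh(W vec(I)), random fixed weights.
W = rng.normal(scale=0.5, size=(8, m * n))
v = rng.normal(size=8)


def score(I):
    """Class score S_c(I) for an image of shape (m, n)."""
    return v @ np.tanh(W @ I.ravel())


def score_grad(I):
    """Derivative of the score with respect to every pixel, as a flat vector."""
    h = np.tanh(W @ I.ravel())
    return W.T @ (v * (1.0 - h ** 2))


I0 = rng.uniform(size=(m, n))
w = score_grad(I0)              # w = dS_c/dI evaluated at I0
b = score(I0) - w @ I0.ravel()  # offset so the linear model matches at I0

# Close to I0 the linear model w^T I + b tracks the true score.
I_near = I0 + 1e-4 * rng.normal(size=(m, n))
linear = w @ I_near.ravel() + b
print(abs(linear - score(I_near)))  # approximation error near I0

# Grey-scale saliency map: gradient magnitude rearranged to the image shape.
A = np.abs(w).reshape(m, n)
```

The approximation only holds close to $I_0$, which is why the resulting map describes the sensitivity of the score around this particular image rather than the model's behavior in general.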

The method handles the input differently based on the number of channels in the image $I_0$. For a grey-scale image (one color channel), we take the absolute value of each element of $w$ and rearrange the result to match the shape of the image. If there is more than one channel, we take the maximum absolute value across the channels of each pixel:

$$A_{i,j} = \max_{ch} |w_{h(i,j,ch)}|$$

where $ch$ is a color channel of the pixel $(i,j)$ and $h(i,j,ch)$ is the index of the element of $w$ that corresponds to channel $ch$ of that pixel. With the resulting map, we can visualize pixel importance for the input image $I_0$, as shown in Figure 2.
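
The channel reduction can be written in one line of NumPy. In this sketch the gradient array `w` is filled with made-up numbers and is assumed to already be arranged as height × width × channels; frameworks that store images channels-first would take the maximum over a different axis.

```python
import numpy as np

# Hypothetical gradient of S_c with respect to a 2 x 2 RGB image,
# laid out as (height, width, channels). The values are made up.
w = np.array([
    [[0.1, -0.7, 0.2], [0.0, 0.3, -0.1]],
    [[-0.5, 0.4, 0.6], [0.9, -0.2, -1.2]],
])


def saliency_map(w):
    """A[i, j] = max over channels of |w[i, j, ch]|."""
    return np.abs(w).max(axis=-1)


A = saliency_map(w)
print(A)
# [[0.7 0.3]
#  [0.6 1.2]]
```

Taking the absolute value first means that a strongly negative gradient counts as much as a strongly positive one, so the map shows how sensitive the score is to each pixel, not the direction of the effect.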

Figure 2: Saliency map generated by the Saliency method for the class "pug". Image source: Stanford Dogs [2].

The original Saliency method produces a lot of noise but still gives us an idea of which parts of the input image are relevant when predicting a specific class. The noise becomes a problem when the object in the image has many details and the model uses most of them to make a prediction.

Further reading

I’ve decided to create a series of articles explaining the most important XAI methods currently used in practice. Here is the main article: XAI Methods - The Introduction


  1. K. Simonyan, A. Vedaldi, A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps, 2014.
  2. A. Khosla, N. Jayadevaprakash, B. Yao, L. Fei-Fei. Stanford Dogs dataset, 2019. Accessed: 2021-10-01.