There has been a recent debate over whether attention can be used to interpret deep models; see “Attention is not Explanation” and “Attention is not not Explanation”. Here, however, I want to introduce a simple but powerful attribution method for interpreting deep models: Integrated Gradients, proposed in the Google paper “Axiomatic Attribution for Deep Networks”.

In short, the attribution for the i-th feature of the input is defined as the path integral of the gradients along the straight-line path from the baseline x^{\prime} to the input x, scaled by the difference x_i - x^{\prime}_i:

\text { IntegratedGrads }_{i}(x)::=\left(x_{i}-x_{i}^{\prime}\right) \times \int_{\alpha=0}^{1} \frac{\partial F\left(x^{\prime}+\alpha \times\left(x-x^{\prime}\right)\right)}{\partial x_{i}} d \alpha

Here F: \mathrm{R}^{n} \rightarrow[0,1] represents a deep network, x is the input, x^{\prime} is the baseline, and \frac{\partial F(x)}{\partial x_{i}} is the gradient of F(x) along the i-th dimension.
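
As a quick sanity check of the definition, consider a toy function chosen purely for illustration (it is not from the paper): F(x_1, x_2) = x_1 x_2 with a, b \in [0,1], baseline x^{\prime} = (0, 0) and input x = (a, b). Points on the straight-line path are (\alpha a, \alpha b), and \frac{\partial F}{\partial x_{1}} = x_2, so the integrand is \alpha b:

\text { IntegratedGrads }_{1}(x)=(a-0) \times \int_{\alpha=0}^{1} \alpha b \, d \alpha=\frac{a b}{2}

By the same computation, \text { IntegratedGrads }_{2}(x)=\frac{a b}{2}. The two attributions sum to a b = F(x) - F(x^{\prime}), and the two interchangeable features receive equal credit.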

Applying the method takes two steps:

- Selecting a baseline. The paper recommends choosing a baseline that has a near-zero score. For computer vision, it can be a black image, or an adversarial example that has a zero score for a given input label (say, elephant), crafted by applying a tiny, carefully designed perturbation to an image with a very different label (say, microscope). For text-based networks, the paper finds that the all-zero input embedding vector is a good baseline.
- Computing Integrated Gradients. The integral can be approximated by simply summing the gradients at points spaced at sufficiently small intervals along the straight-line path from the baseline x^{\prime} to the input x, that is, by a Riemann sum with m steps:

\text { IntegratedGrads }_{i}^{\text{approx}}(x)::=\left(x_{i}-x_{i}^{\prime}\right) \times \sum_{k=1}^{m} \frac{\partial F\left(x^{\prime}+\frac{k}{m} \times\left(x-x^{\prime}\right)\right)}{\partial x_{i}} \times \frac{1}{m}
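
The approximation above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the "network" is a hypothetical logistic model with made-up weights `W` and bias `B`, its gradient is written out by hand rather than obtained by backpropagation, and the names `F`, `grad_F`, and `integrated_gradients` are my own, not taken from the paper or any library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy "network": F(x) = sigmoid(W . x + B), so F maps R^n -> [0, 1].
# The weights are made up; B is negative so the all-zero baseline scores near zero.
W = [2.0, -1.0, 1.5]
B = -3.0

def F(x):
    return sigmoid(sum(w * xi for w, xi in zip(W, x)) + B)

def grad_F(x):
    # Analytic gradient of the logistic model: dF/dx_i = F(x) * (1 - F(x)) * w_i.
    s = F(x)
    return [s * (1.0 - s) * w for w in W]

def integrated_gradients(x, baseline, grad_fn, m=300):
    """Riemann-sum approximation of Integrated Gradients with m steps."""
    total = [0.0] * len(x)
    for k in range(1, m + 1):
        alpha = k / m
        # Point on the straight-line path from the baseline to the input.
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            total[i] += g
    # Average the accumulated gradients and scale by (x_i - x'_i).
    return [(xi - bi) * t / m for xi, bi, t in zip(x, baseline, total)]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, grad_F, m=300)

print("attributions:", attributions)
print("sum of attributions:", sum(attributions))
print("F(x) - F(baseline): ", F(x) - F(baseline))
```

With a real network, `grad_fn` would be replaced by a call into the framework's automatic differentiation. The printed sum of attributions should be close to F(x) - F(x^{\prime}), and the gap shrinks as m grows.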

This method is implemented in Captum, a model interpretability and understanding library for PyTorch.

What’s more, the paper devotes considerable space to proving that the method has several desirable properties:

- Sensitivity: for every input and baseline that differ in one feature but have different predictions, the differing feature should be given a non-zero attribution.
- Implementation Invariance: the attributions are always identical for two functionally equivalent networks.
- Completeness: the attributions add up to the difference between the output of F at the input x and at the baseline x^{\prime}.
- Linearity: the attributions for a \cdot f_1 + b \cdot f_2 are the weighted sum of the attributions for f_1 and f_2, with weights a and b respectively.
- Symmetry-Preserving: for all inputs that have identical values for symmetric variables and baselines that have identical values for symmetric variables, the symmetric variables receive identical attributions.
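
Several of these properties can be checked numerically on small examples. The sketch below is a rough illustration, not a proof: the functions `f1`, `f2`, and `combo` are hypothetical toy models, gradients are estimated by central finite differences rather than backpropagation, and all names are my own. Implementation Invariance is not checked here, since it concerns two different implementations of the same function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of f at x (a stand-in for backprop)."""
    grads = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads

def integrated_gradients(f, x, baseline, m=1000):
    """Riemann-sum approximation with m steps along the straight-line path."""
    total = [0.0] * len(x)
    for k in range(1, m + 1):
        point = [bi + (k / m) * (xi - bi) for xi, bi in zip(x, baseline)]
        for i, g in enumerate(numeric_grad(f, point)):
            total[i] += g
    return [(xi - bi) * t / m for xi, bi, t in zip(x, baseline, total)]

# Hypothetical toy functions, chosen only to exercise the properties.
def f1(x):
    return sigmoid(x[0] * x[1] + x[2])  # symmetric in x[0] and x[1]

def f2(x):
    return sigmoid(2.0 * x[0] - x[2])

wa, wb = 0.3, 0.7  # weights for the linear combination

def combo(x):
    return wa * f1(x) + wb * f2(x)

x = [1.0, 1.0, 0.5]
baseline = [0.0, 0.0, 0.0]

ig_f1 = integrated_gradients(f1, x, baseline)
ig_f2 = integrated_gradients(f2, x, baseline)
ig_combo = integrated_gradients(combo, x, baseline)

# Completeness: attributions sum to F(x) - F(x').
completeness_gap = abs(sum(ig_f1) - (f1(x) - f1(baseline)))

# Linearity: IG of wa*f1 + wb*f2 equals wa*IG(f1) + wb*IG(f2).
linearity_gap = max(
    abs(c - (wa * u + wb * v)) for c, u, v in zip(ig_combo, ig_f1, ig_f2)
)

# Symmetry-Preserving: x[0] == x[1] and baseline[0] == baseline[1],
# so the two symmetric variables of f1 should get equal attributions.
symmetry_gap = abs(ig_f1[0] - ig_f1[1])

# Sensitivity: input and baseline differ only in feature 0 and the
# predictions differ, so feature 0 should get a non-zero attribution.
x_one = [1.0, 0.0, 0.0]
ig_one = integrated_gradients(f2, x_one, baseline)

print("completeness gap:", completeness_gap)
print("linearity gap:   ", linearity_gap)
print("symmetry gap:    ", symmetry_gap)
print("sensitivity attribution for feature 0:", ig_one[0])
```

The completeness gap is not exactly zero because the integral is approximated with a finite number of steps; it should shrink as m increases.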