Gradient of the Softmax Function

March 27, 2017 • Busa Victor

I have already explained how to compute the gradient of the SVM hinge loss in the previous post. In this article I will detail how to compute the gradient of the softmax function. This isn't difficult, yet it will help us understand how to use the chain rule.

Loss Function

Before differentiating the softmax function, we will define the problem. In our case we want to calculate:

\[\frac{\partial L_i(f(w_k))}{\partial w_k}\]

where:

\[L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum\limits_{j}{e^{f_j}}}\right)\]

and:

\[f_j = w_jx_i\]
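
To make these definitions concrete, here is a minimal single-example sketch. The helper name `softmax_loss_single` and the assumed shapes are mine, not from the original code; it simply evaluates $L_i$ for one training example $x_i$ with true class $y_i$:

  import numpy as np

  # A minimal sketch (helper name and shapes are my own assumptions):
  # W has shape (D, C) as in the vectorized code further down,
  # x_i has shape (D,) and y_i is the integer label of example x_i.
  def softmax_loss_single(W, x_i, y_i):
      f = x_i.dot(W)                      # scores f_j = w_j x_i, shape (C,)
      f -= np.max(f)                      # shift for numerical stability
      p = np.exp(f) / np.sum(np.exp(f))   # softmax probabilities
      return -np.log(p[y_i])              # L_i = -log(p_{y_i})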

Chain Rule

Let’s recall the chain rule for two functions $f$ and $g$:

\[\frac{\partial\, g(f(x))}{\partial x} = \frac{\partial g(u)}{\partial u}\frac{\partial u}{\partial x} \quad \text{where } u=f(x)\]

Using the chain rule and our notation, we have:

\begin{equation} \begin{gathered} \frac{\partial L_i(f(w_k))}{\partial w_k} = \frac{\partial L_i(f_k)}{\partial f_k}\frac{\partial f_k}{\partial w_k} \newline \text{where } f_k = f(w_k) = w_k x_i \end{gathered} \end{equation}

Analytic Gradient

We will first calculate $\frac{\partial L_i(f_k)}{\partial f_k}$:

\begin{equation} \begin{gathered} \frac{\partial L_i(f_k)}{\partial f_k} = \frac{\partial}{\partial f_k}\left(-\log\left(\frac{e^{f_{y_i}}}{\sum\limits_{j}{e^{f_j}}}\right)\right)
= - \left[\frac{\frac{\partial}{\partial f_k}\left(\frac{e^{f_{y_i}}}{\sum\limits_{j}{e^{f_j}}}\right)}{\frac{e^{f_{y_i}}}{\sum\limits_{j}{e^{f_j}}}}\right] \newline
= - \left[\frac{e^{f_{y_i}}\frac{\partial}{\partial f_k}\left(\frac{1}{\sum\limits_{j}{e^{f_j}}}\right) \ + \ \frac{\partial}{\partial f_k}\left(e^{f_{y_i}}\right)\frac{1}{\sum\limits_{j}{e^{f_j}}}}{\frac{e^{f_{y_i}}}{\sum\limits_{j}{e^{f_j}}}}\right] \newline
= - \left[\frac{-e^{f_{y_i}}\frac{\sum\limits_{j}\frac{\partial}{\partial f_k}e^{f_j}}{\left(\sum\limits_{j}{e^{f_j}}\right)^{2}} + \mathbb{1}(k = y_i)\frac{e^{f_{y_i}}}{\sum\limits_{j}{e^{f_j}}}}{\frac{e^{f_{y_i}}}{\sum\limits_{j}{e^{f_j}}}}\right] \newline
= - \left[\frac{-\frac{e^{f_{y_i}}e^{f_k}}{\left(\sum\limits_{j}{e^{f_j}}\right)^{2}} + \mathbb{1}(k = y_i)\frac{e^{f_{y_i}}}{\sum\limits_{j}{e^{f_j}}}}{\frac{e^{f_{y_i}}}{\sum\limits_{j}{e^{f_j}}}}\right] \newline
= \frac{\frac{e^{f_{y_i}}e^{f_k}}{\sum\limits_{j}{e^{f_j}}} \ - \ \mathbb{1}(k = y_i)\,e^{f_{y_i}}}{e^{f_{y_i}}} = p_k \ - \ \mathbb{1}(k = y_i) \end{gathered} \end{equation}

where, in the last step, we multiplied the numerator and the denominator by $\sum\limits_{j}{e^{f_j}}$, distributed the leading minus sign, and used the definition $p_k = \frac{e^{f_{k}}}{\sum\limits_{j}{e^{f_j}}}$ of the softmax probability of class $k$.

The other term of the chain rule can be computed straight away:

\begin{equation} \begin{aligned} \frac{\partial f_k}{\partial w_k} = \frac{\partial(w_k x_i)}{\partial w_k} = x_i \end{aligned} \end{equation}

Finally, using equations (1), (2), (3) we have:

\[\frac{\partial L_i(f(w_k))}{\partial w_k} = \frac{\partial L_i(f_k)}{\partial f_k}\frac{\partial f_k}{\partial w_k}=(p_k - \mathbb{1}(k = y_i))\,x_i\]
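
As a sanity check, we can compare this analytic expression against a numerical gradient. The sketch below is purely illustrative: it reuses the hypothetical `softmax_loss_single` helper (and the numpy import) from the earlier sketch, and the dimensions are arbitrary choices of mine.

  # Numerical check of dL_i/dw_k = (p_k - 1(k = y_i)) x_i, reusing the
  # hypothetical softmax_loss_single helper defined earlier.
  D, C = 5, 3                              # illustrative dimensions
  rng = np.random.RandomState(0)
  W = 0.01 * rng.randn(D, C)
  x_i, y_i = rng.randn(D), 1

  f = x_i.dot(W)
  p = np.exp(f - np.max(f)) / np.sum(np.exp(f - np.max(f)))
  ind = np.zeros(C)
  ind[y_i] = 1
  analytic = np.outer(x_i, p - ind)        # column k holds (p_k - 1(k = y_i)) x_i

  h = 1e-5
  numerical = np.zeros_like(W)
  for idx in np.ndindex(*W.shape):         # centered finite differences
      W_plus, W_minus = W.copy(), W.copy()
      W_plus[idx] += h
      W_minus[idx] -= h
      numerical[idx] = (softmax_loss_single(W_plus, x_i, y_i)
                        - softmax_loss_single(W_minus, x_i, y_i)) / (2 * h)

  print(np.max(np.abs(analytic - numerical)))  # should be vanishingly small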

In Python we can write the following vectorized implementation:

  import numpy as np

  # X: (N, D) data matrix, W: (D, C) weight matrix, y: (N,) integer class labels
  num_train = X.shape[0]

  f = X.dot(W)                           # scores, shape (N, C)
  f -= np.max(f, axis=1, keepdims=True)  # shift scores for numerical stability

  # Loss: L_i = - f(x_i)_{y_i} + log \sum_j e^{f(x_i)_j}
  # Vector of stacked correct f-scores: [f(x_1)_{y_1}, ..., f(x_N)_{y_N}]
  f_y = f[np.arange(num_train), y]

  exp_f_y = np.exp(f_y)
  exp_f = np.exp(f)

  loss = -np.sum(np.log(exp_f_y / np.sum(exp_f, axis=1)))

  # Gradient (backward pass): dW[:, k] = sum_i (p_{ik} - 1(k = y_i)) x_i
  p = exp_f / np.sum(exp_f, axis=1, keepdims=True)  # softmax probabilities, (N, C)
  ind = np.zeros_like(f)                            # indicator matrix 1(k = y_i)
  ind[np.arange(num_train), y] = 1
  dW = X.T.dot(p - ind)                             # gradient of the summed loss, (D, C)
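
As a quick illustration, assuming `X`, `W` and `y` have been set up (e.g. with random values of shapes (N, D), (D, C) and (N,), which are my own assumptions) and the snippet above has been run, we can check that the vectorized `dW` matches the per-example formula $\sum_i (p_k - \mathbb{1}(k = y_i))\,x_i$ accumulated in a naive loop:

  # Consistency check (illustrative): accumulate the per-example gradient
  # (p_k - 1(k = y_i)) x_i in a naive loop and compare with the vectorized dW.
  # Assumes X, W, y are defined and the snippet above has been executed.
  dW_naive = np.zeros_like(W)
  for i in range(num_train):
      ind_i = np.zeros(W.shape[1])
      ind_i[y[i]] = 1
      dW_naive += np.outer(X[i], p[i] - ind_i)  # contribution of example i
  print(np.max(np.abs(dW - dW_naive)))          # should be ~0 (round-off only)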

Conclusion

In this article we have seen how to differentiate the softmax function. It wasn't difficult. We just had to use derivative relations like $\frac{d}{dx}\left(\log(u)\right)=\frac{\frac{du}{dx}}{u}$ or $\frac{d}{dx}\left(uv\right)=\frac{du}{dx}v + u\frac{dv}{dx}$. Finally, we applied the chain rule to obtain the gradient w.r.t. the variables we were interested in.