DeepLibs

Softmax

Softmax is one of the most important layers in deep learning. It is usually the last layer of the network, and it is responsible for computing a discrete probability distribution over the network scores. In a classification problem the softmax layer outputs a confidence level for each output dimension (for each class).

For example, imagine a classification problem where we need to classify images into 5 classes. The neural network must have 5 independent outputs, each one representing a specific class. The softmax function outputs 5 confidence levels whose sum is equal to 1. Since the output is a probability distribution, the class with the highest confidence is the most likely class of the image.

The softmax function \(s_{i}\) is defined as:

$$ s_{i} = \frac{ e^{x_{i}} } { \sum_{j=1}^{N} { e^{x_{j}} } }$$

The index \(i\) represents a class, \(x_{i}\) is the network output (score) for class \(i\) and \(N\) is the total number of classes. It is important to note that the softmax output of each class depends on the outputs of all the other classes. In other words, each confidence level takes the scores of all classes into account.
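A minimal NumPy sketch of this formula (the subtraction of the maximum score is a common numerical-stability trick, not part of the definition above):

```python
import numpy as np

def softmax(x):
    """Compute the softmax of a vector of scores x."""
    # Shift by the maximum score for numerical stability; the result is unchanged.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)
```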

By the function definition we have:

$$ \sum_{i=1}^{N} s_{i} = 1 $$

Since the softmax function provides values between 0 and 1, and the sum of the confidences is equal to 1, the softmax output is a true discrete probability distribution. This property is not guaranteed with Euclidean layers that use the mean squared error (MSE).

For a given image, if a network outputs the following scores \( [0, 10, 15, 18, 20] \), the softmax output will be:

Class   Score   Confidence
0       0       0.0000
1       10      0.0000
2       15      0.0059
3       18      0.1185
4       20      0.8756

With a score of 20 and a confidence level of 87.56%, the most likely class is class number 4. Note that the softmax output is non-linear, and classes with high scores tend to reduce the confidence of classes with low scores. Therefore the softmax function reinforces the class separation in the network.
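The table above can be reproduced with the softmax sketch from the previous section (the printed values match the table up to rounding):

```python
scores = np.array([0.0, 10.0, 15.0, 18.0, 20.0])
confidences = softmax(scores)
for c, (score, conf) in enumerate(zip(scores, confidences)):
    print(f"class {c}: score {score:4.0f} -> confidence {conf:.4f}")
print(confidences.sum())  # ~1.0, a valid probability distribution
```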

Since the softmax function considers the output of each class, even if the correct class has a high score, other classes with close scores will decrease the confidence level of the correct class. For instance, if we increase the fourth class score from 18 to 19, the confidence level of the last class drops from 87.56% to 72.75%. This minor modification decreases the correct class confidence by more than 10 percentage points.
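The same experiment in code, continuing the snippet above:

```python
bumped = np.array([0.0, 10.0, 15.0, 19.0, 20.0])  # fourth class score: 18 -> 19
print(softmax(scores)[4])  # ~0.8756
print(softmax(bumped)[4])  # ~0.7275
```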

Also, keep in mind that the usual output of a classification network is not a probability distribution but the class of a given image. Therefore it is very common to use an argmax function after the softmax. The argmax function returns the argument (index) of the element with the highest confidence level.
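In NumPy this is simply np.argmax. Because softmax is monotonically increasing, taking the argmax of the raw scores gives the same index:

```python
predicted = np.argmax(softmax(scores))  # -> 4
print(predicted, np.argmax(scores))     # same index either way
```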


Softmax loss

Since the softmax is usually the last layer of a neural network, it is also responsible for computing the error signal that will be used to optimize the network weights. The softmax layer frequently uses the cross-entropy function (from information theory) to compute this error signal.

Given a "correct" probability distribution \(p\) and a "wrong" probability distribution \(q\), the cross-entropy function \(H\) is defined according to:

$$ H(p,q) = - \sum_{i=1}^{N}(p_{i} . log(q_{i})) $$

Where \(log\) is the natural logarithm (base \(e\)).

The cross-entropy function provides the error signal by summing the element-wise product of one probability distribution with the logarithm of the other. The "correct" distribution is a vector built from the correct labels, while the "wrong" distribution is the output of the softmax function.
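A minimal sketch of this function, continuing the NumPy snippets above (the small epsilon inside the logarithm is a safeguard against log(0) added here, not part of the definition):

```python
def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between the label distribution p and the softmax output q."""
    return -np.sum(p * np.log(q + eps))
```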

The "correct" distribution is build as an one hot encoding vector. In this technique, the element that represents the correct class is equal to 1 and the other elements are 0. One hot encoding has the great advantage to prevent the direct use of numerical values to encode classes. From the last example, a valid label vector that considers the last index the correct class of one specific sample would be equal to:

$$ labels = p_{i} = [0, 0, 0, 0, 1] $$
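A quick way to build such a one-hot label vector in NumPy:

```python
num_classes = 5
correct_class = 4
labels = np.eye(num_classes)[correct_class]  # -> array([0., 0., 0., 0., 1.])
```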

Also from the last example we have the following softmax output:

$$ output = q_{i} = [ 0, 0, 0.0059, 0.1185, 0.8756] $$

From the cross-entropy function:

\begin{align} H(p,q) &= - \sum_{i=1}^{N}(p_{i} . log(q_{i})) \\ &= - ( 0 . log(0) + 0 . log(0) + 0 . log(0.0059) + 0 . log(0.1185) + 1 . log(0.8756) ) \\ &= - ( 0 . (-20.1329) + 0 . (-10.1329) + 0 . (-5.1329) + 0 . (-2.1329) + 1 . (-0.1329) ) \\ &= - ( - 0.1329 ) = 0.1329 \end{align}

Note that the \(0\) arguments passed to the \(log\) function are actually very small softmax outputs close to \(0\) (they only appear as \(0\) above because of rounding).
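The same computation with the snippets above, using the unrounded softmax outputs so that no argument of the logarithm is exactly zero:

```python
q = softmax(scores)              # unrounded confidences
print(cross_entropy(labels, q))  # ~0.1329
```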

In practice, the cross-entropy function could be rewritten as:

$$ H(p,q) = - log(q_{c}) $$

Where \( c \) is the index that represents the correct class of the sample.

Due to the one-hot encoding, only the confidence value of the correct class contributes to the loss. However, as previously explained, that confidence level still takes the scores of all classes into account.
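Continuing the example, the simplified form gives the same value as the full sum:

```python
print(-np.log(q[correct_class]))  # ~0.1329, same as the full cross-entropy
```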

In this last example the label matched the class with the highest confidence. Let's make a simple experiment and consider the third class as the correct class. The label vector becomes:

$$ labels = p_{i} = [0, 0, 1, 0, 0] $$

And the cross-entropy:

$$ H(p,q) = -( 1 . (-5.1329)) = 5.1329 $$
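Again, this can be checked with the snippets above:

```python
print(-np.log(q[2]))  # ~5.1329: low confidence for the correct class -> large error
```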

Basically, the error signal increases as the confidence level of the correct class approaches 0 and decreases as it approaches 1.

The previous equations consider only a single input sample. However, the final loss is the average over the batch samples, given by:

$$ L = \sum_{i=1}^{N} \frac{H_{i}}{N} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} p_{i,j}.log(q_{i,j}) $$

Where N is the number of samples and M the number of classes.
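A batched sketch of this loss under the same one-hot assumption, where each row of the arrays is one sample (the batch built here is only illustrative):

```python
def batch_loss(labels, confidences, eps=1e-12):
    """Average cross-entropy over a batch; labels and confidences have shape (N, M)."""
    return -np.mean(np.sum(labels * np.log(confidences + eps), axis=1))

# Illustrative batch of 2 samples (correct classes 4 and 2) with the same scores.
P = np.eye(5)[[4, 2]]
Q = np.vstack([softmax(scores), softmax(scores)])
print(batch_loss(P, Q))  # ~(0.1329 + 5.1329) / 2
```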


Backpropagation gradients

The softmax gradients used to train the network are computed by differentiating the previous loss equation. With the chain rule we can decompose the derivative into two parts: the cross-entropy derivative and the softmax derivative.

$$ \frac{∂L}{∂x} = - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \frac{∂}{∂q}( p_{i,j}.log(q_{i,j}) ) \frac{∂q_{i,j}}{∂x} $$

The labels \(p_{i,j}\) are constants to the input \(x\) but the confidence values \(q_{i,j}\) depend on \(x\). Therefore, for the cross-entropy derivative we have:

$$ \frac{∂}{∂q}( p_{i,j}.log(q_{i,j}) ) = p_{i,j} . \frac{1}{q_{i,j}} $$

To calculate the softmax derivative we use the quotient rule and we get:

$$ \frac{∂q_{i,j}}{∂x_{k}} = q_{i,j}(\delta_{j,k} - q_{i,k}) $$ \begin{cases} j \equiv k &\rightarrow \delta=1 \\ j \neq k &\rightarrow \delta=0 \end{cases}

The softmax derivative depends on the input scores \(x_{k}\). When \(j\) and \(k\) are equal they represent the same class and the derivative becomes \( q_{i,j}(1 - q_{i,j}) \). When they are different, the derivative becomes \( - q_{i,j} . q_{i,k} \).
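For a single sample, this Jacobian can be written compactly as a sketch, with q being the softmax output vector of that sample:

```python
def softmax_jacobian(q):
    """Jacobian J[j, k] = q_j * (delta_{j,k} - q_k) of the softmax output q."""
    return np.diag(q) - np.outer(q, q)
```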

Replacing the partial derivatives in the gradient equation we get:

\begin{align} \frac{∂L}{∂x_{k}} &= - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \frac{∂}{∂q}( p_{i,j}.log(q_{i,j}) ) \frac{∂q_{i,j}}{∂x_{k}} \\ &= - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} p_{i,j} . \frac{1}{q_{i,j}} . q_{i,j} . (\delta_{j,k} - q_{i,k}) \\ &= - \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} p_{i,j} . (\delta_{j,k} - q_{i,k}) \end{align} \begin{cases} k \equiv j &\rightarrow \delta=1 \\ k \neq j &\rightarrow \delta=0 \end{cases}

We can simplify this equation a little bit more.
When using one-hot encoding, only the correct class in the label vector \(p_{i,j}\) is equal to 1; all the other values are 0. Therefore, the summation over the \(M\) classes has only one non-zero term and can be removed from the equation.

\begin{align} \frac{∂L}{∂x_{k}} &= - \frac{1}{N} \sum_{i=1}^{N} (\delta_{c,k} - q_{i,k}) = \frac{1}{N} \sum_{i=1}^{N} (q_{i,k} - p_{i,k}) \end{align} \begin{cases} k \equiv c &\rightarrow \delta=1 \\ k \neq c &\rightarrow \delta=0 \end{cases}

In this equation, \(c\) represents the correct class (or index) of sample \(i\), and the last equality uses the fact that, with one-hot labels, \(p_{i,k} = \delta_{c,k}\).
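In other words, the gradient with respect to the scores is simply the difference between the softmax output and the one-hot labels, scaled by 1/N. A sketch for a batch, reusing the illustrative P and Q from the batch-loss snippet above:

```python
def softmax_cross_entropy_grad(labels, confidences):
    """Per-sample gradient of the averaged loss with respect to the input scores."""
    N = labels.shape[0]
    return (confidences - labels) / N

print(softmax_cross_entropy_grad(P, Q))  # each row is (q - p) / N
```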


A Python script with examples can be found in the repository.