
Softmax: The Smooth Alternative to Argmax

Softmax is actually a softer version of argmax.

How many different ways can we think of to interpret the softmax function? I bet it is more than you think! In a previous post, we showed that softmaxing can be thought of as utility maximization. In this one, we focus on the fact that the softmax function is essentially a differentiable approximation of argmax. This connection is both elegant and mathematically profound.

The Argmax

In case this is the first time you are hearing of it, the argmax function identifies the index of the maximum value in a list or array. Mathematically, for a vector $\mathbf{z} = [z_1, z_2, \dots, z_n]$:

For comparison with softmax, it is convenient to represent this choice as a one-hot vector $\mathbf{e}_k$ (a 1 in position $k$, 0 elsewhere):

$$\operatorname{argmax}(\mathbf{z}) = \mathbf{e}_k, \quad \text{where } k = \arg\max_i z_i$$

Example:

Given $\mathbf{z} = [2, 5, 1, 3]$, we have:

The maximum value 5 sits in the second position, so $k = 2$ and

$$\operatorname{argmax}(\mathbf{z}) = [0, 1, 0, 0]$$
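To make this concrete, here is a minimal NumPy sketch (my own, not from the original post) that recovers both the index and the one-hot vector; note that np.argmax uses 0-based indexing, so the second entry is index 1.

import numpy as np

# Recover the index of the maximum and its one-hot representation.
z = np.array([2, 5, 1, 3])

k = np.argmax(z)            # k = 1 (0-based index of the maximum, value 5)
one_hot = np.zeros_like(z)
one_hot[k] = 1              # [0, 1, 0, 0]

print(k, one_hot)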

Introducing Softmax

The softmax function transforms a vector of real numbers into a probability distribution: each element is squashed between 0 and 1, and the entire vector sums to 1. Unlike argmax, softmax is smooth and differentiable.

Mathematically:

$$\sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

Softmax can be seen as a “soft” version of argmax because it highlights the largest values while still assigning some probability to the other elements.
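A small NumPy sketch of this formula (the softmax function name is my own, not from the post); subtracting the maximum before exponentiating is the usual numerical-stability trick and does not change the result.

import numpy as np

def softmax(z):
    # Softmax of a vector z: exponentiate and normalize.
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()        # stability shift; output is unchanged
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

print(softmax([2, 5, 1, 3]))     # roughly [0.04, 0.83, 0.02, 0.11]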

The Temperature Parameter

One can introduce a temperature parameter τ to control the entropy or the “sharpness” of the softmax distribution:

$$\sigma(\mathbf{z})_i = \frac{e^{z_i / \tau}}{\sum_{j=1}^{n} e^{z_j / \tau}}$$

The key insight is that as τ approaches 0, softmax converges to argmax:

$$\lim_{\tau \to 0} \sigma(\mathbf{z})_i = \begin{cases} 1, & \text{if } z_i = \max_j z_j \\ 0, & \text{otherwise} \end{cases}$$

(assuming the maximum is unique; with ties, the limit splits the probability mass equally among the tied entries).

This behavior exactly mirrors the argmax function while retaining differentiability, which is crucial for gradient-based optimization in machine learning.

For example, with decreasing temperature:

z = [1.0, 2.0, 1.5]

τ = 1.0:   [0.19, 0.51, 0.31]
τ = 0.1:   [0.00, 0.99, 0.01]
τ = 0.01:  [0.00, 1.00, 0.00]  # Almost exactly argmax!
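The following NumPy sketch (my own code, with a hypothetical softmax_with_temperature helper) reproduces this sweep, making the convergence toward the one-hot argmax easy to check.

import numpy as np

def softmax_with_temperature(z, tau=1.0):
    # Softmax of z / tau: smaller tau sharpens the distribution.
    z = np.asarray(z, dtype=float) / tau
    z = z - z.max()              # numerical stability; output is unchanged
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

z = [1.0, 2.0, 1.5]
for tau in (1.0, 0.1, 0.01):
    print(tau, np.round(softmax_with_temperature(z, tau), 2))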

This is why softmax is often called the “soft” version of argmax - it gives us all the benefits of a smooth, differentiable function while approximating the hard decision-making of argmax.
