Softmax: The Smooth Alternative to Argmax
How many different ways can we think of to interpret the softmax function? I bet it is more than you think! In a previous post, we showed that softmaxing can be thought of as utility maximization. In this one, we are going to focus on the idea that the softmax function is essentially a differentiable approximation of argmax. This connection is both elegant and mathematically profound.
The argmax
In case this is the first time you are hearing about it, the argmax
function identifies the index of the maximum value in a list or array. Mathematically, for a vector $\mathbf{z} = (z_1, z_2, \dots, z_n)$:

$$\operatorname{argmax}(\mathbf{z}) = \arg\max_{i \in \{1, \dots, n\}} z_i$$
Example:
Given $\mathbf{z} = [1.0, 2.0, 1.5]$, we have:

$$\operatorname{argmax}(\mathbf{z}) = 2,$$

since the largest entry, $2.0$, sits in the second position.
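As a quick sanity check, here is the same example in NumPy (the choice of library is mine, not the post's); keep in mind that NumPy indexes from 0, so the second entry comes back as index 1.

```python
import numpy as np

z = np.array([1.0, 2.0, 1.5])
print(np.argmax(z))  # -> 1, i.e. the second entry (NumPy indices start at 0)
```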
Introducing Softmax
The softmax function transforms a vector of real numbers into a probability distribution: each element is squashed between 0 and 1, and the entire vector sums to 1. Crucially, softmax is smooth and differentiable.
Mathematically:

$$\operatorname{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$
Softmax can be seen as a “soft” version of argmax because it highlights the largest values while still assigning some probability to the other elements.
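To make the formula concrete, here is a minimal softmax sketch in NumPy (again, my library choice rather than anything the post prescribes). Subtracting the maximum before exponentiating is a standard numerical-stability trick that leaves the result unchanged.

```python
import numpy as np

def softmax(z):
    """Map a vector of real numbers to a probability distribution."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)      # subtract the max to avoid overflow; output is unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)

print(softmax([1.0, 2.0, 1.5]))  # ~[0.19, 0.51, 0.31], entries sum to 1
```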
The Temperature Parameter
One can introduce a temperature parameter $\tau$ to control the entropy, or “sharpness”, of the softmax distribution:

$$\operatorname{softmax}_\tau(\mathbf{z})_i = \frac{e^{z_i/\tau}}{\sum_{j=1}^{n} e^{z_j/\tau}}$$
The key insight is that as $\tau$ approaches 0, softmax converges to argmax (assuming the maximum is unique):

$$\lim_{\tau \to 0^{+}} \operatorname{softmax}_\tau(\mathbf{z})_i =
\begin{cases}
1 & \text{if } i = \operatorname{argmax}(\mathbf{z}) \\
0 & \text{otherwise}
\end{cases}$$
- Note that as $\tau \to \infty$, softmax outputs a uniform distribution (maximum entropy).
This behavior exactly mirrors the argmax function while retaining differentiability, which is crucial for gradient-based optimization in machine learning.
For example, with decreasing temperature:
z = [1.0, 2.0, 1.5]
τ = 1.0: [0.19, 0.51, 0.31]
τ = 0.1: [0.00, 0.99, 0.01]
τ = 0.01: [0.00, 1.00, 0.00] # Almost exactly argmax!
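These numbers can be reproduced with a small sketch: `softmax_temp` below is just the earlier softmax with the inputs divided by τ (the function name is mine, not anything standard).

```python
import numpy as np

def softmax_temp(z, tau=1.0):
    """Softmax with a temperature parameter tau."""
    z = np.asarray(z, dtype=float) / tau
    z = z - np.max(z)                 # numerical stability
    exps = np.exp(z)
    return exps / np.sum(exps)

z = [1.0, 2.0, 1.5]
for tau in (1.0, 0.1, 0.01):
    print(tau, np.round(softmax_temp(z, tau), 2))
# Roughly:
# 1.0  [0.19 0.51 0.31]
# 0.1  [0.   0.99 0.01]
# 0.01 [0.   1.   0.  ]
```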
This is why softmax is often called the “soft” version of argmax - it gives us all the benefits of a smooth, differentiable function while approximating the hard decision-making of argmax.