Softmax-ing is actually (Random) Utility Maximization
This post is long overdue, but better late than never. Softmaxing is prevalent in many fields of science, but this post is about utility maximization: I will show how softmaxing logits can be seen as maximizing utility when faced with multiple choices. (This property is also the basis of the mathematical principle behind the Gumbel-Max Trick, which I demonstrate in another post.)
The Utility Function
Consider a utility-maximizing decision-maker who has to choose among $K$ alternatives. Suppose that for each alternative $i$, the total utility $U_i$ consists of two parts:
Deterministic component (aka logits) $V_i$: This is the observable part of the utility. We could assume a functional form for how $V_i$ is constructed, but since leaving it abstract is more general, I will leave it like this.
Random component $\epsilon_i$: This captures unobservable factors influencing the utility and is modeled as a random variable.
Thus, the total utility for each alternative i is given by:
$$U_i = V_i + \epsilon_i$$
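As a quick illustration, a single decision under this model can be simulated in a few lines. The logit values and the use of standard Gumbel noise below are my own hypothetical choices (the distributional assumption is made precise later in the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical deterministic utilities (logits) for K = 3 alternatives.
V = np.array([2.0, 1.0, 0.5])

# One realization of the random components; standard Gumbel noise is used
# here, anticipating the distributional assumption made later in the post.
eps = rng.gumbel(size=V.shape)

# Total utility U_i = V_i + eps_i; the decision-maker picks the largest.
U = V + eps
choice = int(np.argmax(U))
```

Different draws of `eps` can flip the winner, which is exactly why the chosen alternative is itself random.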
The decision-maker selects the alternative with the highest utility. The utility maximization problem they face is therefore:
$$\max_{i \in \{1,\dots,K\}} \mathbb{E}[U_i] = \max_{i \in \{1,\dots,K\}} \mathbb{E}[V_i + \epsilon_i]$$
Now this will get a bit crowded with equations before it clears up, so bear with me. We know that the optimal selection will be random, as all of the choices have random parts. Also, for the $i$-th option to be optimal, it needs to give more utility than the rest of the tokens, ohmm, I am sorry, options. Thus the probability that the $i$-th option is the optimal one can be expressed as:
$$P(i) = \Pr(U_i \geq U_j, \; \forall j \neq i)$$
If we just plug in the expression for total utility, $U_i = V_i + \epsilon_i$, we get:
$$P(i) = \Pr(V_i + \epsilon_i \geq V_j + \epsilon_j, \; \forall j \neq i)$$
Next, we are going to tie this beauty to the softmax. Writing the expression more cleanly:
$$P(i) = \Pr(\epsilon_j - \epsilon_i \leq V_i - V_j, \; \forall j \neq i)$$
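Before finishing the derivation, the destination can be checked empirically: if the $\epsilon_i$ are i.i.d. standard Gumbel draws (the assumption introduced below), the frequency with which each alternative wins should match the softmax of the logits. A Monte Carlo sketch, with logits of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
V = np.array([2.0, 1.0, 0.5])   # hypothetical logits
n = 200_000                     # number of simulated decisions

# i.i.d. standard Gumbel noise for every alternative in every decision.
eps = rng.gumbel(size=(n, V.size))

# Each simulated decision picks the alternative with the highest V_i + eps_i.
winners = np.argmax(V + eps, axis=1)
freq = np.bincount(winners, minlength=V.size) / n

# Softmax of the logits, for comparison.
p = np.exp(V) / np.exp(V).sum()
```

The empirical winning frequencies `freq` land within Monte Carlo error of `p`, which is the claim the rest of the post proves analytically.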
From here there are two paths that can take us to the desired point: one easy and one not-so-easy. The easy way uses the well-known fact that the difference of two independent Gumbel random variables follows a logistic distribution. But let's not take that route.
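(For the curious, that easy-way fact is itself simple to verify numerically; the sample size and evaluation points below are arbitrary choices of mine:)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Differences of two independent standard Gumbel samples.
d = rng.gumbel(size=n) - rng.gumbel(size=n)

# Empirical CDF of the differences at a few arbitrary points...
pts = np.array([-2.0, 0.0, 1.5])
emp = np.array([(d <= t).mean() for t in pts])

# ...compared against the standard logistic CDF 1 / (1 + e^{-x}).
logistic = 1.0 / (1.0 + np.exp(-pts))
```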
Not-So-Easy Way
Let’s take the not-so-easy way and not ask ourselves why.
Given that the $\epsilon_j$ are independent and identically distributed (i.i.d.), we can condition on $\epsilon_i$ and integrate over its distribution to account for all possible values:
$$P(i) = \int_{-\infty}^{\infty} \prod_{j \neq i} F(\epsilon_i + V_i - V_j) \, f(\epsilon_i) \, d\epsilon_i$$
where $F(\cdot)$ is the cumulative distribution function (CDF) of the $\epsilon_j$ and $f(\cdot)$ is the probability density function (PDF) of $\epsilon_i$.
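Under the Gumbel assumption made in the next step, this integral can be evaluated numerically and compared against the softmax probability. A sketch with hypothetical logits and an integration grid of my own choosing (the standard Gumbel has CDF $F(x) = e^{-e^{-x}}$ and PDF $f(x) = e^{-x - e^{-x}}$):

```python
import numpy as np

V = np.array([2.0, 1.0, 0.5])   # hypothetical logits
i = 0                           # compute P(i) for the first alternative

# Standard Gumbel CDF and PDF.
F = lambda x: np.exp(-np.exp(-x))
f = lambda x: np.exp(-x - np.exp(-x))

# Riemann-sum approximation of the integral over eps_i on a wide grid.
x = np.linspace(-10.0, 20.0, 300_001)
dx = x[1] - x[0]

integrand = f(x)
for j in range(V.size):
    if j != i:
        integrand = integrand * F(x + V[i] - V[j])

P_i = float(np.sum(integrand) * dx)

# The derivation says this should equal the softmax probability of alternative i.
softmax_i = float(np.exp(V[i]) / np.exp(V).sum())
```

The two numbers agree to high precision, which previews the closed-form result the integral is about to deliver.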
Assume that the $\epsilon_i$ are i.i.d. and follow a Gumbel (Type I Extreme Value) distribution: