
Softmax-ing is actually (Random) Utility Maximization

This post is long overdue, but better late than never. Softmaxing is prevalent in many fields of science, but this one is going to be about utility maximization. I will show how softmaxing logits can be seen as maximizing utility when faced with multiple choices. (This property is also the basis of the mathematical principle behind the Gumbel-Max Trick, which I demonstrate in another post.)

The Utility Function

Consider a utility-maximizing decision-maker who has to choose from among $K$ alternatives. Suppose for each alternative $i$, the total utility $U_i$ consists of two parts: a deterministic component $V_i$, which is known, and a random component $\epsilon_i$, which captures everything unobserved.

Thus, the total utility for each alternative $i$ is given by:

$$U_i = V_i + \epsilon_i$$

The decision-maker selects the alternative with the highest utility. Therefore, the utility maximization problem they face is:

$$\max_{i \in \{1, \dots, K\}} U_i = \max_{i \in \{1, \dots, K\}} \left( V_i + \epsilon_i \right)$$
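
To make the setup concrete, here is a minimal sketch in Python. The utilities are made-up values, and the noise distribution is deliberately left generic for now; we will pin it down shortly.

```python
import numpy as np

rng = np.random.default_rng(0)

V = np.array([1.0, 2.0, 0.5])       # deterministic utilities V_i (hypothetical values)
eps = rng.standard_normal(V.shape)  # random components eps_i; the exact
                                    # distribution is chosen later in the post
U = V + eps                         # total utility U_i = V_i + eps_i
print(np.argmax(U))                 # the decision-maker picks the highest U_i
```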

Now this will get a bit crowded with equations before it gets better, so bear with me. Since every choice has a random part, the optimal selection itself is random. Also, for the $i^{th}$ option to be optimal, it needs to give more utility than the rest of the tokens, oops, I mean options. Thus the probability that the $i^{th}$ option is the optimal one can be expressed as:

$$P(i) = \Pr(U_i \geq U_j \ \ \forall \, j \neq i)$$

If we just plug in the expression for total utility $U_i = V_i + \epsilon_i$, we get:

$$P(i) = \Pr(V_i + \epsilon_i \geq V_j + \epsilon_j \ \ \forall \, j \neq i)$$

Next, we are going to tie this beauty to the softmax. Writing the expression more cleanly:

$$P(i) = \Pr\left( \epsilon_j - \epsilon_i \leq V_i - V_j \ \ \forall \, j \neq i \right)$$

From here, there are two paths that can take us to the desired point: one easy and one not-so-easy. The easy way uses the well-known fact that the difference of two independent Gumbel random variables follows a logistic distribution. But let's not take that route.
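
For the curious, here is a quick numerical sanity check of that fact, a sketch using numpy and scipy (the sample size is arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Claim: if eps_i and eps_j are i.i.d. standard Gumbel, then
# eps_j - eps_i follows a standard logistic distribution.
diff = rng.gumbel(size=100_000) - rng.gumbel(size=100_000)

# Kolmogorov-Smirnov test against the standard logistic CDF;
# a large p-value means the sample is consistent with the claim.
print(stats.kstest(diff, stats.logistic.cdf))
```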

Not-So-Easy Way

Let's go the not-so-easy way and not ask ourselves why.

Given that the $\epsilon_j$ are independent and identically distributed (i.i.d.), we can condition on $\epsilon_i$, multiply the probabilities of the now-independent events, and then integrate over the distribution of $\epsilon_i$ to account for all of its possible values:

$$P(i) = \int_{-\infty}^{\infty} \left[ \prod_{j \neq i} F(\epsilon_i + V_i - V_j) \right] f(\epsilon_i) \, d\epsilon_i$$

where $F(\cdot)$ is the cumulative distribution function (CDF) of $\epsilon_j$ and $f(\cdot)$ is the probability density function (PDF) of $\epsilon_i$.

Assume that the $\epsilon_i$ are i.i.d. and follow a Gumbel (Type I Extreme Value) distribution:

$$\text{Gumbel PDF:} \quad f(x) = e^{-(x + e^{-x})}$$

$$\text{Gumbel CDF:} \quad F(x) = \exp(-e^{-x})$$
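
These two formulas are easy to cross-check against scipy's standard Gumbel distribution (`gumbel_r` with its default location 0 and scale 1), as in this small sketch:

```python
import numpy as np
from scipy import stats

def gumbel_pdf(x):
    # f(x) = exp(-(x + exp(-x)))
    return np.exp(-(x + np.exp(-x)))

def gumbel_cdf(x):
    # F(x) = exp(-exp(-x))
    return np.exp(-np.exp(-x))

x = np.linspace(-4.0, 8.0, 25)
assert np.allclose(gumbel_pdf(x), stats.gumbel_r.pdf(x))
assert np.allclose(gumbel_cdf(x), stats.gumbel_r.cdf(x))
```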

Substituting $F(\cdot)$ and $f(\cdot)$ into the expression:

$$\text{CDF:} \quad F(\epsilon_i + V_i - V_j) = \exp\left(-e^{-(\epsilon_i + V_i - V_j)}\right)$$

$$\text{PDF:} \quad f(\epsilon_i) = e^{-(\epsilon_i + e^{-\epsilon_i})}$$

If we plug the expression for the CDF into the product inside the square brackets, the product over $j \neq i$ becomes:

$$\prod_{j \neq i} \exp\left( -e^{-(\epsilon_i + V_i - V_j)} \right) = \exp\left( -\sum_{j \neq i} e^{-(\epsilon_i + V_i - V_j)} \right) = \exp\left( -e^{-\epsilon_i} \sum_{j \neq i} e^{V_j - V_i} \right)$$

Now, plugging the product and the PDF into the main equation gives a messy-looking expression, but things will simplify:

$$P(i) = \int_{-\infty}^{\infty} \exp\left( -e^{-\epsilon_i} \sum_{j \neq i} e^{V_j - V_i} \right) \cdot e^{-(\epsilon_i + e^{-\epsilon_i})} \, d\epsilon_i$$
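
Before simplifying, we can already check this expression numerically: integrating it should reproduce the frequency with which alternative $i$ wins in simulation. A sketch, with hypothetical utilities:

```python
import numpy as np
from scipy.integrate import quad

V = np.array([1.0, 2.0, 0.5])    # hypothetical deterministic utilities
i = 0                            # compute P(i) for the first alternative
others = np.delete(V, i) - V[i]  # V_j - V_i for all j != i

def integrand(e):
    # exp(-e^{-eps} * sum_j e^{V_j - V_i}) * e^{-(eps + e^{-eps})}
    return np.exp(-np.exp(-e) * np.exp(others).sum()) * np.exp(-(e + np.exp(-e)))

p_integral, _ = quad(integrand, -np.inf, np.inf)

# Monte Carlo check: how often does alternative i have the highest utility?
rng = np.random.default_rng(0)
U = V + rng.gumbel(size=(200_000, len(V)))
p_simulated = (np.argmax(U, axis=1) == i).mean()

print(p_integral, p_simulated)   # should agree up to Monte Carlo error
```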

Just separating the terms in the PDF,

$$P(i) = \int_{-\infty}^{\infty} \exp\left( -e^{-\epsilon_i} \sum_{j \neq i} e^{V_j - V_i} \right) \cdot e^{-e^{-\epsilon_i}} \cdot e^{-\epsilon_i} \, d\epsilon_i$$

Just to be super clear, let's write the double exponential in a slightly clearer way,

$$P(i) = \int_{-\infty}^{\infty} \exp\left( -e^{-\epsilon_i} \sum_{j \neq i} e^{V_j - V_i} \right) \cdot \exp\left(-e^{-\epsilon_i}\right) \cdot e^{-\epsilon_i} \, d\epsilon_i$$

Putting everything inside the same exponent,

$$P(i) = \int_{-\infty}^{\infty} \exp\left( -e^{-\epsilon_i} \left( 1 + \sum_{j \neq i} e^{V_j - V_i} \right) \right) \cdot e^{-\epsilon_i} \, d\epsilon_i$$

and, to do our eyes a favor, let:

$$S = 1 + \sum_{j \neq i} e^{V_j - V_i}$$

Now, the choice probability is:

$$P(i) = \int_{-\infty}^{\infty} e^{-\epsilon_i} \cdot \exp\left( -e^{-\epsilon_i} S \right) d\epsilon_i$$

We are almost there, but first we need to make a substitution:

$$t = e^{-\epsilon_i} \quad \Rightarrow \quad \epsilon_i = -\ln t, \quad d\epsilon_i = -\frac{dt}{t}$$

When $\epsilon_i$ goes from $-\infty$ to $\infty$, $t$ goes from $\infty$ to $0$.

Substitute $t$ and $d\epsilon_i$ into the integral:

$$P(i) = \int_{t=\infty}^{t=0} e^{-(-\ln t)} \cdot \exp(-tS) \cdot \left( -\frac{dt}{t} \right)$$

Now this is good news, because $e^{-(-\ln t)} = t$ and $t \left( -\frac{dt}{t} \right) = -dt$, and we can also get rid of the negative sign by switching the bounds of the integral,

$$P(i) = \int_{t=\infty}^{t=0} \exp(-tS) \, (-dt) = \int_{t=0}^{t=\infty} \exp(-tS) \, dt$$

This is not a hard integral to solve,

$$P(i) = \int_0^{\infty} \exp(-tS) \, dt = \frac{1}{S}$$
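
If you don't trust pen and paper, a one-line sympy check confirms the value (assuming $S > 0$, which holds since $S$ is one plus a sum of exponentials):

```python
import sympy as sp

t, S = sp.symbols('t S', positive=True)
# integral of exp(-t*S) from 0 to infinity
print(sp.integrate(sp.exp(-t * S), (t, 0, sp.oo)))  # prints 1/S
```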

Plugging $S$ back in,

$$P(i) = \frac{1}{S} = \frac{1}{1 + \sum_{j \neq i} e^{V_j - V_i}}$$

Finally, let's write $e^{V_j - V_i}$ as $\frac{e^{V_j}}{e^{V_i}}$:

$$S = 1 + \sum_{j \neq i} \frac{e^{V_j}}{e^{V_i}} = \frac{1}{e^{V_i}} \left( e^{V_i} + \sum_{j \neq i} e^{V_j} \right) = \frac{1}{e^{V_i}} \sum_{j=1}^{K} e^{V_j}$$

Therefore:

$$P(i) = \frac{1}{S} = \frac{e^{V_i}}{\sum_{j=1}^{K} e^{V_j}}$$

Thus, the choice probability for alternative $i$ is:

$$P(i) = \frac{e^{V_i}}{\sum_{k=1}^{K} e^{V_k}}$$

which is exactly the softmax of the deterministic utilities $V_i$, i.e., of the logits.
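
And that is the punchline: adding i.i.d. Gumbel noise to the $V_i$ and taking the argmax samples from the softmax of the $V_i$. Here is a minimal end-to-end sketch (hypothetical logits, arbitrary sample size):

```python
import numpy as np

rng = np.random.default_rng(0)

V = np.array([1.0, 2.0, 0.5])    # hypothetical logits / deterministic utilities
n = 500_000

# Random-utility choices: add i.i.d. standard Gumbel noise, take the argmax
choices = np.argmax(V + rng.gumbel(size=(n, len(V))), axis=1)
frequencies = np.bincount(choices, minlength=len(V)) / n

# The empirical choice frequencies should match softmax(V)
softmax = np.exp(V) / np.exp(V).sum()
print(frequencies)
print(softmax)   # close to the frequencies, up to Monte Carlo error
```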

We will continue messing with the softmax in other posts.
