Softmax-ing is actually (Random) Utility Maximization
This post is long overdue, but better late than never. Softmaxing is prevalent in many fields of science, but this post is about utility maximization: I will show how softmaxing logits can be seen as maximizing utility when faced with multiple choices. (This property is also the basis of the mathematical principle behind the Gumbel-Max Trick, which I demonstrate in another post.)
The Utility Function
Consider a utility-maximizing decision-maker who has to choose among $K$ alternatives. Suppose that for each alternative $i$, the total utility $U_i$ consists of two parts:
Deterministic component (aka logits) $V_i$: This is the observable part of the utility. We could assume a functional form for how $V_i$ is constructed, but since leaving it abstract is more general, I will leave it like this.
Random component $\epsilon_i$: This captures unobservable factors influencing the utility and is modeled as a random variable.
Thus, the total utility for each alternative i is given by:
$$U_i = V_i + \epsilon_i$$
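As a quick illustration, a single decision under this model can be simulated in a few lines. The logit values and the use of standard Gumbel noise below are my own hypothetical choices (the distributional assumption is made precise later in the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical deterministic utilities (logits) for K = 3 alternatives.
V = np.array([2.0, 1.0, 0.5])

# One realization of the random components; standard Gumbel noise is used
# here, anticipating the distributional assumption made later in the post.
eps = rng.gumbel(size=V.shape)

# Total utility U_i = V_i + eps_i; the decision-maker picks the largest.
U = V + eps
choice = int(np.argmax(U))
```

Different draws of `eps` can flip the winner, which is exactly why the chosen alternative is itself random.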
The decision-maker selects the alternative with the highest utility. The utility maximization problem they face is therefore:
$$\max_{i \in \{1,\dots,K\}} \mathbb{E}[U_i] = \max_{i \in \{1,\dots,K\}} \mathbb{E}[V_i + \epsilon_i]$$
Now this will get a bit crowded with equations before it clears up, so bear with me. We know that the optimal selection will be random, as all of the choices have random parts. Also, for the $i$-th option to be optimal, it needs to give more utility than the rest of the tokens, ohmm, I am sorry, options. Thus the probability that the $i$-th option is the optimal one can be expressed as:
$$P(i) = \Pr(U_i \geq U_j, \; \forall j \neq i)$$
If we just plug in the expression for total utility, $U_i = V_i + \epsilon_i$, we get:
$$P(i) = \Pr(V_i + \epsilon_i \geq V_j + \epsilon_j, \; \forall j \neq i)$$
Next, we are going to tie this beauty to the softmax. Writing the expression more cleanly:
$$P(i) = \Pr(\epsilon_j - \epsilon_i \leq V_i - V_j, \; \forall j \neq i)$$
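Before finishing the derivation, the destination can be checked empirically: if the $\epsilon_i$ are i.i.d. standard Gumbel draws (the assumption introduced below), the frequency with which each alternative wins should match the softmax of the logits. A Monte Carlo sketch, with logits of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
V = np.array([2.0, 1.0, 0.5])   # hypothetical logits
n = 200_000                     # number of simulated decisions

# i.i.d. standard Gumbel noise for every alternative in every decision.
eps = rng.gumbel(size=(n, V.size))

# Each simulated decision picks the alternative with the highest V_i + eps_i.
winners = np.argmax(V + eps, axis=1)
freq = np.bincount(winners, minlength=V.size) / n

# Softmax of the logits, for comparison.
p = np.exp(V) / np.exp(V).sum()
```

The empirical winning frequencies `freq` land within Monte Carlo error of `p`, which is the claim the rest of the post proves analytically.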
From here there are two paths that can take us to the desired point: one easy and one not-so-easy. The easy way uses the well-known fact that the difference of two independent Gumbel random variables follows a logistic distribution. But let's not take that route.
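(For the curious, that easy-way fact is itself simple to verify numerically; the sample size and evaluation points below are arbitrary choices of mine:)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Differences of two independent standard Gumbel samples.
d = rng.gumbel(size=n) - rng.gumbel(size=n)

# Empirical CDF of the differences at a few arbitrary points...
pts = np.array([-2.0, 0.0, 1.5])
emp = np.array([(d <= t).mean() for t in pts])

# ...compared against the standard logistic CDF 1 / (1 + e^{-x}).
logistic = 1.0 / (1.0 + np.exp(-pts))
```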
Not-So-Easy Way
Let’s take the not-so-easy way and not ask ourselves why.
Given that the $\epsilon_j$ are independent and identically distributed (i.i.d.), we can condition on $\epsilon_i$ and integrate over its distribution to account for all possible values:
$$P(i) = \int_{-\infty}^{\infty} \prod_{j \neq i} F(\epsilon_i + V_i - V_j) \, f(\epsilon_i) \, d\epsilon_i$$
where $F(\cdot)$ is the cumulative distribution function (CDF) of the $\epsilon_j$ and $f(\cdot)$ is the probability density function (PDF) of $\epsilon_i$.
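Under the Gumbel assumption made in the next step, this integral can be evaluated numerically and compared against the softmax probability. A sketch with hypothetical logits and an integration grid of my own choosing (the standard Gumbel has CDF $F(x) = e^{-e^{-x}}$ and PDF $f(x) = e^{-x - e^{-x}}$):

```python
import numpy as np

V = np.array([2.0, 1.0, 0.5])   # hypothetical logits
i = 0                           # compute P(i) for the first alternative

# Standard Gumbel CDF and PDF.
F = lambda x: np.exp(-np.exp(-x))
f = lambda x: np.exp(-x - np.exp(-x))

# Riemann-sum approximation of the integral over eps_i on a wide grid.
x = np.linspace(-10.0, 20.0, 300_001)
dx = x[1] - x[0]

integrand = f(x)
for j in range(V.size):
    if j != i:
        integrand = integrand * F(x + V[i] - V[j])

P_i = float(np.sum(integrand) * dx)

# The derivation says this should equal the softmax probability of alternative i.
softmax_i = float(np.exp(V[i]) / np.exp(V).sum())
```

The two numbers agree to high precision, which previews the closed-form result the integral is about to deliver.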
Assume that the $\epsilon_i$ are i.i.d. and follow a Gumbel (Type I Extreme Value) distribution: