Likelihood Ratio Trick a.k.a. REINFORCE

The cruel beauty of life is that reality often operates in discrete jumps rather than smooth transitions. Success and failure in some random context is like a light switch - it’s either on or off, with no dimmer setting. This can feel particularly harsh because our human intuition often wants to recognize “degrees of success,” give credit for effort and near-misses, and, more importantly, learn from them, but many real-world systems don’t work that way. Well, in this post, we will see how to deal with this problem in the context of reinforcement learning.

Familiar Scenario

If you are an econometrician or a machine learning engineer, then in many optimization problems, especially in supervised learning, you often estimate expectations of loss functions using sample averages and compute gradients directly:

$$J(\theta) = \mathbb{E}_{x \sim p_{\text{data}}} \left[ f_\theta(x) \right] \approx \frac{1}{N} \sum_{i=1}^N f_\theta \left( x^{(i)} \right)$$

We can then easily compute gradients with respect to $\theta$:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta f_\theta \left( x^{(i)} \right)$$

The function $f_\theta(x)$ is directly parameterized by $\theta$, and the $x^{(i)}$ are samples from a fixed data distribution independent of $\theta$. We can compute $\nabla_\theta f_\theta \left( x^{(i)} \right)$ because $f_\theta$ is differentiable with respect to $\theta$.
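Here is a minimal numerical sketch of this sample-average estimate, assuming a scalar parameter $\theta$ and a toy loss $f_\theta(x) = (\theta - x)^2$ (both are illustrative choices, not something fixed by the discussion above):

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.5
# Samples from a fixed data distribution; they do not depend on theta.
x = rng.normal(loc=2.0, scale=1.0, size=1000)

# J(theta) ~ (1/N) * sum_i f_theta(x_i), with f_theta(x) = (theta - x)^2
J_hat = np.mean((theta - x) ** 2)

# grad_theta f_theta(x) = 2 * (theta - x), averaged over the same samples
grad_hat = np.mean(2.0 * (theta - x))

print(J_hat, grad_hat)
```

Because the samples never depend on $\theta$, averaging the per-sample gradients is all there is to it.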

The Problem

Now what happens when the parameter $\theta$ parameterizes the distribution of $x$ itself? In math, the objective function is now:

$$J(\theta) = \mathbb{E}_{x \sim p_\theta} [f(x)]$$

And say, as every normal person would, you want to compute the gradient $\nabla_\theta J(\theta)$ to use in optimization algorithms like gradient ascent or descent. But you have one problem. Sampling is non-differentiable: it introduces a discontinuity, because small changes in the input probabilities $p$ can result in discrete, non-continuous jumps in the output $y$.

For example: Consider sampling from $p = [0.4, 0.6]$. A small change to $p = [0.4 + \epsilon, 0.6 - \epsilon]$ for a tiny $\epsilon$ could still result in sampling either category 1 or 2. However, the output $y$ will jump from $[1, 0]$ to $[0, 1]$ depending on the sampled outcome, creating a discontinuity.
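A small sketch of that jump, assuming we encode the sampled category as a one-hot vector (a hypothetical setup just to make the discontinuity visible):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_one_hot(p):
    """Draw one category from probabilities p and return it as a one-hot vector y."""
    p = np.asarray(p, dtype=float)
    k = rng.choice(len(p), p=p / p.sum())  # normalize defensively
    y = np.zeros(len(p))
    y[k] = 1.0
    return y

eps = 1e-3
print(sample_one_hot([0.4, 0.6]))              # e.g. [0. 1.]
print(sample_one_hot([0.4 + eps, 0.6 - eps]))  # still either [1. 0.] or [0. 1.]

# The map p -> y is piecewise constant: nudging p by eps never nudges y by eps.
# y either stays the same or jumps to a different one-hot vector, so there is
# no useful derivative dy/dp to backpropagate through the sampling step.
```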

Likelihood Ratio Trick to the Rescue

The key insight is to express the gradient of the expectation in a way that moves the gradient operator inside the expectation.

$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{x \sim p_\theta} [f(x)] = \nabla_\theta \int f(x) \, p_\theta(x) \, dx$$

Under regularity conditions (the Leibniz Integral Rule and the Dominated Convergence Theorem, which remain an ambition for yours truly for another post), we can interchange the gradient and the integral:

$$\nabla_\theta J(\theta) = \int f(x) \, \nabla_\theta p_\theta(x) \, dx$$

We can express $\nabla_\theta p_\theta(x)$ using the log-derivative trick: since $\nabla_\theta \log p_\theta(x) = \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}$ by the chain rule, we have $\nabla_\theta p_\theta(x) = p_\theta(x) \nabla_\theta \log p_\theta(x)$. Plugging that into the integral and rewriting the integral as an expectation, our gradient becomes:

$$\nabla_\theta J(\theta) = \int f(x) \, p_\theta(x) \, \nabla_\theta \log p_\theta(x) \, dx$$

$$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim p_\theta} \left[ f(x) \, \nabla_\theta \log p_\theta(x) \right]$$
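This expectation can be estimated with plain Monte Carlo sampling. Below is a minimal sketch, assuming $x$ is a categorical variable with $p_\theta = \mathrm{softmax}(\theta)$ and a fixed reward table $f$; these are illustrative assumptions, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.0, 0.0, 0.0])  # logits parameterizing p_theta = softmax(theta)
f = np.array([1.0, 0.0, 5.0])      # f(x) for each of the three categories

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

N = 100_000
p = softmax(theta)
xs = rng.choice(len(p), size=N, p=p)  # x^(i) ~ p_theta

# For a categorical distribution, grad_theta log p_theta(x) = one_hot(x) - softmax(theta)
one_hot = np.eye(len(p))[xs]
score = one_hot - p

# Monte Carlo estimate of E_{x ~ p_theta}[ f(x) * grad_theta log p_theta(x) ]
grad_hat = (f[xs][:, None] * score).mean(axis=0)

# Exact gradient of J(theta) = sum_k softmax(theta)_k * f_k, for comparison
grad_exact = p * (f - p @ f)

print(grad_hat, grad_exact)
```

Note that the estimator only needs samples from $p_\theta$ and the gradient of $\log p_\theta$; it never differentiates through the sampling step itself, which is exactly what sidesteps the discontinuity described above.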

Intuition

REINFORCE Algorithm

References
