Likelihood Ratio Trick a.k.a. REINFORCE

The cruel beauty of life is that reality often operates in discrete jumps rather than smooth transitions. Success and failure in some random context is like a light switch - it’s either on or off, with no dimmer setting. This can feel particularly harsh because our human intuition often wants to recognize “degrees of success,” give credit for effort and near-misses, and, more importantly, learn from them, but many real-world systems don’t work that way. Well, in this post, we will see how to deal with this problem in the context of reinforcement learning.

Familiar Scenario

If you are an econometrician or a machine learning engineer, then in many optimization problems, especially in supervised learning, you often estimate expectations of loss functions using sample averages and compute gradients directly:

$$J(\theta) = \mathbb{E}_{x \sim p_{\text{data}}} \left[ f_\theta(x) \right] \approx \frac{1}{N} \sum_{i=1}^N f_\theta \left( x^{(i)} \right)$$

We can then easily compute gradients with respect to $\theta$:

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^N \nabla_\theta f_\theta \left( x^{(i)} \right)$$

The function $f_\theta(x)$ is directly parameterized by $\theta$, and the $x^{(i)}$ are samples from a fixed data distribution independent of $\theta$. We can compute $\nabla_\theta f_\theta \left( x^{(i)} \right)$ because $f_\theta$ is differentiable with respect to $\theta$.
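Here is a minimal numerical sketch of this sample-average estimate, assuming a scalar parameter $\theta$ and a toy loss $f_\theta(x) = (\theta - x)^2$ (both are illustrative choices, not something fixed by the discussion above):

```python
import numpy as np

rng = np.random.default_rng(0)

theta = 0.5
# Samples from a fixed data distribution; they do not depend on theta.
x = rng.normal(loc=2.0, scale=1.0, size=1000)

# J(theta) ~ (1/N) * sum_i f_theta(x_i), with f_theta(x) = (theta - x)^2
J_hat = np.mean((theta - x) ** 2)

# grad_theta f_theta(x) = 2 * (theta - x), averaged over the same samples
grad_hat = np.mean(2.0 * (theta - x))

print(J_hat, grad_hat)
```

Because the samples never depend on $\theta$, averaging the per-sample gradients is all there is to it.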

The Problem

Now what happens when the parameter $\theta$ parameterizes the distribution of $x$ itself? In math, the objective function is now:

$$J(\theta) = \mathbb{E}_{x \sim p_\theta} [f(x)]$$

And say, as every normal person would, you want to compute the gradient $\nabla_\theta J(\theta)$ to use in optimization algorithms like gradient ascent or descent. But you have one problem. Sampling is non-differentiable: it introduces a discontinuity, because small changes in the input probabilities $p$ can result in discrete, non-continuous jumps in the output $y$.

For example: Consider sampling from $p = [0.4, 0.6]$. A small change to $p = [0.4 + \epsilon, 0.6 - \epsilon]$ for a tiny $\epsilon$ could still result in sampling either category 1 or 2. However, the output $y$ will jump from $[1, 0]$ to $[0, 1]$ depending on the sampled outcome, creating a discontinuity.
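A small sketch of that jump, assuming we encode the sampled category as a one-hot vector (a hypothetical setup just to make the discontinuity visible):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_one_hot(p):
    """Draw one category from probabilities p and return it as a one-hot vector y."""
    p = np.asarray(p, dtype=float)
    k = rng.choice(len(p), p=p / p.sum())  # normalize defensively
    y = np.zeros(len(p))
    y[k] = 1.0
    return y

eps = 1e-3
print(sample_one_hot([0.4, 0.6]))              # e.g. [0. 1.]
print(sample_one_hot([0.4 + eps, 0.6 - eps]))  # still either [1. 0.] or [0. 1.]

# The map p -> y is piecewise constant: nudging p by eps never nudges y by eps.
# y either stays the same or jumps to a different one-hot vector, so there is
# no useful derivative dy/dp to backpropagate through the sampling step.
```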

Likelihood Ratio Trick to the Rescue

The key insight is to express the gradient of the expectation in a way that moves the gradient operator inside the expectation.

$$\nabla_\theta J(\theta) = \nabla_\theta \mathbb{E}_{x \sim p_\theta} [f(x)] = \nabla_\theta \int f(x) \, p_\theta(x) \, dx$$

Under regularity conditions (the Leibniz Integral Rule and the Dominated Convergence Theorem, which remain an ambition for yours truly for another post), we can interchange the gradient and the integral:

$$\nabla_\theta J(\theta) = \int f(x) \, \nabla_\theta p_\theta(x) \, dx$$

We can express $\nabla_\theta p_\theta(x)$ using the log-derivative trick: since $\nabla_\theta \log p_\theta(x) = \frac{\nabla_\theta p_\theta(x)}{p_\theta(x)}$ by the chain rule, we have $\nabla_\theta p_\theta(x) = p_\theta(x) \nabla_\theta \log p_\theta(x)$. Plugging that into the integral and rewriting the integral as an expectation, our gradient becomes:

$$\nabla_\theta J(\theta) = \int f(x) \, p_\theta(x) \, \nabla_\theta \log p_\theta(x) \, dx$$

$$\nabla_\theta J(\theta) = \mathbb{E}_{x \sim p_\theta} \left[ f(x) \, \nabla_\theta \log p_\theta(x) \right]$$
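This expectation can be estimated with plain Monte Carlo sampling. Below is a minimal sketch, assuming $x$ is a categorical variable with $p_\theta = \mathrm{softmax}(\theta)$ and a fixed reward table $f$; these are illustrative assumptions, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.0, 0.0, 0.0])  # logits parameterizing p_theta = softmax(theta)
f = np.array([1.0, 0.0, 5.0])      # f(x) for each of the three categories

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

N = 100_000
p = softmax(theta)
xs = rng.choice(len(p), size=N, p=p)  # x^(i) ~ p_theta

# For a categorical distribution, grad_theta log p_theta(x) = one_hot(x) - softmax(theta)
one_hot = np.eye(len(p))[xs]
score = one_hot - p

# Monte Carlo estimate of E_{x ~ p_theta}[ f(x) * grad_theta log p_theta(x) ]
grad_hat = (f[xs][:, None] * score).mean(axis=0)

# Exact gradient of J(theta) = sum_k softmax(theta)_k * f_k, for comparison
grad_exact = p * (f - p @ f)

print(grad_hat, grad_exact)
```

Note that the estimator only needs samples from $p_\theta$ and the gradient of $\log p_\theta$; it never differentiates through the sampling step itself, which is exactly what sidesteps the discontinuity described above.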

Intuition

REINFORCE Algorithm

References
