The cruel beauty of life is that reality often operates in discrete jumps rather than smooth transitions. Success and failure in many random settings is like a light switch: it is either on or off, with no dimmer setting. This can feel particularly harsh because our human intuition wants to recognize "degrees of success", give credit for effort and near misses, and, more importantly, learn from them; many real-world systems simply don't work that way. In this post, we will see how to deal with this problem in the context of reinforcement learning.
Familiar Scenario
If you are an econometrician or a machine learning engineer, then in many optimization problems, especially in supervised learning, you estimate expectations of loss functions using sample averages and compute gradients directly:

$$L(\theta) = \mathbb{E}_{x \sim p(x)}\left[f_\theta(x)\right] \approx \frac{1}{N}\sum_{i=1}^{N} f_\theta(x_i)$$

We can then easily compute gradients with respect to $\theta$:

$$\nabla_\theta L(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta f_\theta(x_i)$$

The function $f_\theta$ is directly parameterized by $\theta$, and the $x_i$ are samples from a fixed data distribution $p(x)$ that does not depend on $\theta$. We can compute $\nabla_\theta f_\theta(x_i)$ because $f_\theta$ is differentiable with respect to $\theta$.
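For concreteness, here is a minimal sketch of this familiar setup, assuming a toy squared-error $f_\theta(x) = (\theta - x)^2$ and synthetic data (both choices are hypothetical, purely for illustration):

```python
import torch

# Hypothetical setup: f_theta(x) = (theta - x)^2, with data drawn from a fixed
# distribution that does NOT depend on theta.
torch.manual_seed(0)
x = torch.randn(1000)                      # samples from a fixed data distribution
theta = torch.tensor(0.5, requires_grad=True)

loss = ((theta - x) ** 2).mean()           # sample average approximates E[f_theta(x)]
loss.backward()                            # gradient flows through f_theta directly

print(theta.grad)                          # roughly 2 * (theta - mean of x)
```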
The Problem
Now what happens when the parameter $\theta$ is the parameter of the distribution of $x$? In math, the objective function is now:

$$J(\theta) = \mathbb{E}_{x \sim p_\theta(x)}\left[f(x)\right] = \int f(x)\, p_\theta(x)\, dx$$
And say, as every normal person would, you want to compute the gradient $\nabla_\theta J(\theta)$ to use in optimization algorithms like gradient ascent or descent. But you have one problem: sampling is non-differentiable, because it introduces a discontinuity; small changes in the input probabilities can result in discrete, non-continuous jumps in the output.
For example, consider sampling from $\text{Categorical}(0.5, 0.5)$. A small change to $\text{Categorical}(0.5 + \epsilon, 0.5 - \epsilon)$ for a tiny $\epsilon$ could still result in sampling either category 1 or category 2. However, the output will jump from one category to the other depending on the sampled outcome, creating a discontinuity.
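A quick way to see this numerically (a toy illustration using PyTorch's Categorical distribution):

```python
import torch

# The sampled outcome is a discrete jump, not a smooth function of the probabilities,
# so autograd cannot propagate gradients through the sampling step.
probs = torch.tensor([0.5, 0.5], requires_grad=True)
dist = torch.distributions.Categorical(probs=probs)

sample = dist.sample()
print(sample)                    # tensor(0) or tensor(1), never anything in between
print(sample.requires_grad)      # False: the autograd graph stops at the sample
```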
Likelihood Ratio Trick to the Rescue
The key insight is to express the gradient of the expectation in a way that moves the gradient operator inside the expectation.
Under regularity conditions (the Leibniz integral rule and the dominated convergence theorem, which remain an ambition of yours truly for another post), we can interchange the gradient and the integral:

$$\nabla_\theta J(\theta) = \nabla_\theta \int f(x)\, p_\theta(x)\, dx = \int f(x)\, \nabla_\theta p_\theta(x)\, dx$$
We can express $\nabla_\theta p_\theta(x)$ using the log-derivative trick as $\nabla_\theta p_\theta(x) = p_\theta(x)\, \nabla_\theta \log p_\theta(x)$. Plugging that into the integral and writing the integral as an expectation, our gradient becomes:

$$\nabla_\theta J(\theta) = \int f(x)\, p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, dx = \mathbb{E}_{x \sim p_\theta(x)}\left[f(x)\, \nabla_\theta \log p_\theta(x)\right]$$
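As a sanity check on this formula, here is a minimal sketch (my own illustrative setup, not taken from the references) using a categorical $p_\theta(x)$ with learnable logits and a fixed $f(x)$:

```python
import torch

# Score-function (likelihood-ratio) gradient estimate:
#   grad J(theta) ~= (1/N) * sum_i f(x_i) * grad log p_theta(x_i),  x_i ~ p_theta
torch.manual_seed(0)
theta = torch.zeros(3, requires_grad=True)          # logits of a 3-way categorical
dist = torch.distributions.Categorical(logits=theta)

f = torch.tensor([1.0, 2.0, 0.0])                   # f(x) per category; f itself is never differentiated

N = 10_000
x = dist.sample((N,))                               # x_i ~ p_theta(x)
log_probs = dist.log_prob(x)                        # log p_theta(x_i), differentiable w.r.t. theta
surrogate = (f[x] * log_probs).mean()               # its gradient is the estimator above
surrogate.backward()

print(theta.grad)                                   # Monte Carlo estimate of grad_theta E[f(x)]
```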
Intuition
- Score Function $\nabla_\theta \log p_\theta(x)$: Measures how the log-probability of $x$ changes with $\theta$.
- Weighting by $f(x)$: Tells us how much $x$ contributes to the expected value we're optimizing.
- Gradient as Expectation: By expressing the gradient as an expectation, we can estimate it using Monte Carlo sampling: $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i)\, \nabla_\theta \log p_\theta(x_i)$ with $x_i \sim p_\theta(x)$.
- This formulation of the gradient is useful because we often parameterize $p_\theta(x)$ with a neural network.
- This estimator of the gradient often has high variance, so there is an entire literature on reducing the variance of the estimator (a small illustration of one common fix follows this list).
- There are other tricks that make sampling differentiable, such as the Gumbel-Softmax and the reparameterization trick.
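As promised above, here is a rough illustration of the variance issue and of one standard fix, subtracting a constant baseline $b$ from $f(x)$, which leaves the estimator unbiased because $\mathbb{E}_{x \sim p_\theta}\left[\nabla_\theta \log p_\theta(x)\right] = 0$. The setup reuses the toy categorical example and is purely illustrative:

```python
import torch

# Compare the raw score-function estimator with a baseline-subtracted version.
# Subtracting a constant baseline keeps the estimate unbiased but typically shrinks its variance.
torch.manual_seed(0)
theta = torch.zeros(3, requires_grad=True)
f = torch.tensor([1.0, 2.0, 0.0])

def grad_estimate(batch_size, baseline=0.0):
    dist = torch.distributions.Categorical(logits=theta)
    x = dist.sample((batch_size,))
    log_probs = dist.log_prob(x)
    surrogate = ((f[x] - baseline) * log_probs).mean()
    (g,) = torch.autograd.grad(surrogate, theta)
    return g

estimates_raw = torch.stack([grad_estimate(32) for _ in range(500)])
estimates_bl = torch.stack([grad_estimate(32, baseline=f.mean()) for _ in range(500)])

print(estimates_raw.mean(0), estimates_raw.var(0))   # roughly the same mean ...
print(estimates_bl.mean(0), estimates_bl.var(0))     # ... with noticeably smaller variance
```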
REINFORCE Algorithm
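The references below cover REINFORCE in detail; as a rough sketch of the idea, the snippet below applies the likelihood-ratio gradient to a bandit-style problem, where the sampled action plays the role of $x$ and the noisy reward plays the role of $f(x)$ (all names and numbers here are made up for illustration):

```python
import torch

# Minimal bandit-style REINFORCE loop (illustrative): a categorical policy is the
# distribution p_theta, and the observed reward weights the score function.
torch.manual_seed(0)
true_rewards = torch.tensor([1.0, 5.0, 2.0])          # hypothetical expected reward per arm
logits = torch.zeros(3, requires_grad=True)           # policy parameters theta
optimizer = torch.optim.Adam([logits], lr=0.1)

for step in range(200):
    policy = torch.distributions.Categorical(logits=logits)
    actions = policy.sample((64,))                               # a_i ~ pi_theta
    rewards = true_rewards[actions] + 0.1 * torch.randn(64)      # noisy returns
    baseline = rewards.mean()                                    # simple variance-reducing baseline
    loss = -((rewards - baseline) * policy.log_prob(actions)).mean()  # gradient ascent on J via minimizing -J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))   # probability mass should concentrate on the best arm
```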
References
- Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4), 229-256. (REINFORCE)
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.