REINFORCE
Policy-Based Methods and Monte-Carlo Policy Gradients
In the previous chapter we scaled value-based reinforcement learning to high-dimensional observations by approximating the action-value function $Q(s, a)$ with a neural network. This unlocks impressive applications, but it also exposes a structural limitation of value-based control: in order to act, we must solve
$$a_t = \arg\max_{a \in \mathcal{A}} Q(s_t, a), \tag{1}$$
which is straightforward only when the action space is small and discrete. For continuous control (robot torque commands, steering angles, throttle), the maximization becomes a nontrivial optimization problem at every time step (Lillicrap et al., 2016; Sutton & Barto, 2018). Moreover, value-based methods typically induce a deterministic policy at exploitation time (greedy), or one with only fixed-rate uniform exploration ($\epsilon$-greedy). Such policies can be brittle under partial observability or perceptual aliasing, where identical observations require different behaviors and stochastic policies can strictly dominate deterministic ones (Chrisman, 1992; Singh et al., 1994; Sutton & Barto, 2018).
This chapter introduces the complementary viewpoint: policy-based reinforcement learning, where we represent the policy explicitly as a parameterized distribution $\pi_\theta(a \mid s)$ and optimize it directly to maximize expected return (Sutton & Barto, 2018; Hugging Face, 2024). We will derive the fundamental gradient estimator behind modern policy optimization, and we will study the simplest policy-gradient algorithm, REINFORCE (Williams, 1992).
Value-Based vs. Policy-Based Methods¶
Let’s recap the two algorithm families in model-free deep RL (Sutton & Barto, 2018; Hugging Face, 2024):
Value-based methods learn a value function, such as $Q(s, a)$, and derive a policy implicitly, e.g. by acting greedily. Q-Learning and DQN are canonical examples (Watkins & Dayan, 1992; Mnih et al., 2015).
Policy-based methods parameterize the policy $\pi_\theta(a \mid s)$ directly and optimize the parameters $\theta$ to maximize expected return.
The key shift is that in policy-based methods, the policy is not “read off” from a value function. It is a first-class object we optimize.
Figure 2: Value-based control (left) selects actions by comparing action-values, while policy-based control (right) samples actions directly from a parameterized distribution $\pi_\theta(a \mid s)$.
Policy-Based vs. Policy-Gradient Methods¶
Policy-based methods are a broad class. Some approaches search policy space without gradients, e.g. evolution strategies (Salimans et al., 2017) or random search (Mania et al., 2018), which have been shown to be competitive with gradient-based deep RL on several continuous-control benchmarks. Policy-gradient methods are the subset that performs gradient ascent on an objective $J(\theta)$ using an estimate of its gradient $\nabla_\theta J(\theta)$ (Sutton & Barto, 2018; Hugging Face, 2024). REINFORCE is the simplest example: it uses Monte-Carlo returns from full episodes to estimate that gradient (Williams, 1992).
Parameterizing a Stochastic Policy¶
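Policy gradients are usually formulated for stochastic policies. For discrete actions, a neural network can output a vector of logits $z_\theta(s)$, one entry per action, and define
$$\pi_\theta(a \mid s) = \frac{\exp\big(z_\theta(s)_a\big)}{\sum_{a'} \exp\big(z_\theta(s)_{a'}\big)}. \tag{2}$$
The policy then samples an action $a_t \sim \pi_\theta(\cdot \mid s_t)$ during interaction, rather than selecting $\arg\max_a Q(s_t, a)$.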
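As a rough illustration of the softmax parameterization, the sketch below turns a vector of logits into action probabilities and samples from them. The logits are hard-coded stand-ins for a network output, and the helper names `softmax` and `sample_action` are hypothetical, chosen only for this example.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into action probabilities."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng=random):
    """Sample an action index from the softmax distribution over the logits."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1]  # assumed network output for a single state
    print("probabilities:", softmax(logits))
    print("sampled action:", sample_action(logits, random.Random(0)))
```

Sampling, rather than taking the largest logit, is what keeps every action with non-zero probability in play during interaction.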
For continuous actions, the same idea applies: the network outputs the parameters of a distribution, commonly a Gaussian with mean $\mu_\theta(s)$ and (diagonal) standard deviation $\sigma_\theta(s)$:
$$\pi_\theta(a \mid s) = \mathcal{N}\big(a;\ \mu_\theta(s),\ \operatorname{diag}(\sigma_\theta(s)^2)\big). \tag{3}$$
We will return to the continuous case in later chapters when we discuss DDPG and PPO; for now, the key message is that the policy is differentiable with respect to $\theta$ even when the environment dynamics are not.
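To make the Gaussian case concrete, here is a minimal sketch of the log-density of a diagonal Gaussian policy and its gradient with respect to the mean. The mean and standard deviation are passed in as plain lists standing in for network outputs; all function names are hypothetical.

```python
import math
import random


def gaussian_log_prob(action, mean, std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, mean, std)
    )


def grad_log_prob_wrt_mean(action, mean, std):
    """Gradient of the log-density with respect to the mean: (a - mu) / sigma^2."""
    return [(a - m) / (s * s) for a, m, s in zip(action, mean, std)]


def sample_gaussian_action(mean, std, rng=random):
    """Draw one action, sampling each dimension independently."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


if __name__ == "__main__":
    mean, std = [0.2, -0.1], [0.5, 1.0]  # assumed network outputs for one state
    action = sample_gaussian_action(mean, std, random.Random(0))
    print(gaussian_log_prob(action, mean, std))
    print(grad_log_prob_wrt_mean(action, mean, std))
```

The gradient with respect to the mean is available in closed form, which is the sense in which the policy is differentiable regardless of how the environment responds to the sampled action.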
Figure 3: A stochastic policy network. Given a state, a neural network outputs a distribution over actions. Discrete actions are typically produced via a softmax; continuous actions are often modeled with a Gaussian head.
Why Policy Gradients? Advantages and Disadvantages¶
Policy-gradient methods are not a strict replacement for value-based methods; they solve different pain points.
Advantages¶
They can learn stochastic policies. This provides “built-in” exploration (the policy itself randomizes) and can resolve perceptual aliasing where a deterministic policy would get stuck (Hugging Face, 2024; Sutton & Barto, 2018).
They handle continuous and high-dimensional action spaces naturally. We output distribution parameters, rather than needing to compute $\arg\max_a Q(s, a)$ over an uncountable action set (Sutton & Barto, 2018).
They change action probabilities smoothly. Small parameter updates typically induce small changes in $\pi_\theta(a \mid s)$, unlike the hard $\arg\max$ operator, which can flip the greedy action abruptly after a tiny change in $Q$ (Hugging Face, 2024).
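The smoothness point can be seen in a small numerical sketch. The two value vectors below are made-up numbers that differ by 0.001; they are fed both to a greedy rule and to a softmax, and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into action probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_action(values):
    """Index of the largest value, i.e. the hard argmax."""
    return max(range(len(values)), key=lambda i: values[i])


before = [1.000, 1.001]  # assumed estimates for two actions
after = [1.001, 1.000]   # the same estimates after a tiny perturbation

# The greedy choice switches from action 1 to action 0.
print(greedy_action(before), greedy_action(after))

# The softmax probability of action 0 barely moves.
print(softmax(before)[0], softmax(after)[0])
```

The greedy rule changes its behavior completely, while the sampled policy changes its action probabilities by well under one percent.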
Disadvantages¶
They can converge to local optima. The objective $J(\theta)$ is generally non-convex in $\theta$ (Sutton & Barto, 2018).
They can be sample-inefficient. Many classical policy-gradient algorithms are on-policy and discard data after each update.
They can have high-variance gradient estimates. REINFORCE in particular is an unbiased estimator with notoriously high variance, motivating baselines and critics (Williams, 1992; Greensmith et al., 2004).
The Policy Objective¶
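We want to maximize expected return. Let a trajectory be
$$\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \dots, s_{T-1}, a_{T-1}, r_{T-1}, s_T), \tag{4}$$
generated by interaction with an environment when following $\pi_\theta$, where $r_t$ denotes the reward received after taking action $a_t$ in state $s_t$. Define the (discounted) return of the trajectory as
$$R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_t, \qquad \gamma \in [0, 1]. \tag{5}$$
The standard objective is the expected return:
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]. \tag{6}$$
This is exactly the quantity we care about: how well the agent does on average when sampling actions from $\pi_\theta$.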
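In practice the expectation is approximated by averaging the returns of sampled episodes. The sketch below assumes each episode is given as a list of rewards, with the reward at index `t` being the one received after the action at step `t` (an indexing assumption made for this example); the function names and reward values are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Discounted return of one trajectory: sum over t of gamma^t * r_t."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def estimate_objective(reward_sequences, gamma):
    """Monte-Carlo estimate of the expected return from sampled episodes."""
    returns = [discounted_return(rs, gamma) for rs in reward_sequences]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    episodes = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]  # made-up reward sequences
    print(estimate_objective(episodes, gamma=0.5))
```

With more sampled episodes the average concentrates around the expected return, provided all episodes were generated by the same policy.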
From $J(\theta)$ to a Usable Gradient¶
At first glance, differentiating $J(\theta)$ looks hopeless: $J(\theta)$ depends on the probability of trajectories, which depends not only on the policy but also on the environment dynamics. The core insight behind policy gradients is that we can estimate $\nabla_\theta J(\theta)$ without differentiating the environment.
Trajectory probability¶
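For an MDP with initial state distribution $\rho_0(s_0)$ and transition dynamics $p(s_{t+1} \mid s_t, a_t)$, the probability of a trajectory $\tau$ under policy $\pi_\theta$ is
$$P(\tau; \theta) = \rho_0(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t). \tag{7}$$
Only the policy term $\pi_\theta(a_t \mid s_t)$ depends on $\theta$.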
The log-likelihood trick (REINFORCE trick)¶
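The identity we will use comes from the chain rule applied to $\log P(\tau; \theta)$:
$$\nabla_\theta \log P(\tau; \theta) = \frac{\nabla_\theta P(\tau; \theta)}{P(\tau; \theta)} \quad\Longleftrightarrow\quad \nabla_\theta P(\tau; \theta) = P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta). \tag{8}$$
The right-hand form is the useful one: it lets us turn a gradient of $P(\tau; \theta)$ (which we don’t have a handle on) into a gradient of $\log P(\tau; \theta)$ weighted by $P(\tau; \theta)$. Weighting by $P(\tau; \theta)$ is exactly what turns a sum over trajectories into an expectation under $\pi_\theta$.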
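Writing the expectation in (6) explicitly as a sum over trajectories,
$$J(\theta) = \sum_{\tau} P(\tau; \theta)\, R(\tau), \tag{9}$$
we differentiate and apply the identity above:
$$\nabla_\theta J(\theta) = \sum_{\tau} \nabla_\theta P(\tau; \theta)\, R(\tau) = \sum_{\tau} P(\tau; \theta)\, \nabla_\theta \log P(\tau; \theta)\, R(\tau) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\nabla_\theta \log P(\tau; \theta)\, R(\tau)\big]. \tag{10}$$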
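Now expand the log of (7), using $\log(ab) = \log a + \log b$ to turn the product into a sum:
$$\log P(\tau; \theta) = \log \rho_0(s_0) + \sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t) + \sum_{t=0}^{T-1} \log p(s_{t+1} \mid s_t, a_t). \tag{11}$$
The first and last terms do not depend on $\theta$, so their gradients vanish. Therefore:
$$\nabla_\theta \log P(\tau; \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t). \tag{12}$$
Putting everything together gives the fundamental policy-gradient estimator:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right]. \tag{13}$$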
This is the REINFORCE estimator introduced by Williams (1992). It is unbiased, but (as we will discuss) it can have high variance.
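The derivation can be sanity-checked numerically in the smallest possible setting: a one-step problem in which a trajectory is a single action, so the sum over trajectories can be enumerated exactly. The sketch below assumes a softmax policy whose parameters are the logits themselves and made-up rewards, and compares the likelihood-ratio gradient with a finite-difference gradient of the objective. All names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def objective(theta, rewards):
    """Exact expected return: sum over actions of pi(a) * R(a)."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def score(theta, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(theta)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def likelihood_ratio_gradient(theta, rewards):
    """Sum over actions of pi(a) * grad log pi(a) * R(a)."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for a, (p, r) in enumerate(zip(probs, rewards)):
        s = score(theta, a)
        for i in range(len(theta)):
            grad[i] += p * s[i] * r
    return grad


def finite_difference_gradient(theta, rewards, eps=1e-6):
    """Central-difference approximation of the gradient of the objective."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((objective(up, rewards) - objective(down, rewards)) / (2 * eps))
    return grad


if __name__ == "__main__":
    theta, rewards = [0.3, -0.2, 0.1], [1.0, 0.0, 2.0]  # made-up values
    print(likelihood_ratio_gradient(theta, rewards))
    print(finite_difference_gradient(theta, rewards))
```

The two gradients agree up to finite-difference error, even though the likelihood-ratio version never differentiates the reward itself.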
Causality: from $R(\tau)$ to return-to-go¶
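A small but important refinement: although $R(\tau)$ multiplies every time step in the sum above, rewards received before time $t$ are determined by actions taken before $t$ and so are independent of $a_t$. Their expected gradient contribution therefore vanishes, and we may replace $R(\tau)$ at each time step by the return-to-go
$$G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k. \tag{14}$$
Dropping the past rewards introduces no bias, and it reduces variance because we no longer multiply each log-probability by rewards that are independent of the action being scored (Sutton & Barto, 2018). Strictly speaking, for $\gamma < 1$ the exact gradient of the discounted objective carries an extra factor $\gamma^t$ on the term for time step $t$; most implementations omit this factor, and we follow that convention. This yields the form actually used in practice:
$$\nabla_\theta J(\theta) \approx \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right], \tag{15}$$
with equality when $\gamma = 1$.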
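The return-to-go for every time step of an episode can be computed in a single backward pass, using the recursion that the return from step t equals the reward at step t plus gamma times the return from step t+1. The sketch below assumes the rewards of one episode are given as a list; the function name is hypothetical.

```python
def returns_to_go(rewards, gamma):
    """Discounted return-to-go for every step, computed from the last step backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    print(returns_to_go([1.0, 2.0, 3.0], gamma=0.5))
```

The backward pass costs one multiplication and one addition per step, compared with a quadratic cost if each return were summed from scratch.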
The REINFORCE Algorithm (Monte-Carlo Policy Gradient)¶
REINFORCE is the “purest” policy-gradient method: it uses Monte-Carlo estimation from complete episodes to compute returns $G_t$, then performs a gradient ascent step with learning rate $\alpha$. Replacing the expectation with a single sampled trajectory and using the gradient form derived above, the REINFORCE update is
$$\theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1} G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t). \tag{16}$$
Intuition: $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ points in the direction that increases the probability of taking the sampled action $a_t$ at state $s_t$. If $G_t$ is large and positive, we push that action to become much more likely in the future; if $G_t$ is small, the push is weak; if $G_t$ is negative, we push the action's probability down (Williams, 1992; Sutton & Barto, 2018).
Figure 4: The REINFORCE training loop: collect a full episode using $\pi_\theta$, compute returns $G_t$, and perform gradient ascent on $J(\theta)$.
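Putting these pieces together gives the REINFORCE algorithm of Williams (1992):
- Initialize policy parameters $\theta$ (e.g. randomly) and choose a learning rate $\alpha$.
- For each episode:
  - Generate a full trajectory $\tau$ by sampling $a_t \sim \pi_\theta(\cdot \mid s_t)$ until termination.
  - For each time step $t = 0, \dots, T-1$, compute the return-to-go $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$.
  - Update the parameters with one gradient-ascent step: $\theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1} G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.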
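The steps above can be sketched end to end on a toy problem. The example below is an illustration under stated assumptions, not a reference implementation: the environment is a made-up three-cell corridor with a reward of -1 per step and a goal to the right, the policy is a tabular softmax with one pair of logits per cell, and the hyperparameters are arbitrary.

```python
import math
import random

N_STATES = 3        # non-terminal corridor cells 0, 1, 2; cell 3 is the goal
LEFT, RIGHT = 0, 1
MAX_STEPS = 50      # cap on episode length (assumption for this toy problem)


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def step(state, action):
    """Hypothetical corridor dynamics: -1 reward per step, episode ends at cell 3."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    return next_state, -1.0, next_state == N_STATES


def run_episode(theta, rng):
    """Sample one episode with the current policy."""
    states, actions, rewards = [], [], []
    state = 0
    for _ in range(MAX_STEPS):
        action = rng.choices([LEFT, RIGHT], weights=softmax(theta[state]), k=1)[0]
        next_state, reward, done = step(state, action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:
            break
    return states, actions, rewards


def returns_to_go(rewards, gamma):
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_update(theta, states, actions, rewards, alpha, gamma):
    """One gradient-ascent step: theta += alpha * sum_t G_t * grad log pi(a_t | s_t)."""
    returns = returns_to_go(rewards, gamma)
    # Accumulate the full gradient first so every term uses the policy
    # that actually generated the episode.
    grad = [[0.0, 0.0] for _ in theta]
    for s, a, g in zip(states, actions, returns):
        probs = softmax(theta[s])
        for i in (LEFT, RIGHT):
            grad[s][i] += g * ((1.0 if i == a else 0.0) - probs[i])
    for s in range(len(theta)):
        for i in (LEFT, RIGHT):
            theta[s][i] += alpha * grad[s][i]


def train(num_episodes=5000, alpha=0.01, gamma=1.0, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]  # uniform initial policy
    for _ in range(num_episodes):
        states, actions, rewards = run_episode(theta, rng)
        reinforce_update(theta, states, actions, rewards, alpha, gamma)
    return theta


if __name__ == "__main__":
    for s, row in enumerate(train()):
        print("cell", s, "P(right) =", softmax(row)[RIGHT])
```

For a tabular softmax the score function has the closed form `one_hot(action) - probs`, which is why no automatic differentiation is needed here; with a neural-network policy the same update would normally be obtained by backpropagating through the log-probabilities.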
Reducing Variance with a Baseline¶
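The REINFORCE estimator is unbiased but, as noted, can have high variance. A simple and powerful variance-reduction trick is to subtract a baseline $b(s_t)$ from each return in the single-trajectory gradient estimate $\hat{g}$:
$$\hat{g} = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \big(G_t - b(s_t)\big). \tag{17}$$
The baseline must depend only on the state $s_t$, not on the action $a_t$. Under that condition it leaves the gradient unbiased:
$$\mathbb{E}_{a_t \sim \pi_\theta(\cdot \mid s_t)}\big[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, b(s_t)\big] = b(s_t) \sum_{a} \nabla_\theta \pi_\theta(a \mid s_t) = b(s_t)\, \nabla_\theta 1 = 0, \tag{18}$$
since the score function $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ has zero mean under $\pi_\theta$ (the sum becomes an integral for continuous actions). A natural choice is the state-value function $b(s_t) = V^{\pi_\theta}(s_t)$, in which case $G_t - V^{\pi_\theta}(s_t)$ becomes an estimate of the advantage $A^{\pi_\theta}(s_t, a_t)$. Replacing the Monte-Carlo return with a learned estimate of the advantage is exactly the step that takes us from REINFORCE to actor-critic methods (Greensmith et al., 2004).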
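The effect of a baseline can be checked exactly in a one-step problem, where the mean and variance of the gradient estimate can be computed by enumerating the actions instead of sampling. The sketch below assumes a two-action softmax policy over logits and made-up rewards of 10 and 12; the function names are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimator_stats(theta, rewards, baseline):
    """Exact per-component mean and variance of grad log pi(a) * (R(a) - baseline)."""
    probs = softmax(theta)
    n = len(theta)
    mean = [0.0] * n
    second_moment = [0.0] * n
    for a, (p, r) in enumerate(zip(probs, rewards)):
        for i in range(n):
            g = ((1.0 if i == a else 0.0) - probs[i]) * (r - baseline)
            mean[i] += p * g
            second_moment[i] += p * g * g
    variance = [m2 - m * m for m2, m in zip(second_moment, mean)]
    return mean, variance


if __name__ == "__main__":
    theta, rewards = [0.0, 0.0], [10.0, 12.0]  # made-up values
    print("no baseline:      ", estimator_stats(theta, rewards, baseline=0.0))
    print("baseline = E[R]:  ", estimator_stats(theta, rewards, baseline=11.0))
```

Both settings give the same mean gradient, but the variance drops sharply once the baseline is close to the typical reward, because the estimate no longer carries the large common offset shared by all actions.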
Practical Considerations (What People Actually Do)¶
Even for REINFORCE, implementations typically add a few pragmatic ingredients:
Batching episodes. Gradient estimates become more stable if we average over multiple trajectories before updating.
Entropy regularization. Adding an entropy bonus encourages exploration and prevents premature collapse to a near-deterministic policy.
Return normalization. Normalizing the batch of returns (or advantages) to zero mean and unit variance is a common heuristic that can improve optimization stability.
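Two of these ingredients are easy to sketch in isolation: normalizing a batch of returns and computing the entropy of a discrete action distribution. How the entropy term is weighted in the objective is a design choice not covered here; the function names and the small constant `eps` are assumptions for this example.

```python
import math


def normalize(values, eps=1e-8):
    """Shift and scale a batch of returns to zero mean and (roughly) unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / (std + eps) for v in values]  # eps guards against std == 0


def entropy(probs):
    """Shannon entropy of a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    print(normalize([1.0, 2.0, 3.0, 4.0]))
    print(entropy([0.5, 0.5]), entropy([0.99, 0.01]))
```

A near-deterministic policy has entropy close to zero, so adding an entropy bonus to the objective penalizes exactly the premature collapse described above.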
Limitations¶
REINFORCE makes the core idea of policy gradients visible, but it is rarely used as a final algorithm:
Its gradient estimates can have very high variance, even with baselines (Greensmith et al., 2004).
It is on-policy: the episodes collected for one update are discarded afterwards, so it can be data-hungry.
Summary¶
This chapter introduced policy-based reinforcement learning and derived the REINFORCE algorithm:
Policy-based methods optimize a parameterized policy directly, rather than deriving it from a learned value function.
Policy-gradient methods perform gradient ascent on expected return using sampled trajectories.
Using the log-likelihood trick, the gradient requires only $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ and sampled returns, not derivatives of the environment dynamics (Williams, 1992).
REINFORCE updates the policy using Monte-Carlo returns (or reward-to-go) and is unbiased but high-variance.
Baselines preserve unbiasedness and can greatly reduce variance, motivating advantages and actor-critic methods (Greensmith et al., 2004).
In the next chapter we will generalize this idea via the policy gradient theorem and replace the Monte-Carlo return with a learned critic, leading to actor-critic methods.
- Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous Control with Deep Reinforcement Learning. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1509.02971
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
- Chrisman, L. (1992). Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach. Proceedings of the Tenth National Conference on Artificial Intelligence, 183–188. https://aaai.org/papers/00183-aaai92-029-reinforcement-learning-with-perceptual-aliasing-the-perceptual-distinctions-approach/
- Singh, S. P., Jaakkola, T., & Jordan, M. I. (1994). Learning Without State-Estimation in Partially Observable Markovian Decision Processes. In W. W. Cohen & H. Hirsh (Eds.), Proceedings of the Eleventh International Conference on Machine Learning (ICML) (pp. 284–292). Morgan Kaufmann. http://www.eecs.umich.edu/~baveja/Papers/ML94.pdf
- Hugging Face. (2024). Hugging Face Deep Reinforcement Learning Course. Online course. https://huggingface.co/learn/deep-rl-course/en/unit0/introduction
- Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3–4), 229–256. https://doi.org/10.1007/BF00992696
- Watkins, C. J. C. H., & Dayan, P. (1992). Q-Learning. Machine Learning, 8(3–4), 279–292. https://doi.org/10.1007/BF00992698
- Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-Level Control through Deep Reinforcement Learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236
- Salimans, T., Ho, J., Chen, X., Sidor, S., & Sutskever, I. (2017). Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv Preprint arXiv:1703.03864. https://arxiv.org/abs/1703.03864
- Mania, H., Guy, A., & Recht, B. (2018). Simple Random Search of Static Linear Policies is Competitive for Reinforcement Learning. Advances in Neural Information Processing Systems (NeurIPS), 31. https://proceedings.neurips.cc/paper_files/paper/2018/hash/7634ea65a4e6d9041cfd3f7de18e334a-Abstract.html
- Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning. Journal of Machine Learning Research, 5, 1471–1530. https://www.jmlr.org/papers/v5/greensmith04a.html