Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

REINFORCE

Policy-Based Methods and Monte-Carlo Policy Gradients

In the previous chapter we scaled value-based reinforcement learning to high-dimensional observations by approximating Q(s,a)Q(s,a) with a neural network. This unlocks impressive applications, but it also exposes a structural limitation of value-based control: in order to act, we must solve

argmaxaQθ(s,a),\arg\max_a Q_\theta(s,a),

which is straightforward only when the action space is small and discrete. For continuous control (robot torque commands, steering angles, throttle), the maximization becomes a nontrivial optimization problem at every time step Lillicrap et al., 2016Sutton & Barto, 2018. Moreover, value-based methods typically induce a deterministic policy at exploitation time (greedy), or one with only fixed-rate uniform exploration (ϵ\epsilon-greedy), which can be brittle under partial observability or perceptual aliasing, where identical observations require different behaviors and stochastic policies can strictly dominate deterministic ones Chrisman, 1992Singh et al., 1994Sutton & Barto, 2018.

This chapter introduces the complementary viewpoint: policy-based reinforcement learning, where we represent the policy explicitly as a parameterized distribution πθ(as)\pi_\theta(a\mid s) and optimize it directly to maximize expected return Sutton & Barto, 2018Hugging Face, 2024. We will derive the fundamental gradient estimator behind modern policy optimization, and we will study the simplest policy-gradient algorithm, REINFORCE Williams, 1992.

Slides for this chapter (open full screen).

Value-Based vs. Policy-Based Methods

Let’s recap the two algorithm families in model-free deep RL Sutton & Barto, 2018Hugging Face, 2024:

The key shift is that in policy-based methods, the policy is not “read off” from a value function. It is a first-class object we optimize.

Value-based control (left) selects actions by comparing action-values, while policy-based control (right) samples actions directly from a parameterized distribution \pi_\theta(a\mid s).

Figure 2:Value-based control (left) selects actions by comparing action-values, while policy-based control (right) samples actions directly from a parameterized distribution πθ(as)\pi_\theta(a\mid s).

Policy-Based vs. Policy-Gradient Methods

Policy-based methods are a broad class. Some approaches search policy space without gradients, e.g. evolutionary strategies Salimans et al., 2017 or random search Mania et al., 2018, which have been shown to be competitive with gradient-based deep RL on several continuous-control benchmarks. Policy-gradient methods are the subset that performs gradient ascent on an objective J(θ)J(\theta) using the gradient of the objective θJ(θ)\nabla_\theta J(\theta) Sutton & Barto, 2018Hugging Face, 2024. REINFORCE is the simplest example: it uses Monte-Carlo returns from full episodes to estimate that gradient Williams, 1992.

Parameterizing a Stochastic Policy

Policy gradients are usually formulated for stochastic policies. For discrete actions, a neural network can output logits zθ(s)RAz_\theta(s)\in\mathbb{R}^{|\mathcal{A}|} and define

πθ(as)=softmax(zθ(s))a.\pi_\theta(a\mid s) = \mathrm{softmax}(z_\theta(s))_a.

The policy then samples an action Atπθ(St)A_t \sim \pi_\theta(\cdot\mid S_t) during interaction, rather than selecting argmaxaQθ(St,a)\arg\max_a Q_\theta(S_t,a).

For continuous actions, the same idea applies: the network outputs the parameters of a distribution, commonly a Gaussian with mean μθ(s)\mu_\theta(s) and (diagonal) standard deviation σθ(s)\sigma_\theta(s):

πθ(as)=N ⁣(a;μθ(s),diag(σθ(s)2)).\pi_\theta(a\mid s) = \mathcal{N}\!\left(a;\mu_\theta(s), \mathrm{diag}(\sigma_\theta(s)^2)\right).

We will return to the continuous case in later chapters when we discuss DDPG and PPO; for now, the key message is that the policy is differentiable with respect to θ\theta even when the environment dynamics are not.

A stochastic policy network. Given a state, a neural network outputs a distribution over actions. Discrete actions are typically produced via a softmax; continuous actions are often modeled with a Gaussian head.

Figure 3:A stochastic policy network. Given a state, a neural network outputs a distribution over actions. Discrete actions are typically produced via a softmax; continuous actions are often modeled with a Gaussian head.

Why Policy Gradients? Advantages and Disadvantages

Policy-gradient methods are not a strict replacement for value-based methods; they solve different pain points.

Advantages

  1. They can learn stochastic policies. This provides “built-in” exploration (the policy itself randomizes) and can resolve perceptual aliasing where a deterministic policy would get stuck Hugging Face, 2024Sutton & Barto, 2018.

  2. They handle continuous and high-dimensional action spaces naturally. We output distribution parameters, rather than needing to compute maxaQ(s,a)\max_a Q(s,a) over an uncountable action set Sutton & Barto, 2018.

  3. They change action probabilities smoothly. Small parameter updates typically induce small changes in πθ(as)\pi_\theta(a\mid s), unlike the hard max\max operator that can flip a greedy action abruptly for tiny changes in QQ Hugging Face, 2024.

Disadvantages

  1. They can converge to local optima. The objective is generally non-convex in θ\theta Sutton & Barto, 2018.

  2. They can be sample-inefficient. Many classical policy-gradient algorithms are on-policy and discard data after each update.

  3. They can have high-variance gradient estimates. REINFORCE in particular is an unbiased estimator with notoriously high variance, motivating baselines and critics Williams, 1992Greensmith et al., 2004.

The Policy Objective

We want to maximize expected return. Let a trajectory be

τ=(s0,a0,r1,s1,a1,r2,,sT1,aT1,rT),\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{T-1}, a_{T-1}, r_T),

generated by interaction with an environment when following πθ\pi_\theta. Define the (discounted) return of the trajectory as

R(τ)=t=0T1γtrt+1.R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_{t+1}.

The standard objective is the expected return:

J(θ)=Eτπθ[R(τ)].J(\theta) = \mathbb{E}_{\tau\sim \pi_\theta}\left[ R(\tau) \right].

This is exactly the quantity we care about: how well the agent does on average when sampling actions from πθ\pi_\theta.

From J(θ)J(\theta) to a Usable Gradient

At first glance, differentiating J(θ)J(\theta) looks hopeless: JJ depends on the probability of trajectories, which depends not only on the policy but also on the environment dynamics. The core insight behind policy gradients is that we can estimate θJ(θ)\nabla_\theta J(\theta) without differentiating the environment.

Trajectory probability

For an MDP with initial state distribution p(s0)p(s_0) and transition dynamics p(st+1st,at)p(s_{t+1}\mid s_t,a_t), the probability of a trajectory under policy πθ\pi_\theta is

p(τ;θ)=p(s0)t=0T1πθ(atst)p(st+1st,at).p(\tau;\theta) = p(s_0)\prod_{t=0}^{T-1}\pi_\theta(a_t\mid s_t)\,p(s_{t+1}\mid s_t,a_t).

Only the policy term depends on θ\theta.

The log-likelihood trick (REINFORCE trick)

The identity we will use comes from the chain rule applied to logp\log p:

θlogp(x;θ)=θp(x;θ)p(x;θ)θp(x;θ)=p(x;θ)θlogp(x;θ).\nabla_\theta \log p(x;\theta) = \frac{\nabla_\theta p(x;\theta)}{p(x;\theta)} \quad\Longleftrightarrow\quad \nabla_\theta p(x;\theta) = p(x;\theta)\,\nabla_\theta \log p(x;\theta).

The right-hand form is the useful one: it lets us turn a gradient of pp (which we don’t have a handle on) into a gradient of logp\log p weighted by pp — and weighting by pp is exactly what turns a sum into an expectation under πθ\pi_\theta.

Writing the expectation in (6) explicitly as a sum over trajectories,

J(θ)=τp(τ;θ)R(τ),J(\theta) = \sum_\tau p(\tau;\theta)\,R(\tau),

we differentiate and apply the identity above:

θJ(θ)=τθp(τ;θ)R(τ)=τp(τ;θ)θlogp(τ;θ)R(τ)=Eτπθ[θlogp(τ;θ)R(τ)].\begin{align} \nabla_\theta J(\theta) &= \sum_\tau \nabla_\theta p(\tau;\theta)\,R(\tau) \\ &= \sum_\tau p(\tau;\theta)\,\nabla_\theta \log p(\tau;\theta)\,R(\tau) \\ &= \mathbb{E}_{\tau\sim \pi_\theta}\left[\nabla_\theta \log p(\tau;\theta)\,R(\tau)\right]. \end{align}

Now expand the log of (7), using log(ab)=loga+logb\log(ab) = \log a + \log b to turn the product into a sum:

logp(τ;θ)=logp(s0)+t=0T1logπθ(atst)+t=0T1logp(st+1st,at).\log p(\tau;\theta) = \log p(s_0) + \sum_{t=0}^{T-1}\log \pi_\theta(a_t\mid s_t) + \sum_{t=0}^{T-1}\log p(s_{t+1}\mid s_t,a_t).

The first and last terms do not depend on θ\theta, so their gradients vanish. Therefore:

θlogp(τ;θ)=t=0T1θlogπθ(atst).\nabla_\theta \log p(\tau;\theta) = \sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t\mid s_t).

Putting everything together gives the fundamental policy-gradient estimator:

θJ(θ)=Eτπθ[t=0T1θlogπθ(atst)R(τ)].\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim \pi_\theta}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,R(\tau)\right].

This is the REINFORCE estimator introduced by Williams (1992). It is unbiased, but (as we will discuss) it can have high variance.

Causality: from R(τ)R(\tau) to return-to-go

A small but important refinement: although R(τ)R(\tau) multiplies every time step in the sum above, rewards received before time tt are determined by actions taken before ata_t and so are independent of ata_t. Their expected gradient contribution therefore vanishes, and we may replace R(τ)R(\tau) at each time step by the return-to-go

Gt=k=tT1γktrk+1,G_t = \sum_{k=t}^{T-1}\gamma^{k-t} r_{k+1},

without introducing bias, while reducing variance because we no longer multiply each log-probability by rewards independent of the action being scored Sutton & Barto, 2018. This yields the form actually used in practice:

θJ(θ)=Eτπθ[t=0T1θlogπθ(atst)Gt].\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim \pi_\theta}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,G_t\right].

The REINFORCE Algorithm (Monte-Carlo Policy Gradient)

REINFORCE is the “purest” policy-gradient method: it uses Monte-Carlo estimation from complete episodes to compute returns GtG_t, then performs a gradient ascent step with learning rate α>0\alpha > 0. Replacing the expectation with a single sampled trajectory and using the gradient form derived above, the REINFORCE update is

θθ+αt=0T1θlogπθ(atst)Gt.\theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,G_t.

Intuition: θlogπθ(atst)\nabla_\theta \log \pi_\theta(a_t\mid s_t) points in the direction that increases the probability of taking the sampled action ata_t at state sts_t. If GtG_t is large, we push that action to become more likely in the future; if GtG_t is small (or negative), we push it down Williams, 1992Sutton & Barto, 2018.

The REINFORCE training loop: collect a full episode using \pi_\theta, compute returns G_t, and perform gradient ascent on \sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)G_t.

Figure 4:The REINFORCE training loop: collect a full episode using πθ\pi_\theta, compute returns GtG_t, and perform gradient ascent on tθlogπθ(atst)Gt\sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)G_t.

Putting these pieces together gives the REINFORCE algorithm of Williams (1992):

  1. Initialize policy parameters θ\theta (e.g. randomly) and choose a learning rate α>0\alpha > 0.

  2. For each episode:

    1. Generate a full trajectory τ=(s0,a0,r1,s1,a1,r2,,sT1,aT1,rT)\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{T-1}, a_{T-1}, r_T) by sampling atπθ(st)a_t \sim \pi_\theta(\cdot\mid s_t) until termination.

    2. For each time step t=0,1,,T1t = 0, 1, \ldots, T-1:

      1. Compute the return-to-go Gtk=tT1γktrk+1G_t \leftarrow \sum_{k=t}^{T-1}\gamma^{k-t} r_{k+1}.

    3. Update the parameters with one gradient-ascent step:

      θθ+αt=0T1θlogπθ(atst)Gt.\theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,G_t.

Reducing Variance with a Baseline

The REINFORCE estimator is unbiased but, as noted, can have high variance. A simple and powerful variance-reduction trick is to subtract a baseline b(st)b(s_t) from each return:

θJ(θ)=Eτπθ[t=0T1θlogπθ(atst)(Gtb(st))].\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim \pi_\theta}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,(G_t - b(s_t))\right].

The baseline must depend only on the state, not on the action ata_t. Under that condition it leaves the gradient unbiased:

Eaπθ(s)[θlogπθ(as)b(s)]=b(s)θaπθ(as)=b(s)θ1=0,\mathbb{E}_{a\sim\pi_\theta(\cdot\mid s)}\left[\nabla_\theta \log \pi_\theta(a\mid s)\,b(s)\right] = b(s)\,\nabla_\theta \sum_a \pi_\theta(a\mid s) = b(s)\,\nabla_\theta 1 = 0,

since the score function has zero mean under πθ\pi_\theta. A natural choice is the state-value function b(s)=Vπθ(s)b(s) = V^{\pi_\theta}(s), in which case GtVπθ(st)G_t - V^{\pi_\theta}(s_t) becomes an estimate of the advantage Aπθ(st,at)A^{\pi_\theta}(s_t, a_t). Replacing the Monte-Carlo return GtG_t with a learned estimate of the advantage is exactly the step that takes us from REINFORCE to actor-critic methods Greensmith et al., 2004.

Practical Considerations (What People Actually Do)

Even for REINFORCE, implementations typically add a few pragmatic ingredients:

Limitations

REINFORCE makes the core idea of policy gradients visible, but it is rarely used as a final algorithm:

Summary

This chapter introduced policy-based reinforcement learning and derived the REINFORCE algorithm:

In the next chapter we will generalize this idea via the policy gradient theorem and replace the Monte-Carlo return with a learned critic, leading to actor-critic methods.

References
  1. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous Control with Deep Reinforcement Learning. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1509.02971
  2. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
  3. Chrisman, L. (1992). Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach. Proceedings of the Tenth National Conference on Artificial Intelligence, 183–188. https://aaai.org/papers/00183-aaai92-029-reinforcement-learning-with-perceptual-aliasing-the-perceptual-distinctions-approach/
  4. Singh, S. P., Jaakkola, T., & Jordan, M. I. (1994). Learning Without State-Estimation in Partially Observable Markovian Decision Processes. In W. W. Cohen & H. Hirsh (Eds.), Proceedings of the Eleventh International Conference on Machine Learning (ICML) (pp. 284–292). Morgan Kaufmann. http://www.eecs.umich.edu/~baveja/Papers/ML94.pdf
  5. Hugging Face. (2024). Hugging Face Deep Reinforcement Learning Course. Online course. https://huggingface.co/learn/deep-rl-course/en/unit0/introduction
  6. Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3–4), 229–256. 10.1007/BF00992696
  7. Watkins, C. J. C. H., & Dayan, P. (1992). Q-Learning. Machine Learning, 8(3–4), 279–292. 10.1007/BF00992698
  8. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-Level Control through Deep Reinforcement Learning. Nature, 518(7540), 529–533. 10.1038/nature14236
  9. Salimans, T., Ho, J., Chen, X., Sidor, S., & Sutskever, I. (2017). Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv Preprint arXiv:1703.03864. https://arxiv.org/abs/1703.03864
  10. Mania, H., Guy, A., & Recht, B. (2018). Simple Random Search of Static Linear Policies is Competitive for Reinforcement Learning. Advances in Neural Information Processing Systems (NeurIPS), 31. https://proceedings.neurips.cc/paper_files/paper/2018/hash/7634ea65a4e6d9041cfd3f7de18e334a-Abstract.html
  11. Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning. Journal of Machine Learning Research, 5, 1471–1530. https://www.jmlr.org/papers/v5/greensmith04a.html