REINFORCE

In the previous chapter we scaled value-based reinforcement learning to high-dimensional observations by approximating $Q(s,a)$ with a neural network. This unlocks impressive applications, but it also exposes a structural limitation of value-based control: in order to act, we must solve

\arg\max_a Q_\theta(s,a),

(1)

which is straightforward only when the action space is small and discrete. For continuous control (robot torque commands, steering angles, throttle), the maximization becomes a nontrivial optimization problem at every time step Lillicrap et al., 2016; Sutton & Barto, 2018. Moreover, value-based methods typically induce a deterministic policy at exploitation time (greedy), or one with only fixed-rate uniform exploration ($\epsilon$-greedy), which can be brittle under partial observability or perceptual aliasing, where identical observations require different behaviors and stochastic policies can strictly dominate deterministic ones Chrisman, 1992; Singh et al., 1994; Sutton & Barto, 2018.

This chapter introduces the complementary viewpoint: policy-based reinforcement learning, where we represent the policy explicitly as a parameterized distribution $\pi_\theta(a\mid s)$ and optimize it directly to maximize expected return Sutton & Barto, 2018; Hugging Face, 2024. We will derive the fundamental gradient estimator behind modern policy optimization, and we will study the simplest policy-gradient algorithm, REINFORCE Williams, 1992.


Value-Based vs. Policy-Based Methods

Let’s recap the two algorithm families in model-free deep RL Sutton & Barto, 2018; Hugging Face, 2024:

Value-based methods learn a value function, such as $Q_\theta(s,a)$, and derive a policy implicitly, e.g. by acting greedily. Q-Learning and DQN are canonical examples Watkins & Dayan, 1992; Mnih et al., 2015.
Policy-based methods parameterize the policy $\pi_\theta$ directly and optimize the parameters to maximize expected return.

The key shift is that in policy-based methods, the policy is not “read off” from a value function. It is a first-class object we optimize.

Figure 2: Value-based control (left) selects actions by comparing action-values, while policy-based control (right) samples actions directly from a parameterized distribution $\pi_\theta(a\mid s)$.

Policy-Based vs. Policy-Gradient Methods

Policy-based methods are a broad class. Some approaches search policy space without gradients, e.g. evolution strategies Salimans et al., 2017 or random search Mania et al., 2018, which have been shown to be competitive with gradient-based deep RL on several continuous-control benchmarks. Policy-gradient methods are the subset that performs gradient ascent on an objective $J(\theta)$ using its gradient $\nabla_\theta J(\theta)$ Sutton & Barto, 2018; Hugging Face, 2024. REINFORCE is the simplest example: it uses Monte-Carlo returns from full episodes to estimate that gradient Williams, 1992.

Parameterizing a Stochastic Policy

Policy gradients are usually formulated for stochastic policies. For discrete actions, a neural network can output logits $z_\theta(s)\in\mathbb{R}^{|\mathcal{A}|}$ and define

\pi_\theta(a\mid s) = \mathrm{softmax}(z_\theta(s))_a.

(2)

The policy then samples an action $A_t \sim \pi_\theta(\cdot\mid S_t)$ during interaction, rather than selecting $\arg\max_a Q_\theta(S_t,a)$ .
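To make the sampling step concrete, here is a minimal sketch in plain Python. The logits are hard-coded stand-ins for the output $z_\theta(s)$ of a policy network, and the helper names `softmax` and `sample_action` are our own, not part of any library.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(logits, rng):
    """Draw an action index A ~ softmax(logits)."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0]  # hypothetical network output z_theta(s)
probs = softmax(logits)

# Empirical action frequencies should approach the softmax probabilities.
counts = [0] * len(logits)
for _ in range(10_000):
    counts[sample_action(logits, rng)] += 1
print("pi(a|s):    ", [round(p, 3) for p in probs])
print("frequencies:", [c / 10_000 for c in counts])
```

Unlike $\arg\max$, which would always return action 0 here, the sampled policy still selects the other actions with their assigned probabilities.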

For continuous actions, the same idea applies: the network outputs the parameters of a distribution, commonly a Gaussian with mean $\mu_\theta(s)$ and (diagonal) standard deviation $\sigma_\theta(s)$ :

\pi_\theta(a\mid s) = \mathcal{N}\!\left(a;\mu_\theta(s), \mathrm{diag}(\sigma_\theta(s)^2)\right).

(3)

We will return to the continuous case in later chapters when we discuss DDPG and PPO; for now, the key message is that the policy is differentiable with respect to $\theta$ even when the environment dynamics are not.
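As a small illustration of that differentiability, the sketch below evaluates the log-density of a diagonal Gaussian policy and its gradient with respect to the mean, $(a-\mu)/\sigma^2$. The mean and standard deviation are hard-coded placeholders for network outputs, and all function names are hypothetical.

```python
import math
import random


def gaussian_log_prob(action, mean, std):
    """log N(action; mean, diag(std^2)) for equal-length lists."""
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2.0 * math.pi)
        for a, m, s in zip(action, mean, std)
    )


def grad_log_prob_wrt_mean(action, mean, std):
    """Gradient of the log-density with respect to each mean component."""
    return [(a - m) / (s * s) for a, m, s in zip(action, mean, std)]


def sample_action(mean, std, rng):
    """Draw one action from the diagonal Gaussian policy."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]


rng = random.Random(0)
mean, std = [0.3, -1.0], [0.5, 2.0]  # hypothetical mu_theta(s), sigma_theta(s)
action = sample_action(mean, std, rng)
print("action:          ", action)
print("log pi(a|s):     ", gaussian_log_prob(action, mean, std))
print("d log pi / d mu: ", grad_log_prob_wrt_mean(action, mean, std))
```

In a real implementation the chain rule would carry this gradient further back through the network that produces $\mu_\theta(s)$; an automatic-differentiation library normally handles that step.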

Figure 3: A stochastic policy network. Given a state, a neural network outputs a distribution over actions. Discrete actions are typically produced via a softmax; continuous actions are often modeled with a Gaussian head.

Why Policy Gradients? Advantages and Disadvantages

Policy-gradient methods are not a strict replacement for value-based methods; they solve different pain points.

Advantages

They can learn stochastic policies. This provides “built-in” exploration (the policy itself randomizes) and can resolve perceptual aliasing where a deterministic policy would get stuck Hugging Face, 2024; Sutton & Barto, 2018.
They handle continuous and high-dimensional action spaces naturally. We output distribution parameters, rather than needing to compute $\max_a Q(s,a)$ over an uncountable action set Sutton & Barto, 2018.
They change action probabilities smoothly. Small parameter updates typically induce small changes in $\pi_\theta(a\mid s)$ , unlike the hard $\max$ operator that can flip a greedy action abruptly for tiny changes in $Q$ Hugging Face, 2024.
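The last point can be illustrated with two nearly tied scores. In this sketch, which uses made-up numbers, the same values are read once as action-values for a greedy policy and once as logits for a softmax policy; this double use is an assumption made purely for the comparison.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(values):
    """Index of the largest score (the argmax action)."""
    return max(range(len(values)), key=lambda a: values[a])


before = [1.000, 1.001]  # hypothetical scores before an update
after = [1.002, 1.001]   # the first score moved by only 0.002

print("greedy action: ", greedy(before), "->", greedy(after))
print("softmax probs: ", softmax(before), "->", softmax(after))
```

The greedy choice switches from one action to the other, while the softmax probabilities both stay close to 0.5.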

Disadvantages

They can converge to local optima. The objective is generally non-convex in $\theta$ Sutton & Barto, 2018.
They can be sample-inefficient. Many classical policy-gradient algorithms are on-policy and discard data after each update.
They can have high-variance gradient estimates. REINFORCE in particular uses an unbiased gradient estimator with notoriously high variance, which motivates baselines and critics Williams, 1992; Greensmith et al., 2004.

The Policy Objective

We want to maximize expected return. Let a trajectory be

\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{T-1}, a_{T-1}, r_T),

(4)

generated by interaction with an environment when following $\pi_\theta$ . Define the (discounted) return of the trajectory as

R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_{t+1}.

(5)

The standard objective is the expected return:

J(\theta) = \mathbb{E}_{\tau\sim \pi_\theta}\left[ R(\tau) \right].

(6)

This is exactly the quantity we care about: how well the agent does on average when sampling actions from $\pi_\theta$ .
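As a quick worked example, the sketch below computes $R(\tau)$ for a reward sequence and averages it over a few trajectories to get a Monte-Carlo estimate of $J(\theta)$. The reward sequences are made up, and the function names are our own.

```python
def discounted_return(rewards, gamma):
    """R(tau) = sum_t gamma^t * r_{t+1}, where rewards[t] stores r_{t+1}."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def estimate_objective(reward_sequences, gamma):
    """Monte-Carlo estimate of J(theta): the mean return over sampled trajectories."""
    returns = [discounted_return(rewards, gamma) for rewards in reward_sequences]
    return sum(returns) / len(returns)


# Two hypothetical trajectories, assumed to be sampled from pi_theta.
sampled_rewards = [[1.0, 1.0, 1.0], [0.0, 0.0, 4.0]]
print(discounted_return(sampled_rewards[0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
print(estimate_objective(sampled_rewards, gamma=0.5))    # (1.75 + 1.0) / 2 = 1.375
```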

From $J(\theta)$ to a Usable Gradient

At first glance, differentiating $J(\theta)$ looks hopeless: $J$ depends on the probability of trajectories, which depends not only on the policy but also on the environment dynamics. The core insight behind policy gradients is that we can estimate $\nabla_\theta J(\theta)$ without differentiating the environment.

Trajectory probability

For an MDP with initial state distribution $p(s_0)$ and transition dynamics $p(s_{t+1}\mid s_t,a_t)$ , the probability of a trajectory under policy $\pi_\theta$ is

p(\tau;\theta) = p(s_0)\prod_{t=0}^{T-1}\pi_\theta(a_t\mid s_t)\,p(s_{t+1}\mid s_t,a_t).

(7)

Only the policy term depends on $\theta$ .

The log-likelihood trick (REINFORCE trick)

The identity we will use comes from the chain rule applied to $\log p$ :

\nabla_\theta \log p(x;\theta) = \frac{\nabla_\theta p(x;\theta)}{p(x;\theta)} \quad\Longleftrightarrow\quad \nabla_\theta p(x;\theta) = p(x;\theta)\,\nabla_\theta \log p(x;\theta).

(8)

The right-hand form is the useful one: it lets us turn a gradient of $p$ (which we don’t have a handle on) into a gradient of $\log p$ weighted by $p$ — and weighting by $p$ is exactly what turns a sum into an expectation under $\pi_\theta$ .

Writing the expectation in (6) explicitly as a sum over trajectories,

J(\theta) = \sum_\tau p(\tau;\theta)\,R(\tau),

(9)

we differentiate and apply the identity above:

\begin{align} \nabla_\theta J(\theta) &= \sum_\tau \nabla_\theta p(\tau;\theta)\,R(\tau) \\ &= \sum_\tau p(\tau;\theta)\,\nabla_\theta \log p(\tau;\theta)\,R(\tau) \\ &= \mathbb{E}_{\tau\sim \pi_\theta}\left[\nabla_\theta \log p(\tau;\theta)\,R(\tau)\right]. \end{align}

(10)
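The identity can be checked numerically on a case small enough to enumerate. The sketch below assumes a toy setting of our own choosing: three possible outcomes $x$ with softmax probabilities $p(x;\theta)$ and fixed returns $R(x)$. It compares the score-function form $\sum_x p\,\nabla_\theta \log p\,R$ with a finite-difference gradient of $J(\theta)$.

```python
import math


def softmax(theta):
    shift = max(theta)
    exps = [math.exp(t - shift) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]


def objective(theta, returns):
    """J(theta) = sum_x p(x; theta) * R(x)."""
    return sum(p * r for p, r in zip(softmax(theta), returns))


def score_function_gradient(theta, returns):
    """sum_x p(x) * grad log p(x) * R(x), using d/dtheta_i log p(x) = 1[i == x] - p_i."""
    probs = softmax(theta)
    n = len(theta)
    grad = [0.0] * n
    for x in range(n):
        for i in range(n):
            score = (1.0 if i == x else 0.0) - probs[i]
            grad[i] += probs[x] * score * returns[x]
    return grad


def finite_difference_gradient(theta, returns, eps=1e-6):
    """Central differences on J(theta), used only as an independent check."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((objective(up, returns) - objective(down, returns)) / (2 * eps))
    return grad


theta = [0.2, -0.4, 1.0]   # hypothetical parameters
returns = [1.0, 5.0, -2.0]  # hypothetical returns R(x)
print(score_function_gradient(theta, returns))
print(finite_difference_gradient(theta, returns))
```

The two printed vectors should agree up to finite-difference error. In reinforcement learning the sum over trajectories cannot be enumerated, which is why the expectation form matters: it can be approximated from samples.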

Now expand the log of (7), using $\log(ab) = \log a + \log b$ to turn the product into a sum:

\log p(\tau;\theta) = \log p(s_0) + \sum_{t=0}^{T-1}\log \pi_\theta(a_t\mid s_t) + \sum_{t=0}^{T-1}\log p(s_{t+1}\mid s_t,a_t).

(11)

The first and last terms do not depend on $\theta$ , so their gradients vanish. Therefore:

\nabla_\theta \log p(\tau;\theta) = \sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t\mid s_t).

(12)

Putting everything together gives the fundamental policy-gradient estimator:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim \pi_\theta}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,R(\tau)\right].

(13)

This is the REINFORCE estimator introduced by Williams (1992). It is unbiased, but (as we will discuss) it can have high variance.

Causality: from $R(\tau)$ to return-to-go

A small but important refinement: although $R(\tau)$ multiplies every time step in the sum above, the rewards $r_1, \ldots, r_t$ are received before the action $a_t$ is taken and so cannot depend on it. Their expected gradient contribution therefore vanishes, and we may replace $R(\tau)$ at each time step by the return-to-go

G_t = \sum_{k=t}^{T-1}\gamma^{k-t} r_{k+1},

(14)

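without introducing bias when $\gamma = 1$, while reducing variance because we no longer multiply each log-probability by rewards independent of the action being scored Sutton & Barto, 2018. For $\gamma < 1$, removing the past rewards from $R(\tau)$ in (5) strictly leaves $\gamma^t G_t$ rather than $G_t$; the factor $\gamma^t$ is commonly dropped in implementations, in which case the expression below approximates the gradient of (6). This yields the form actually used in practice: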

\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim \pi_\theta}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,G_t\right].

(15)
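All returns-to-go of an episode can be computed in one backward pass using the recursion $G_t = r_{t+1} + \gamma G_{t+1}$, which follows directly from the definition of $G_t$. A minimal sketch, with a function name of our own choosing:

```python
def returns_to_go(rewards, gamma):
    """G_t = sum_{k >= t} gamma^(k - t) * r_{k+1}, where rewards[t] stores r_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each G_t reuses the already computed G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(returns_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Note that $G_0$ equals the full trajectory return $R(\tau)$, while later entries ignore the rewards that came before them.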

The REINFORCE Algorithm (Monte-Carlo Policy Gradient)

REINFORCE is the “purest” policy-gradient method: it uses Monte-Carlo estimation from complete episodes to compute returns $G_t$ , then performs a gradient ascent step with learning rate $\alpha > 0$ . Replacing the expectation with a single sampled trajectory and using the gradient form derived above, the REINFORCE update is

\theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,G_t.

(16)

Intuition: $\nabla_\theta \log \pi_\theta(a_t\mid s_t)$ points in the direction that increases the probability of taking the sampled action $a_t$ in state $s_t$. The return $G_t$ scales that step: a positive $G_t$ makes the action more likely, the more so the larger $G_t$ is, while a negative $G_t$ makes it less likely Williams, 1992; Sutton & Barto, 2018.
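This intuition can be checked on a single state with a softmax policy whose parameters are the logits themselves (a simplifying assumption for this sketch). In that case the score has the closed form $\nabla_z \log \pi(a) = \mathbf{1}_a - \pi$, and one update step with a positive or negative return moves the probability of the sampled action up or down.

```python
import math


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_score(logits, action):
    """Gradient of log softmax(logits)[action] w.r.t. the logits: onehot(action) - probs."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_step(logits, action, g, alpha):
    """One update: logits <- logits + alpha * g * grad log pi(action)."""
    score = softmax_score(logits, action)
    return [z + alpha * g * s for z, s in zip(logits, score)]


logits = [0.0, 0.0, 0.0]  # hypothetical starting point: uniform policy
action = 1                # the sampled action being scored
for g in (2.0, 0.5, -2.0):
    updated = reinforce_step(logits, action, g, alpha=0.1)
    print(f"G = {g:+.1f}: pi(a) goes from {softmax(logits)[action]:.4f} "
          f"to {softmax(updated)[action]:.4f}")
```

A small positive return still raises the probability of the action, only by less than a large one; the probability falls only when the weight multiplying the score is negative.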

Figure 4: The REINFORCE training loop: collect a full episode using $\pi_\theta$, compute returns $G_t$, and perform gradient ascent on $\sum_t \nabla_\theta \log \pi_\theta(a_t\mid s_t)G_t$.

Putting these pieces together gives the REINFORCE algorithm of Williams (1992):

Initialize policy parameters $\theta$ (e.g. randomly) and choose a learning rate $\alpha > 0$ .
For each episode:
1. Generate a full trajectory $\tau = (s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{T-1}, a_{T-1}, r_T)$ by sampling $a_t \sim \pi_\theta(\cdot\mid s_t)$ until termination.
2. For each time step $t = 0, 1, \ldots, T-1$ :
  1. Compute the return-to-go $G_t \leftarrow \sum_{k=t}^{T-1}\gamma^{k-t} r_{k+1}$ .
3. Update the parameters with one gradient-ascent step:
  $\theta \leftarrow \theta + \alpha \sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,G_t.$
  (17)
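The steps above can be sketched end to end in plain Python. Everything environment-specific here is an assumption made for illustration: a hypothetical three-state chain in which the agent either advances or quits, a reward of 1 only for advancing from the last state, and a tabular softmax policy with one pair of logits per state instead of a neural network.

```python
import math
import random

N_STATES = 3          # hypothetical toy chain with states 0, 1, 2
ADVANCE, QUIT = 0, 1  # action indices


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def run_episode(theta, rng):
    """Step 1: sample a full trajectory; rewards[t] stores r_{t+1}."""
    states, actions, rewards = [], [], []
    s = 0
    while True:
        a = rng.choices([ADVANCE, QUIT], weights=softmax(theta[s]), k=1)[0]
        last = s == N_STATES - 1
        states.append(s)
        actions.append(a)
        rewards.append(1.0 if (a == ADVANCE and last) else 0.0)
        if a == QUIT or last:
            return states, actions, rewards
        s += 1


def returns_to_go(rewards, gamma):
    """Step 2: compute G_t for every time step with a backward pass."""
    returns, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce(num_episodes=3000, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]  # tabular logits per state
    for _ in range(num_episodes):
        states, actions, rewards = run_episode(theta, rng)
        returns = returns_to_go(rewards, gamma)
        # Step 3: accumulate sum_t grad log pi(a_t|s_t) * G_t, then take one step.
        grad = [[0.0, 0.0] for _ in range(N_STATES)]
        for s, a, g in zip(states, actions, returns):
            probs = softmax(theta[s])
            for i in range(2):
                grad[s][i] += ((1.0 if i == a else 0.0) - probs[i]) * g
        for s in range(N_STATES):
            for i in range(2):
                theta[s][i] += alpha * grad[s][i]
    return theta


theta = reinforce()
for s in range(N_STATES):
    print(f"state {s}: pi(advance) = {softmax(theta[s])[ADVANCE]:.3f}")
```

Because only fully successful episodes earn a nonzero return in this toy task, every update raises the probability of advancing, so the learned policy should favor advancing in all three states. With a neural-network policy, the inner gradient loop would typically be replaced by automatic differentiation of $-\sum_t \log \pi_\theta(a_t\mid s_t)\,G_t$.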

Reducing Variance with a Baseline

The REINFORCE estimator is unbiased but, as noted, can have high variance. A simple and powerful variance-reduction trick is to subtract a baseline $b(s_t)$ from each return:

\nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim \pi_\theta}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,(G_t - b(s_t))\right].

(18)

The baseline must depend only on the state, not on the action $a_t$ . Under that condition it leaves the gradient unbiased:

\mathbb{E}_{a\sim\pi_\theta(\cdot\mid s)}\left[\nabla_\theta \log \pi_\theta(a\mid s)\,b(s)\right] = b(s)\,\nabla_\theta \sum_a \pi_\theta(a\mid s) = b(s)\,\nabla_\theta 1 = 0,

(19)

since the score function has zero mean under $\pi_\theta$ . A natural choice is the state-value function $b(s) = V^{\pi_\theta}(s)$ , in which case $G_t - V^{\pi_\theta}(s_t)$ becomes an estimate of the advantage $A^{\pi_\theta}(s_t, a_t)$ . Replacing the Monte-Carlo return $G_t$ with a learned estimate of the advantage is exactly the step that takes us from REINFORCE to actor-critic methods Greensmith et al., 2004.
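Both claims, unchanged expectation and lower variance, can be verified exactly on a problem small enough to enumerate. The sketch below assumes a single state with three actions, a uniform softmax policy, and made-up deterministic rewards, so that the return of action $a$ is just its reward; it compares the per-sample gradient $\nabla_\theta \log \pi_\theta(a)\,(R(a) - b)$ for $b = 0$ and $b = V^{\pi_\theta}(s)$.

```python
import math


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_mean_and_variance(logits, rewards, baseline):
    """Exact mean and total variance of score(a) * (R(a) - baseline) under a ~ pi."""
    probs = softmax(logits)
    n = len(logits)
    samples = []
    for a in range(n):
        weight = rewards[a] - baseline
        samples.append([((1.0 if i == a else 0.0) - probs[i]) * weight for i in range(n)])
    mean = [sum(probs[a] * samples[a][i] for a in range(n)) for i in range(n)]
    variance = sum(
        probs[a] * sum((samples[a][i] - mean[i]) ** 2 for i in range(n))
        for a in range(n)
    )
    return mean, variance


logits = [0.0, 0.0, 0.0]      # hypothetical uniform policy
rewards = [10.0, 11.0, 12.0]  # hypothetical deterministic rewards
value = sum(p * r for p, r in zip(softmax(logits), rewards))  # V(s) under pi

mean_plain, var_plain = gradient_mean_and_variance(logits, rewards, baseline=0.0)
mean_base, var_base = gradient_mean_and_variance(logits, rewards, baseline=value)
print("no baseline: mean", mean_plain, "variance", var_plain)
print("b = V(s):    mean", mean_base, "variance", var_base)
```

The two means coincide, while the variance drops sharply once the large common offset in the rewards is subtracted. The size of the reduction depends on the reward scale chosen for this example.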

Practical Considerations (What People Actually Do)

Even for REINFORCE, implementations typically add a few pragmatic ingredients:

Batching episodes. Gradient estimates become more stable if we average over multiple trajectories before updating.
Entropy regularization. Adding an entropy bonus encourages exploration and prevents premature collapse to a near-deterministic policy.
Return normalization. Normalizing the batch of returns (or advantages) to zero mean and unit variance is a common heuristic that can improve optimization stability.
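Two of these ingredients are simple enough to sketch directly. The helpers below are illustrative only: `normalize` standardizes a batch of returns, and `entropy` computes the quantity that an entropy bonus would add to the objective, scaled by a coefficient. The small constant `eps` and both function names are our own choices.

```python
import math


def normalize(values, eps=1e-8):
    """Shift and scale a batch of returns to zero mean and (roughly) unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / (std + eps) for v in values]  # eps guards against std == 0


def entropy(probs):
    """Entropy of a categorical distribution, -sum_a p(a) log p(a)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


batch_returns = [12.0, 15.0, 9.0, 30.0]  # hypothetical returns from several episodes
print(normalize(batch_returns))
print(entropy([0.5, 0.5]), entropy([0.99, 0.01]))  # a near-deterministic policy has low entropy
```

Note that normalizing returns subtracts a batch-dependent mean and rescales the gradient, so it acts as a heuristic baseline and step-size adjustment rather than an exact variance-reduction method.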

Limitations

REINFORCE makes the core idea of policy gradients visible, but it is rarely used as a final algorithm:

Its gradient estimates can have very high variance, even with baselines Greensmith et al., 2004.
It is on-policy: trajectories collected under the current $\pi_\theta$ are discarded after each update, so it can be data-hungry.

Summary

This chapter introduced policy-based reinforcement learning and derived the REINFORCE algorithm:

Policy-based methods optimize a parameterized policy $\pi_\theta(a\mid s)$ directly, rather than deriving it from a learned value function.
Policy-gradient methods perform gradient ascent on expected return $J(\theta)$ using sampled trajectories.
Using the log-likelihood trick, the gradient depends only on $\nabla_\theta \log \pi_\theta(a\mid s)$ and not on differentiating the environment dynamics Williams, 1992.
REINFORCE updates the policy using Monte-Carlo returns (or returns-to-go) and is unbiased but high-variance.
Baselines preserve unbiasedness and can greatly reduce variance, motivating advantages and actor-critic methods Greensmith et al., 2004.

In the next chapter we will generalize this idea via the policy gradient theorem and replace the Monte-Carlo return with a learned critic, leading to actor-critic methods.

References

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous Control with Deep Reinforcement Learning. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1509.02971
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
Chrisman, L. (1992). Reinforcement Learning with Perceptual Aliasing: The Perceptual Distinctions Approach. Proceedings of the Tenth National Conference on Artificial Intelligence, 183–188. https://aaai.org/papers/00183-aaai92-029-reinforcement-learning-with-perceptual-aliasing-the-perceptual-distinctions-approach/
Singh, S. P., Jaakkola, T., & Jordan, M. I. (1994). Learning Without State-Estimation in Partially Observable Markovian Decision Processes. In W. W. Cohen & H. Hirsh (Eds.), Proceedings of the Eleventh International Conference on Machine Learning (ICML) (pp. 284–292). Morgan Kaufmann. http://www.eecs.umich.edu/~baveja/Papers/ML94.pdf
Hugging Face. (2024). Hugging Face Deep Reinforcement Learning Course. Online course. https://huggingface.co/learn/deep-rl-course/en/unit0/introduction
Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3–4), 229–256. https://doi.org/10.1007/BF00992696
Watkins, C. J. C. H., & Dayan, P. (1992). Q-Learning. Machine Learning, 8(3–4), 279–292. https://doi.org/10.1007/BF00992698
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-Level Control through Deep Reinforcement Learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236
Salimans, T., Ho, J., Chen, X., Sidor, S., & Sutskever, I. (2017). Evolution Strategies as a Scalable Alternative to Reinforcement Learning. arXiv Preprint arXiv:1703.03864. https://arxiv.org/abs/1703.03864
Mania, H., Guy, A., & Recht, B. (2018). Simple Random Search of Static Linear Policies is Competitive for Reinforcement Learning. Advances in Neural Information Processing Systems (NeurIPS), 31. https://proceedings.neurips.cc/paper_files/paper/2018/hash/7634ea65a4e6d9041cfd3f7de18e334a-Abstract.html
Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning. Journal of Machine Learning Research, 5, 1471–1530. https://www.jmlr.org/papers/v5/greensmith04a.html