Actor-Critic Methods
Policy Gradients with a Learned Critic — A2C and DDPG
The previous chapter showed that we can optimize a parameterized policy $\pi_\theta$ directly by gradient ascent on expected return, using the score-function estimator (Williams, 1992). REINFORCE is unbiased, conceptually clean, and its gradient can be computed without differentiating the environment. It also taught us, by example, why nobody ships REINFORCE in production: a single Monte-Carlo return $G_t$ is a very noisy estimate of how good action $a_t$ was, and that noise propagates into the gradient. We saw at the end of the last chapter that subtracting a state-only baseline $b(s_t)$ leaves the gradient unbiased while reducing variance, and that the natural baseline is the state-value function $V^{\pi_\theta}(s_t)$, which turns the multiplier $G_t - b(s_t)$ into an estimate of the advantage $A^{\pi_\theta}(s_t, a_t)$.
This chapter takes the next step: replace the Monte-Carlo return with a learned critic. The result is the actor-critic family, which sits at the heart of modern deep RL (Konda & Tsitsiklis, 1999; Sutton & Barto, 2018; Hugging Face, 2024). We will state the policy gradient theorem that makes this substitution legal in expectation, derive the simplest one-step actor-critic update, and then look at two of the most important deep instantiations: A2C (Mnih et al., 2016), for discrete and continuous control with stochastic policies, and DDPG (Lillicrap et al., 2016), built on the deterministic policy gradient theorem (Silver et al., 2014), for continuous control with a deterministic actor.
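The Variance Problem in REINFORCE
Recall the REINFORCE-with-baseline gradient estimator for one sampled episode of length $T$:

$$\hat{g} = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\bigl(G_t - b(s_t)\bigr), \qquad G_t = \sum_{k=t}^{T-1} \gamma^{\,k-t}\, r_k. \tag{1}$$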
The estimator is unbiased, but $G_t$ is a whole-episode sum of stochastic rewards. Two distinct sources of randomness pile up inside it: the environment’s transitions and rewards, and the policy’s own action sampling. Two trajectories that started in the same state and took the same action can produce wildly different returns, just because the rest of the episode unfolded differently. This is the well-documented variance problem of Monte-Carlo policy gradients (Williams, 1992; Greensmith et al., 2004; Sutton & Barto, 2018): driving the noise down requires many trajectories per gradient step, and that is precisely the sample inefficiency that motivates everything in this chapter.
A learned baseline helps, but only by removing the state-dependent part of the variance: the part of $G_t$ that is “expected before the action is taken.” It does nothing about the variance contributed by the rest of the trajectory after $a_t$. The natural next step is to replace the entire $G_t$ with a learned estimate.
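The Policy Gradient Theorem
The REINFORCE gradient is one form of $\nabla_\theta J(\theta)$: the one that uses a sampled return. There is a more general statement, the policy gradient theorem (Sutton et al., 2000), that holds for any sufficiently nice MDP and any differentiable parameterization $\pi_\theta$:

$$\nabla_\theta J(\theta) \;\propto\; \mathbb{E}_{s \sim d^{\pi_\theta},\; a \sim \pi_\theta(\cdot \mid s)}\bigl[\nabla_\theta \log \pi_\theta(a \mid s)\; Q^{\pi_\theta}(s, a)\bigr], \tag{2}$$

where $d^{\pi_\theta}$ is the (discounted) state-visitation distribution induced by $\pi_\theta$ and $Q^{\pi_\theta}$ is the action-value function under $\pi_\theta$. The Monte-Carlo estimator (1) is recovered by substituting $G_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t, a_t)$.
The theorem matters because the multiplier $Q^{\pi_\theta}(s, a)$ is a function of $(s, a)$ alone: it does not depend on the rest of the trajectory. If we had access to $Q^{\pi_\theta}$ exactly, we would get a much lower-variance estimator than $G_t$, at no cost in bias. We don’t, but we can learn an approximation.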
Subtracting a baseline gives the advantage form
Subtracting any function $b(s)$ that depends only on the state from the multiplier in (2) leaves the expectation unchanged, by exactly the same argument used for REINFORCE in the previous chapter. A natural and effective choice is $b(s) = V^{\pi_\theta}(s)$, which captures the state-dependent component of the return that every action shares. It yields the advantage form:

$$\nabla_\theta J(\theta) \;\propto\; \mathbb{E}_{s \sim d^{\pi_\theta},\; a \sim \pi_\theta(\cdot \mid s)}\bigl[\nabla_\theta \log \pi_\theta(a \mid s)\; A^{\pi_\theta}(s, a)\bigr], \qquad A^{\pi_\theta}(s, a) = Q^{\pi_\theta}(s, a) - V^{\pi_\theta}(s). \tag{3}$$

The advantage answers a more focused question than $Q^{\pi_\theta}$ alone: how much better is action $a$ than what the policy would do on average from state $s$? Centering by $V^{\pi_\theta}(s)$ removes the part of $Q^{\pi_\theta}(s, a)$ that is the same for every action and thus carries no useful gradient signal. $V^{\pi_\theta}$ is the de facto standard baseline in modern actor-critic methods, though it is not strictly variance-minimizing: Greensmith et al. (2004) show that the optimal scalar baseline is a score-norm-weighted version of $V^{\pi_\theta}$, and that even using the true value function as the critic can be suboptimal. In practice the simplification to $V^{\pi_\theta}$ is near-universal because it ties cleanly into the value-function machinery we already need.
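The baseline argument can be checked numerically on a toy problem. The sketch below assumes a single state with three actions, a softmax policy, and made-up action values (all hypothetical); it computes the exact expectations rather than sampling, and compares the second moment of the two estimators as a rough proxy for their variance.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())      # shift for numerical stability
    return z / z.sum()

def score(logits, a):
    """Gradient of log softmax(logits)[a] with respect to the logits."""
    g = -softmax(logits)
    g[a] += 1.0
    return g

# Hypothetical single-state problem: 3 actions with made-up action values.
logits = np.array([0.2, -0.5, 1.0])
q = np.array([1.0, 3.0, 2.0])
pi = softmax(logits)
v = float(pi @ q)                          # V(s) = sum_a pi(a|s) Q(s, a)

def expected_gradient(multiplier):
    """Exact E_{a ~ pi}[ grad log pi(a|s) * multiplier(a) ]."""
    return sum(pi[a] * score(logits, a) * multiplier[a] for a in range(3))

def second_moment(multiplier):
    """Exact E_{a ~ pi}[ || grad log pi(a|s) * multiplier(a) ||^2 ]."""
    return sum(pi[a] * np.sum((score(logits, a) * multiplier[a]) ** 2) for a in range(3))

grad_q = expected_gradient(q)                          # Q form, as in (2)
grad_adv = expected_gradient(q - v)                    # advantage form, as in (3)
baseline_term = expected_gradient(np.full(3, v))       # contribution of the baseline alone
```

Here `grad_q` and `grad_adv` coincide and `baseline_term` vanishes, while `second_moment(q - v)` is smaller than `second_moment(q)`: the same expected gradient, with less spread around it.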
Actor-Critic: The General Recipe
Actor-critic methods turn the advantage form (3) into an algorithm by parameterizing two networks:
- An actor $\pi_\theta(a \mid s)$: the policy we ultimately care about.
- A critic $V_\phi(s)$ (or sometimes $Q_\phi(s, a)$): a learned approximation of the policy’s value function.

Think of the actor as a player and the critic as a friend watching over their shoulder, telling them after each move whether things went better or worse than expected (Hugging Face, 2024). The actor uses that feedback to push the probability of good actions up and bad ones down; the critic, in parallel, fits its predictions to what actually happens.
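TD-bootstrapped advantage
We almost never have $A^{\pi_\theta}(s, a)$ in closed form. The simplest replacement is the one-step temporal-difference (TD) error:

$$\delta_t = r_t + \gamma\, V_\phi(s_{t+1}) - V_\phi(s_t). \tag{4}$$

Two facts make $\delta_t$ a useful surrogate for the advantage. First, conditioning on $(s_t, a_t)$ and substituting the true value function $V^{\pi_\theta}$ for $V_\phi$,

$$\mathbb{E}\bigl[\delta_t \mid s_t, a_t\bigr] = \mathbb{E}\bigl[r_t + \gamma\, V^{\pi_\theta}(s_{t+1}) \mid s_t, a_t\bigr] - V^{\pi_\theta}(s_t) = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t) = A^{\pi_\theta}(s_t, a_t), \tag{5}$$

so $\delta_t$ is a one-sample estimate of the advantage. Second, it uses a bootstrap: the right-hand side mixes one real reward with the critic’s own prediction at the next state. Bootstrapping introduces bias when $V_\phi \neq V^{\pi_\theta}$, but it dramatically reduces variance compared to the full Monte-Carlo return $G_t$, which sums many noisy reward terms. This is the same bias/variance trade-off that distinguishes TD from MC for value-based methods (Sutton & Barto, 2018), now applied to the critic of a policy-gradient method.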
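A small tabular check can make both facts concrete. The sketch below assumes a hypothetical 2-state, 2-action MDP with made-up transition probabilities, rewards, and policy; it solves for the exact value function, then compares the expected TD error against the true advantage for an exact critic and for a deliberately perturbed one.

```python
import numpy as np

def td_error(r, v_s, v_next, gamma, done=False):
    """One-step TD error; the bootstrap term is dropped at terminal states."""
    return r + gamma * (0.0 if done else v_next) - v_s

# Hypothetical 2-state, 2-action MDP (all numbers are made up).
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # P[s, a, s']
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],                   # R[s, a]
              [0.0, 2.0]])
pi = np.array([[0.6, 0.4],                  # pi[s, a]
               [0.3, 0.7]])

# Exact V^pi from the Bellman equation V = r_pi + gamma * P_pi V.
r_pi = (pi * R).sum(axis=1)
P_pi = np.einsum("sa,sat->st", pi, P)
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
Q = R + gamma * P @ V                       # Q[s, a]
A = Q - V[:, None]                          # true advantage

def expected_delta(values):
    """E[delta | s, a] under the true dynamics, using `values` as the critic."""
    out = np.zeros((2, 2))
    for s in range(2):
        for a in range(2):
            out[s, a] = sum(P[s, a, t] * td_error(R[s, a], values[s], values[t], gamma)
                            for t in range(2))
    return out

exact = expected_delta(V)                             # critic equal to V^pi
biased = expected_delta(V + np.array([0.5, -0.5]))    # deliberately wrong critic
```

With the exact critic, `exact` matches `A` entry by entry; with the perturbed critic, `biased` does not, which is the bootstrap bias in miniature.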
Figure 2: The actor-critic architecture. The actor samples actions; the critic scores states. The bootstrap disagreement between the critic’s two predictions and the observed reward (the TD error) provides both an advantage signal for the actor (blue update) and a regression target for the critic (orange update).
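One-step actor-critic update
With $\delta_t$ playing the role of $A^{\pi_\theta}(s_t, a_t)$, each environment step yields two updates:

$$\theta \leftarrow \theta + \alpha_\theta\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t), \tag{6}$$

$$\phi \leftarrow \phi + \alpha_\phi\, \delta_t\, \nabla_\phi V_\phi(s_t). \tag{7}$$

The actor update (6) has exactly the form of the REINFORCE update, but with $\delta_t$ in place of $G_t - b(s_t)$. The critic update (7) is semi-gradient TD(0): we move $V_\phi(s_t)$ toward the bootstrapped target $r_t + \gamma V_\phi(s_{t+1})$, treating the target as fixed.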
Putting these together gives the one-step actor-critic algorithm:
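Initialize actor parameters $\theta$ and critic parameters $\phi$. Choose learning rates $\alpha_\theta, \alpha_\phi$ and discount $\gamma$.
For each environment step:
1. Observe state $s_t$. Sample action $a_t \sim \pi_\theta(\cdot \mid s_t)$.
2. Execute $a_t$; observe reward $r_t$ and next state $s_{t+1}$.
3. Compute the TD error $\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$, taking $V_\phi(s_{t+1}) = 0$ if $s_{t+1}$ is terminal.
4. Update the actor: $\theta \leftarrow \theta + \alpha_\theta\, \delta_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.
5. Update the critic: $\phi \leftarrow \phi + \alpha_\phi\, \delta_t\, \nabla_\phi V_\phi(s_t)$.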
Figure 3: The one-step actor-critic loop. Each environment step produces a TD error that simultaneously serves as an advantage estimate for the actor (blue) and a regression residual for the critic (orange). Compare with the REINFORCE flow: we no longer wait for the episode to end before updating.
The loop runs entirely online — no episode boundary required, no Monte-Carlo return to compute. This is the structural change that makes actor-critic substantially more sample-efficient than REINFORCE.
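The loop can be sketched in a few lines for the tabular case. Everything specific here is an assumption made for illustration: the three-state corridor environment, the softmax-over-logits actor, the table-lookup critic, and the learning rates are hypothetical choices, not part of the algorithm’s definition.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def actor_critic_step(theta, v, s, a, r, s_next, done, alpha_theta, alpha_phi, gamma):
    """Apply one actor update and one critic update in place; return the TD error."""
    bootstrap = 0.0 if done else v[s_next]
    delta = r + gamma * bootstrap - v[s]
    grad_log_pi = -softmax(theta[s])       # gradient of log pi(a|s) w.r.t. theta[s] (softmax policy)
    grad_log_pi[a] += 1.0
    theta[s] += alpha_theta * delta * grad_log_pi
    v[s] += alpha_phi * delta              # tabular critic: gradient of V(s) w.r.t. v[s] is 1
    return delta

def corridor_step(s, a):
    """Hypothetical toy environment: states 0 and 1, plus a terminal state 2.
    Action 1 moves right, action 0 stays put; reaching state 2 pays reward 1."""
    s_next = s + 1 if a == 1 else s
    done = s_next == 2
    return s_next, (1.0 if done else 0.0), done

def train(num_steps=5000, alpha_theta=0.1, alpha_phi=0.2, gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((2, 2))               # actor: one logit per (state, action)
    v = np.zeros(3)                        # critic: one value per state (terminal entry unused)
    s = 0
    for _ in range(num_steps):
        a = rng.choice(2, p=softmax(theta[s]))
        s_next, r, done = corridor_step(s, a)
        actor_critic_step(theta, v, s, a, r, s_next, done, alpha_theta, alpha_phi, gamma)
        s = 0 if done else s_next          # start a new episode after termination
    return theta, v

theta, v = train()
```

Note that both updates use the same `delta`, computed before either parameter set changes, and that an update happens on every step rather than at episode end.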
A subtle point: this one-step actor-critic is already a complete algorithm. We could run it on a single environment, with single-sample gradients, and it would (slowly) work. The deep-RL twist that we will see next (collecting transitions in parallel from $N$ environments before each gradient step) is a practical fix for the noisy-single-sample problem when training neural networks, not part of the actor-critic recipe itself.
A spectrum of advantage estimators
The one-step TD error $\delta_t$ and the full Monte-Carlo return $G_t$ are the extremes of a continuum. Between them sit the $n$-step returns

$$G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k}\, r_{t+k} + \gamma^{n}\, V_\phi(s_{t+n}), \qquad \hat{A}_t^{(n)} = G_t^{(n)} - V_\phi(s_t), \tag{8}$$

and generalized advantage estimation (GAE) (Schulman et al., 2016), which exponentially averages all $n$-step advantages using a parameter $\lambda \in [0, 1]$. As $n$ (or $\lambda$) increases, the estimator uses more real reward and less bootstrap, trading bias down and variance up. Most modern implementations of A2C and PPO use GAE.
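GAE is usually computed with a single backward pass over a rollout, using the identity $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$. The sketch below is one common way to write that recursion; the function name, argument layout, and the `dones` convention are assumptions for illustration.

```python
import numpy as np

def gae_advantages(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one rollout of length T.

    rewards[t], values[t] = V(s_t), and dones[t] (1.0 if s_{t+1} is terminal)
    are length-T sequences; last_value = V(s_T) is the bootstrap value.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    dones = np.asarray(dones, dtype=float)
    T = len(rewards)
    next_values = np.append(values[1:], last_value)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # One-step TD error; no bootstrap across an episode boundary.
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        # Exponentially weighted sum of future TD errors, reset at terminals.
        running = delta + gamma * lam * not_done * running
        adv[t] = running
    return adv

# Hypothetical 3-step rollout.
rewards, values, last_value, dones = [1.0, 1.0, 1.0], [0.5, 0.4, 0.3], 0.2, [0.0, 0.0, 0.0]
one_step = gae_advantages(rewards, values, last_value, dones, gamma=0.9, lam=0.0)
full = gae_advantages(rewards, values, last_value, dones, gamma=0.9, lam=1.0)
```

With `lam=0.0` the result is the vector of one-step TD errors; with `lam=1.0` each entry equals the bootstrapped multi-step return to the end of the rollout minus $V_\phi(s_t)$.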
A2C: Advantage Actor-Critic
The most pedagogically clean deep actor-critic algorithm is A2C, the synchronous variant of A3C (“asynchronous advantage actor-critic”; Mnih et al., 2016). A3C used many independent worker processes that asynchronously updated a shared set of parameters with their own rollouts. A2C, popularized by OpenAI’s baselines release (OpenAI, 2017) and used as the policy-gradient backbone of ACKTR (Wu et al., 2017), instead runs $N$ environments in lockstep and aggregates their data into one synchronous gradient step per round. OpenAI (2017) report that the synchronous variant matches or exceeds A3C while making far better use of GPU batching.
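The algorithm
Each iteration:
1. Reset (if needed) the $N$ parallel environments. Run the current policy $\pi_\theta$ in all of them for $n$ steps, producing a batch of $N \times n$ transitions.
2. For each step $t$ in the batch, compute the bootstrapped $n$-step return target $\hat{G}_t$ and advantage estimate $\hat{A}_t$ using the current critic $V_\phi$. For a rollout covering steps $t_0, \dots, t_0 + n - 1$,

$$\hat{G}_t = \sum_{k=t}^{t_0+n-1} \gamma^{\,k-t}\, r_k + \gamma^{\,t_0+n-t}\, V_\phi(s_{t_0+n}), \qquad \hat{A}_t = \hat{G}_t - V_\phi(s_t), \tag{9}$$

   where the sum stops, and the bootstrap term is dropped, at a terminal state; or use the corresponding GAE estimate (Schulman et al., 2016).
3. Compute the combined loss, summed across the batch:

$$\mathcal{L}(\theta, \phi) = \sum_{t \in \text{batch}} \Bigl[ -\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t + c_v \bigl(\hat{G}_t - V_\phi(s_t)\bigr)^2 - c_e\, \mathcal{H}\bigl[\pi_\theta(\cdot \mid s_t)\bigr] \Bigr], \tag{10}$$

   where $\mathcal{H}[\pi_\theta(\cdot \mid s_t)]$ is the entropy of the action distribution (the differential entropy in the continuous case), and $\hat{A}_t$ and $\hat{G}_t$ are treated as constants when differentiating.
4. Take one gradient step on $\mathcal{L}$ with respect to $(\theta, \phi)$. The same updated parameters are then used for the next round of rollouts in all $N$ environments.

The coefficients $c_v, c_e$ are hyperparameters; common defaults inherited from A3C and the OpenAI baselines implementations are $c_v = 0.5$ and $c_e = 0.01$ (Mnih et al., 2016; OpenAI, 2017).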
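For discrete actions, the combined loss can be evaluated as in the sketch below. It is written in plain NumPy, so it only computes the loss value; in an autodiff framework the advantages and return targets would additionally be wrapped in a stop-gradient. The function name and argument layout are assumptions for illustration.

```python
import numpy as np

def a2c_loss(logits, actions, values, returns, c_v=0.5, c_e=0.01):
    """Combined A2C loss for a batch of B transitions with discrete actions.

    logits: (B, num_actions) actor outputs; actions: (B,) integer actions;
    values: (B,) critic outputs V(s_t); returns: (B,) bootstrapped return targets.
    """
    z = logits - logits.max(axis=1, keepdims=True)
    log_pi = z - np.log(np.exp(z).sum(axis=1, keepdims=True))   # log-softmax
    pi = np.exp(log_pi)

    advantages = returns - values          # constants as far as the policy term is concerned
    chosen = log_pi[np.arange(len(actions)), actions]

    policy_loss = -(chosen * advantages).sum()
    value_loss = ((returns - values) ** 2).sum()
    entropy = -(pi * log_pi).sum()         # summed over the batch
    return policy_loss + c_v * value_loss - c_e * entropy

# Hypothetical one-transition batch with a uniform policy over two actions.
loss = a2c_loss(np.zeros((1, 2)), np.array([0]), np.array([0.0]), np.array([1.0]))
```

The signs follow the convention of minimizing the loss: the policy term is negated because we ascend the policy gradient, and the entropy enters with a minus sign because it is a bonus.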
Figure 4: A2C runs $N$ environments synchronously, each taking $n$ steps with the current policy. The combined batch of $N \times n$ transitions yields one gradient update; the updated shared policy is then used by all environments for the next round. Parallelism alone provides the sample decorrelation that DQN gets from a replay buffer.
Why parallel environments?
A2C is on-policy: the gradient estimate implied by (10) is only valid if the data was collected under the current $\pi_\theta$. That rules out a DQN-style replay buffer of stale transitions. But on-policy training from a single environment leaves us with the same problem REINFORCE had: every gradient step depends on a small number of highly correlated trajectories. Running $N$ environments in parallel gives us a near-i.i.d.-looking batch at the current policy, which is enough to make stochastic gradient descent behave well (Mnih et al., 2016). It is also a particularly clean fit for modern accelerators: parallel rollouts can be batched into a single GPU forward pass.
The entropy bonus
The term $c_e\, \mathcal{H}[\pi_\theta(\cdot \mid s_t)]$ in (10) is a small positive reward for keeping the policy distribution spread out. Without it, an actor-critic agent that finds even a moderately good action tends to collapse to a near-deterministic policy too quickly, before the critic has had a chance to evaluate alternatives. The entropy bonus is a soft exploration mechanism; the same idea reappears prominently in PPO and SAC.
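To get a feel for the size of the bonus, the sketch below evaluates the entropy of a spread-out and a nearly collapsed categorical policy, plus the differential entropy of a one-dimensional Gaussian head. The helper names and the example probabilities are made up for illustration.

```python
import numpy as np

def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    probs = np.asarray(probs, dtype=float)
    nz = probs > 0                         # 0 * log 0 is taken to be 0
    return float(-(probs[nz] * np.log(probs[nz])).sum())

def gaussian_entropy(sigma):
    """Differential entropy (in nats) of a 1-D Gaussian with standard deviation sigma."""
    return float(0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2))

uniform = categorical_entropy([0.25, 0.25, 0.25, 0.25])   # maximal for 4 actions: log(4)
peaked = categorical_entropy([0.97, 0.01, 0.01, 0.01])    # close to deterministic
```

The uniform policy earns the largest bonus and the peaked one almost none, so the gradient of the entropy term pushes back against premature collapse. For a Gaussian head the entropy depends only on `sigma`, so the bonus acts by keeping the standard deviation from shrinking too fast.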
Continuous Control: The Deterministic Policy Gradient Theorem
A2C handles continuous actions by sampling from a Gaussian policy, just like the discrete case but with a different output head, and this works well in practice (PPO with Gaussian heads remains a strong baseline for continuous control). A different idea, going back to Silver et al. (2014), is to abandon stochastic policies for control and learn a deterministic mapping $a = \mu_\theta(s)$ directly, training it by gradient flow through a critic that knows how the value changes with the action. This trades the variance from action sampling for a different, often smaller, source of variance (an estimate of $\nabla_a Q(s, a)$) and, as we will see, makes off-policy training natural.
The deterministic actor stands in for exactly what was hard to do in DQN with continuous actions: solving $\arg\max_a Q(s, a)$ at every step. The deterministic policy gradient theorem makes the substitution work because the actor is differentiable, not because the argmax becomes any easier.
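The theorem
Let $\mu_\theta$ be a deterministic policy with parameters $\theta$, $Q^{\mu}$ the action-value function under $\mu_\theta$, and $\rho^{\mu}$ the discounted state-visitation distribution under $\mu_\theta$: the same object as $d^{\pi_\theta}$ in (2), written $\rho$ here to follow the notation of Silver et al. (2014). Under standard regularity conditions, Silver et al. (2014) show that

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s \sim \rho^{\mu}}\Bigl[\nabla_\theta \mu_\theta(s)\; \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)}\Bigr]. \tag{11}$$

Read (11) as a chain rule. Define a scalar objective $f(\theta; s) = Q^{\mu}(s, \mu_\theta(s))$, treating $Q^{\mu}$ as a fixed function of $(s, a)$. Differentiating through the actor’s output gives, pointwise in $s$,

$$\nabla_\theta f(\theta; s) = \nabla_\theta \mu_\theta(s)\; \nabla_a Q^{\mu}(s, a)\big|_{a = \mu_\theta(s)}, \tag{12}$$

which is exactly the integrand on the right-hand side of (11). The “policy gradient” reduces to: for each state, find the direction in action space that increases $Q^{\mu}$, then ask the actor’s parameters to move the action that way.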
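The chain-rule reading can be checked on a toy problem where everything is available in closed form. The sketch below assumes a linear actor and a fixed quadratic stand-in for the critic (both hypothetical, chosen so the gradients can be written by hand) and compares the chain-rule gradient with finite differences.

```python
import numpy as np

w = np.array([1.0, -2.0])                  # hypothetical: the critic prefers the action w @ s

def actor(theta, s):
    return s @ theta                       # deterministic linear actor mu_theta(s)

def critic(s, a):
    return -(a - s @ w) ** 2               # fixed stand-in for Q(s, a), peaked at a = w @ s

def dq_da(s, a):
    return -2.0 * (a - s @ w)              # analytic action-gradient of the critic

def dpg_gradient(theta, states):
    """Mean over states of grad_theta mu_theta(s) * grad_a Q(s, a) at a = mu_theta(s)."""
    a = actor(theta, states)               # shape (B,)
    return (states * dq_da(states, a)[:, None]).mean(axis=0)   # grad_theta mu_theta(s) = s

def objective(theta, states):
    return critic(states, actor(theta, states)).mean()

rng = np.random.default_rng(0)
states = rng.normal(size=(64, 2))          # a fixed batch of states
theta = np.zeros(2)

# Central finite differences of the objective, for comparison with the chain rule.
eps = 1e-6
fd = np.array([(objective(theta + eps * e, states) - objective(theta - eps * e, states)) / (2 * eps)
               for e in np.eye(2)])
analytic = dpg_gradient(theta, states)

# Gradient ascent along the DPG direction moves the actor toward the critic's preferred action.
for _ in range(200):
    theta = theta + 0.1 * dpg_gradient(theta, states)
```

`analytic` and `fd` agree up to finite-difference error, and after the ascent loop `theta` is close to `w`. No log-probability and no sampled action appear anywhere in the actor gradient.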
Off-policy is free
A second consequence: because $\mu_\theta$ is deterministic, the action that the behavior policy actually took does not appear in the gradient. Only the state appears, and the actor’s own current output $\mu_\theta(s)$ is plugged into $Q$. We can therefore evaluate the integrand on data collected by any behavior policy without an importance-sampling correction. That is the key to making the algorithm off-policy and replay-buffer-friendly, which is exactly what we want for sample efficiency in continuous control.
DDPG: Deep Deterministic Policy Gradient
DDPG is what you get when you combine the deterministic policy gradient theorem with the two stabilizing tricks from DQN (Lillicrap et al., 2016):
- A replay buffer of transitions $(s_t, a_t, r_t, s_{t+1})$, sampled uniformly at random for each update, which is possible because the DPG-based actor gradient is off-policy.
- Target networks for both actor and critic, with parameters $\theta'$ and $\phi'$, updated by Polyak averaging (Polyak & Juditsky, 1992; Lillicrap et al., 2016) rather than DQN’s hard copy:

$$\theta' \leftarrow \tau\, \theta + (1 - \tau)\, \theta', \qquad \phi' \leftarrow \tau\, \phi + (1 - \tau)\, \phi', \tag{13}$$

with small $\tau$ (e.g. 0.005) applied every gradient step. This produces a slowly tracking target whose updates are smoother than DQN’s periodic hard sync. (We re-use the symbol $\tau$, distinct from the trajectory $\tau$ of the previous chapter; context disambiguates.)
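The soft update is a one-line exponential moving average per parameter tensor. A minimal sketch, assuming parameters are stored as dictionaries of NumPy arrays (the dictionary layout and names are hypothetical):

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.005):
    """Polyak-average the online parameters into the target dictionary."""
    for name, online in online_params.items():
        target_params[name] = tau * online + (1.0 - tau) * target_params[name]
    return target_params

# Hypothetical parameter dictionaries for a tiny network.
online = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
target = {"w": np.zeros(2), "b": np.zeros(1)}
soft_update(target, online, tau=0.1)       # target moves 10% of the way toward online
```

Setting `tau=1.0` recovers DQN’s hard copy; a small `tau` makes the target lag the online network by roughly `1 / tau` gradient steps.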
A third practical ingredient is needed because the policy is now deterministic and provides no exploration on its own: the agent acts in the environment with additive exploration noise,

$$a_t = \mu_\theta(s_t) + \epsilon_t, \tag{14}$$

where $\epsilon_t$ is drawn from a Gaussian or a temporally-correlated Ornstein–Uhlenbeck process (Lillicrap et al., 2016). The resulting noisy actions are stored in the replay buffer; the clean deterministic action $\mu_\theta(s)$ is used inside the gradient.
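An Ornstein–Uhlenbeck process can be simulated with a simple Euler discretization. In the sketch below, the defaults `theta=0.15` and `sigma=0.2` follow the values reported by Lillicrap et al. (2016); the step size `dt`, the class layout, and the clipping of actions to `[-1, 1]` are assumptions made for illustration. Note that `theta` here is the mean-reversion rate of the noise process, not the actor parameters.

```python
import numpy as np

class OUNoise:
    """Discretized Ornstein-Uhlenbeck process: mean-reverting, temporally correlated noise."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=0):
        self.size, self.mu, self.theta, self.sigma, self.dt = size, mu, theta, sigma, dt
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean (typically at the start of each episode)."""
        self.state = np.full(self.size, self.mu, dtype=float)

    def sample(self):
        drift = self.theta * (self.mu - self.state) * self.dt          # pull back toward mu
        diffusion = self.sigma * np.sqrt(self.dt) * self.rng.normal(size=self.size)
        self.state = self.state + drift + diffusion
        return self.state

def noisy_action(mu_s, noise, low=-1.0, high=1.0):
    """Exploration action: actor output plus noise, clipped to assumed action bounds."""
    return np.clip(mu_s + noise, low, high)

ou = OUNoise(size=2)
a_explore = noisy_action(np.array([0.3, -0.8]), ou.sample())
```

Because each sample starts from the previous state, consecutive noise values are correlated, which tends to produce smoother exploratory motion than independent Gaussian draws.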
Figure 5: DDPG combines a deterministic actor with a Q-critic, trained off-policy from a replay buffer with target-network smoothing. The actor is updated by passing the critic’s action-gradient back through the chain rule (DPG theorem); the critic is trained on TD targets formed with the target actor and critic. Exploration is added as noise to the deterministic action at environment-interaction time.
The algorithm
Putting all the pieces together gives the DDPG algorithm of Lillicrap et al. (2016):
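Initialize actor $\mu_\theta$ and critic $Q_\phi$ with random parameters; copy them to targets $\theta' \leftarrow \theta$, $\phi' \leftarrow \phi$. Initialize an empty replay buffer $\mathcal{D}$.
For each environment step:
1. Observe state $s_t$. Take action $a_t = \mu_\theta(s_t) + \epsilon_t$ with exploration noise $\epsilon_t$.
2. Execute $a_t$; observe reward $r_t$, next state $s_{t+1}$, and termination flag $d_t$. Store $(s_t, a_t, r_t, s_{t+1}, d_t)$ in $\mathcal{D}$.
3. Sample a random mini-batch of $B$ transitions $(s_i, a_i, r_i, s'_i, d_i)$ from $\mathcal{D}$.
4. Compute the TD target for each transition: $y_i = r_i + \gamma\, (1 - d_i)\, Q_{\phi'}\bigl(s'_i, \mu_{\theta'}(s'_i)\bigr)$.
5. Take a gradient step on the critic loss $L(\phi) = \frac{1}{B} \sum_i \bigl(Q_\phi(s_i, a_i) - y_i\bigr)^2$.
6. Take a gradient step on the actor with the DPG estimator $\nabla_\theta J \approx \frac{1}{B} \sum_i \nabla_\theta \mu_\theta(s_i)\, \nabla_a Q_\phi(s_i, a)\big|_{a = \mu_\theta(s_i)}$.
7. Soft-update both targets: $\theta' \leftarrow \tau\, \theta + (1 - \tau)\, \theta'$, $\phi' \leftarrow \tau\, \phi + (1 - \tau)\, \phi'$.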
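The data-handling half of this loop (store, sample, form TD targets, evaluate the critic loss) can be sketched without committing to a neural-network library. In the block below the buffer class, the function names, and the lambda stand-ins for the target actor and target critic are all hypothetical; a real implementation would use networks and an autodiff framework for the gradient steps.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity, seed=0):
        self.storage = deque(maxlen=capacity)   # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.storage.append((s, a, r, s_next, float(done)))

    def sample(self, batch_size):
        batch = self.rng.sample(list(self.storage), batch_size)   # uniform, without replacement
        s, a, r, s_next, d = (np.array(x) for x in zip(*batch))
        return s, a, r, s_next, d

def td_targets(r, s_next, done, target_actor, target_critic, gamma=0.99):
    """y = r + gamma * (1 - done) * Q_target(s', mu_target(s'))."""
    next_a = target_actor(s_next)
    return r + gamma * (1.0 - done) * target_critic(s_next, next_a)

def critic_loss(q_values, targets):
    """Mean squared TD error over the mini-batch; the targets are held fixed."""
    return float(np.mean((q_values - targets) ** 2))

# Hypothetical stand-ins for the target networks, with scalar states and actions.
target_actor = lambda s: 2.0 * s
target_critic = lambda s, a: s + a

buffer = ReplayBuffer(capacity=100)
for i in range(10):
    buffer.add(s=float(i), a=0.0, r=1.0, s_next=float(i + 1), done=(i == 9))
s, a, r, s_next, d = buffer.sample(4)
y = td_targets(r, s_next, d, target_actor, target_critic, gamma=0.5)
```

The `(1 - done)` factor removes the bootstrap term for transitions that ended an episode, and the targets are computed from the target networks only, so they do not move when the online critic takes its gradient step.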
What DDPG inherited and what it changed
| From DQN (Mnih et al., 2015) | From DPG (Silver et al., 2014) |
|---|---|
| Replay buffer for sample efficiency and decorrelation | Deterministic actor |
| Target networks for stable bootstrapping | Off-policy, IS-correction-free actor gradient |
| Mini-batch SGD on a TD-style critic loss | Critic gradient flows back into actor parameters |
DDPG was among the first algorithms to make deep RL competitive on continuous-control benchmarks such as MuJoCo and simulated robotics (Lillicrap et al., 2016), and it remains the conceptual backbone of many modern continuous-control actor-critics. TD3 (Fujimoto et al., 2018) addresses its overestimation bias with clipped double-Q targets, delayed policy updates, and target-policy smoothing; SAC (Haarnoja et al., 2018) swaps the deterministic actor for a maximum-entropy stochastic one. In both cases the off-policy actor-critic-with-replay structure is the same.
Summary
Actor-critic methods combine value estimation with direct policy optimization through gradients:
- The policy gradient theorem lets us replace the Monte-Carlo return $G_t$ with any estimate of $Q^{\pi_\theta}(s_t, a_t)$: the gradient stays unbiased whenever that estimate is, and a learned critic trades some bias for much lower variance (Sutton et al., 2000).
- The advantage $A^{\pi_\theta}(s, a)$ is the canonical low-variance multiplier for the policy gradient; the one-step TD error $\delta_t$ is the simplest practical estimate of it.
- A2C (Mnih et al., 2016) is the synchronous deep instantiation: parallel on-policy rollouts, $n$-step or GAE (Schulman et al., 2016) advantages, a combined policy/value/entropy loss. It is the natural deep extension of REINFORCE-with-a-baseline.
- The deterministic policy gradient theorem (Silver et al., 2014) turns the actor’s gradient into a chain rule through the critic: no log-probability, no importance sampling, off-policy by construction.
- DDPG (Lillicrap et al., 2016) applies the DPG theorem with DQN’s replay buffer and target networks to give one of the first practical, sample-efficient deep RL algorithms for continuous control.
In the next chapter we will look at PPO (Schulman et al., 2017), the on-policy stochastic actor-critic that has become the de facto default in deep RL, used for everything from robot manipulation to instruction-tuning of large language models (Ouyang et al., 2022), and see how its clipped surrogate objective addresses the step-size problem that A2C does not.
- Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3–4), 229–256. 10.1007/BF00992696
- Konda, V. R., & Tsitsiklis, J. N. (1999). Actor-Critic Algorithms. Advances in Neural Information Processing Systems, 12, 1008–1014. https://papers.nips.cc/paper_files/paper/1999/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
- Hugging Face. (2024). Hugging Face Deep Reinforcement Learning Course. Online course. https://huggingface.co/learn/deep-rl-course/en/unit0/introduction
- Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous Methods for Deep Reinforcement Learning. Proceedings of the 33rd International Conference on Machine Learning (ICML), 48, 1928–1937. https://arxiv.org/abs/1602.01783
- Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous Control with Deep Reinforcement Learning. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1509.02971
- Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic Policy Gradient Algorithms. Proceedings of the 31st International Conference on Machine Learning (ICML), 32, 387–395. https://proceedings.mlr.press/v32/silver14.html
- Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning. Journal of Machine Learning Research, 5, 1471–1530. https://www.jmlr.org/papers/v5/greensmith04a.html
- Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1506.02438
- Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. Advances in Neural Information Processing Systems, 12, 1057–1063. https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-a
- Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
- OpenAI. (2017). OpenAI Baselines: ACKTR & A2C. OpenAI Blog. https://openai.com/index/openai-baselines-acktr-a2c/
- Wu, Y., Mansimov, E., Liao, S., Grosse, R., & Ba, J. (2017). Scalable Trust-Region Method for Deep Reinforcement Learning Using Kronecker-Factored Approximation. Advances in Neural Information Processing Systems 30 (NeurIPS). https://arxiv.org/abs/1708.05144
- Polyak, B. T., & Juditsky, A. B. (1992). Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 30(4), 838–855. 10.1137/0330046
- Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-Level Control through Deep Reinforcement Learning. Nature, 518(7540), 529–533. 10.1038/nature14236