Actor-Critic Methods - Introduction to Deep Reinforcement Learning

The previous chapter showed that we can optimize a parameterized policy $\pi_\theta(a\mid s)$ directly by gradient ascent on expected return, using the score-function estimator (Williams, 1992). REINFORCE is unbiased and conceptually clean, and its gradient can be computed without differentiating through the environment. It also taught us, by example, why nobody ships REINFORCE in production: a single Monte-Carlo return $G_t$ is a very noisy estimate of how good action $a_t$ was, and that noise propagates into the gradient. We saw at the end of the last chapter that subtracting a state-only baseline $b(s_t)$ leaves the gradient unbiased while reducing variance, and that the natural baseline is the state-value function $V^{\pi}(s_t)$ , which turns the multiplier $G_t$ into $G_t - V^{\pi}(s_t)$ , a one-sample estimate of the advantage $A^{\pi}(s_t,a_t) = Q^{\pi}(s_t,a_t) - V^{\pi}(s_t)$ .

This chapter takes the next step: replace the Monte-Carlo return with a learned critic. The result is the actor-critic family, which sits at the heart of modern deep RL (Konda & Tsitsiklis, 1999; Sutton & Barto, 2018; Hugging Face, 2024). We will state the policy gradient theorem that makes this substitution legal in expectation, derive the simplest one-step actor-critic update, and then look at two of the most important deep instantiations: A2C (Mnih et al., 2016) for discrete and continuous control with stochastic policies, and DDPG (Lillicrap et al., 2016), built on the deterministic policy gradient theorem (Silver et al., 2014), for continuous control with a deterministic actor.


The Variance Problem in REINFORCE

Recall the REINFORCE-with-baseline gradient estimator:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,\bigl(G_t - V^{\pi}(s_t)\bigr)\right]. \tag{1}
$$

The estimator is unbiased, but $G_t$ is a whole-episode sum of stochastic rewards. Two distinct sources of randomness pile up inside it: the environment’s transitions and rewards, and the policy’s own action sampling. Two trajectories that started in the same state $s_t$ and took the same action $a_t$ can produce wildly different returns, just because the rest of the episode unfolded differently. This is the well-documented variance problem of Monte-Carlo policy gradients (Williams, 1992; Greensmith et al., 2004; Sutton & Barto, 2018): the estimator is unbiased, but driving the noise down requires many trajectories per gradient step, and that is precisely the sample inefficiency that motivates everything in this chapter.

A learned baseline $V_w(s_t)$ helps, but only by removing the state-dependent part of the variance — the part of $G_t$ that is “expected before the action is taken.” It does nothing about the variance contributed by the rest of the trajectory after $a_t$ . The natural next step is to replace the entire $G_t$ with a learned estimate.
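To make the effect of a baseline concrete, the sketch below enumerates a hypothetical one-step problem with two actions exactly, with no sampling involved. The rewards (10 and 11), the single-logit softmax parameterization, and the function name `grad_stats` are illustrative assumptions, not part of the chapter's setup. Because the episode ends after one step there is no "rest of the trajectory" here, so the example only shows the state-dependent offset that a baseline removes.

```python
import math

def grad_stats(theta, rewards, baseline):
    """Exact mean and variance of the score-function gradient sample
    d/dtheta log pi(a) * (R(a) - baseline) for a two-action softmax
    policy with logits (theta, 0). Illustrative helper, not a library API."""
    p0 = 1.0 / (1.0 + math.exp(-theta))      # pi(a = 0)
    probs = [p0, 1.0 - p0]
    scores = [1.0 - p0, -p0]                 # d/dtheta log pi(a) for a = 0, 1
    samples = [scores[a] * (rewards[a] - baseline) for a in range(2)]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var

rewards = [10.0, 11.0]                       # hypothetical deterministic rewards
value = 0.5 * rewards[0] + 0.5 * rewards[1]  # V under the uniform policy (theta = 0)

mean_plain, var_plain = grad_stats(0.0, rewards, baseline=0.0)
mean_base, var_base = grad_stats(0.0, rewards, baseline=value)
print("no baseline:", mean_plain, var_plain)
print("V baseline: ", mean_base, var_base)
```

Both calls return the same mean gradient, while the variance drops sharply once the value is subtracted. In this toy case the rewards are deterministic, so the baseline removes all of the variance; with stochastic multi-step returns it would not.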

The Policy Gradient Theorem

The REINFORCE gradient is one form of $\nabla_\theta J(\theta)$ : the one that uses a sampled return. There is a more general statement, the policy gradient theorem (Sutton et al., 2000), that holds for any sufficiently nice MDP and any differentiable parameterization $\pi_\theta$ :

$$
\nabla_\theta J(\theta) = \mathbb{E}_{s\sim d^{\pi_\theta},\, a\sim \pi_\theta(\cdot\mid s)}\left[ \nabla_\theta \log \pi_\theta(a\mid s)\, Q^{\pi_\theta}(s,a) \right], \tag{2}
$$

where $d^{\pi_\theta}(s)$ is the (discounted) state-visitation distribution induced by $\pi_\theta$ and $Q^{\pi_\theta}(s,a)$ is the action-value function under $\pi_\theta$ . The Monte-Carlo estimator (1) is recovered by substituting $G_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t,a_t)$ .

Note

“Sufficiently nice” is shorthand for a short list of regularity conditions (Sutton et al., 2000; Sutton & Barto, 2018; Konda & Tsitsiklis, 1999): rewards are bounded (so $V^{\pi_\theta}$ and $Q^{\pi_\theta}$ are finite for every $(s,a)$ ), the discount satisfies $\gamma \in [0,1)$ (or, in the average-reward formulation, the Markov chain induced by $\pi_\theta$ is ergodic so that a unique stationary distribution $d^{\pi_\theta}$ exists; Puterman, 1994), and the parameterization $\pi_\theta(a\mid s)$ is differentiable in $\theta$ with enough integrability that $\nabla_\theta$ commutes with the sum or integral over states. In finite MDPs the last condition is automatic; for continuous state/action spaces it follows from the dominated-convergence-type smoothness assumptions used by Silver et al. (2014) in the appendix of the deterministic-policy version. In ordinary deep-RL practice (softmax or Gaussian policies, $\gamma$ strictly less than 1, bounded or clipped rewards, episodic environments with resets) all of these hold trivially. They start to bite only at edge cases: divergent returns from unbounded rewards, transient states in a non-ergodic continuing chain, or non-differentiable policies such as a hard $\arg\max$ . The deterministic case (where $\nabla_\theta\log\pi_\theta$ no longer makes sense) requires the separate theorem of Silver et al. (2014) that we use later in this chapter.

The theorem matters because the multiplier $Q^{\pi_\theta}(s,a)$ is a function of $(s,a)$ alone — it does not depend on the rest of the trajectory. If we had access to $Q^{\pi_\theta}$ exactly, we would get a much lower-variance estimator than $G_t$ , at no cost in bias. We don’t — but we can learn an approximation.

Subtracting a baseline gives the advantage form

Subtracting any function $b(s)$ that depends only on the state from the multiplier in (2) leaves the expectation unchanged, by exactly the same argument used for REINFORCE in the previous chapter. A natural and effective choice is $b(s) = V^{\pi_\theta}(s)$ , which captures the state-dependent component of the return that every action shares. This choice yields the advantage form:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{s,a}\left[\nabla_\theta \log \pi_\theta(a\mid s)\, A^{\pi_\theta}(s,a)\right], \qquad A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s). \tag{3}
$$

The advantage answers a more focused question than $Q^{\pi_\theta}$ alone: how much better is action $a$ than what the policy would do on average from state $s$ ? Centering by $V^{\pi_\theta}$ removes the part of $Q^{\pi_\theta}$ that is the same for every action and thus carries no useful gradient signal. $V^{\pi_\theta}$ is the de facto standard baseline in modern actor-critic methods, though it is not strictly variance-minimizing: Greensmith et al. (2004) show that the optimal scalar baseline is a score-norm-weighted version of $Q^{\pi_\theta}$ , and that even using the true value function as the critic can be suboptimal. In practice the simplification to $V^{\pi_\theta}$ is universal because it ties cleanly into the value-function machinery we already need.
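For reference, the argument that a state-only baseline leaves the expectation unchanged fits on one line. It is written here for a discrete action set; the continuous case replaces the sum with an integral, under the regularity conditions noted earlier.

$$
\mathbb{E}_{a\sim\pi_\theta(\cdot\mid s)}\bigl[\nabla_\theta \log \pi_\theta(a\mid s)\, b(s)\bigr]
= b(s)\sum_a \pi_\theta(a\mid s)\,\nabla_\theta \log \pi_\theta(a\mid s)
= b(s)\sum_a \nabla_\theta \pi_\theta(a\mid s)
= b(s)\,\nabla_\theta \sum_a \pi_\theta(a\mid s)
= b(s)\,\nabla_\theta 1 = 0.
$$

The step that matters is $\pi_\theta \nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta$ : it is what would fail if $b$ were allowed to depend on the action, since $b$ could then not be pulled out of the sum.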

Actor-Critic: The General Recipe

Actor-critic methods turn the advantage form (3) into an algorithm by parameterizing two networks:

An actor $\pi_\theta(a\mid s)$ — the policy we ultimately care about.
A critic $V_w(s)$ (or sometimes $Q_w(s,a)$ ) — a learned approximation of the policy’s value function.

Think of the actor as a player and the critic as a friend watching over their shoulder, telling them after each move whether things went better or worse than expected (Hugging Face, 2024). The actor uses that feedback to push the probability of good actions up and bad ones down; the critic, in parallel, fits its predictions to what actually happens.

TD-bootstrapped advantage

We almost never have $A^{\pi_\theta}$ in closed form. The simplest replacement is the one-step temporal-difference (TD) error:

$$
\delta_t = r_{t+1} + \gamma\, V_w(s_{t+1}) - V_w(s_t). \tag{4}
$$

Two facts make $\delta_t$ a useful surrogate for the advantage. First, conditioning on $(s_t, a_t)$ and substituting the true value function $V^{\pi_\theta}$ for $V_w$ ,

$$
\mathbb{E}\bigl[r_{t+1} + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\;\big|\;s_t,a_t\bigr] = Q^{\pi_\theta}(s_t,a_t) - V^{\pi_\theta}(s_t) = A^{\pi_\theta}(s_t,a_t), \tag{5}
$$

so, when the critic is exact, $\delta_t$ is an unbiased one-sample estimate of the advantage. Second, it uses a bootstrap: the right-hand side of (4) mixes one real reward with the critic’s own prediction at the next state. Bootstrapping introduces bias when $V_w \neq V^{\pi_\theta}$ , but it dramatically reduces variance compared to the full Monte-Carlo return $G_t$ , which sums many noisy reward terms. This is the same bias/variance trade-off that distinguishes TD from MC for value-based methods (Sutton & Barto, 2018), now applied to the critic of a policy-gradient method.
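The first equality in (5) is the Bellman equation for $Q^{\pi_\theta}$ in disguise. Spelled out, using only the definitions already introduced:

$$
\begin{aligned}
\mathbb{E}\bigl[r_{t+1} + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\;\big|\;s_t,a_t\bigr]
&= \mathbb{E}\bigl[r_{t+1} + \gamma V^{\pi_\theta}(s_{t+1})\;\big|\;s_t,a_t\bigr] - V^{\pi_\theta}(s_t) \\
&= Q^{\pi_\theta}(s_t,a_t) - V^{\pi_\theta}(s_t).
\end{aligned}
$$

The first line pulls $V^{\pi_\theta}(s_t)$ out of the expectation because it is a constant once $s_t$ is given. The second uses $Q^{\pi_\theta}(s,a) = \mathbb{E}[r + \gamma V^{\pi_\theta}(s') \mid s,a]$ , the one-step Bellman expansion of the action value.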

Figure 2: The actor-critic architecture. The actor $\pi_\theta(a\mid s)$ samples actions; the critic $V_w(s)$ scores states. The bootstrap disagreement between the critic’s two predictions and the observed reward, the TD error $\delta_t$ , provides both an advantage signal for the actor (blue update) and a regression target for the critic (orange update).

One-step actor-critic update

With $\delta_t$ playing the role of $A^{\pi_\theta}(s_t,a_t)$ , each environment step yields two updates:

$$
\theta \leftarrow \theta + \alpha_\pi\, \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, \delta_t, \tag{6}
$$

$$
w \leftarrow w + \alpha_V\, \delta_t\, \nabla_w V_w(s_t). \tag{7}
$$

The actor update (6) has the same form as the REINFORCE update, with $\delta_t$ in place of $G_t - b(s_t)$ . The critic update (7) is semi-gradient TD(0): we move $V_w(s_t)$ toward the bootstrapped target $r_{t+1} + \gamma V_w(s_{t+1})$ , treating the target as fixed (no gradient flows through $V_w(s_{t+1})$ ).

Putting these together gives the one-step actor-critic algorithm:

Initialize actor parameters $\theta$ and critic parameters $w$ . Choose learning rates $\alpha_\pi, \alpha_V > 0$ and discount $\gamma$ .
For each environment step:
1. Observe state $s_t$ . Sample action $a_t \sim \pi_\theta(\cdot\mid s_t)$ .
2. Execute $a_t$ ; observe reward $r_{t+1}$ and next state $s_{t+1}$ .
3. Compute the TD error $\delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)$ , taking $V_w(s_{t+1}) = 0$ if $s_{t+1}$ is terminal.
4. Update the actor: $\theta \leftarrow \theta + \alpha_\pi\, \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, \delta_t$ .
5. Update the critic: $w \leftarrow w + \alpha_V\, \delta_t\, \nabla_w V_w(s_t)$ .

Figure 3: The one-step actor-critic loop. Each environment step produces a TD error $\delta_t$ that simultaneously serves as an advantage estimate for the actor (blue) and a regression residual for the critic (orange). Compare with the REINFORCE flow: we no longer wait for the episode to end before updating.

The loop runs entirely online — no episode boundary required, no Monte-Carlo return to compute. This is the structural change that makes actor-critic substantially more sample-efficient than REINFORCE.

A subtle point: this one-step actor-critic is already a complete algorithm. We could run it on a single environment, with single-sample gradients, and it would (slowly) work. The deep-RL twist that we will see next — collecting transitions in parallel from $N$ environments before each gradient step — is a practical fix for the noisy-single-sample problem when training neural networks, not part of the actor-critic recipe itself.
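As a concrete instance of the one-step loop, here is a minimal tabular sketch. It assumes a hypothetical two-state episodic chain (`step`), a softmax actor with one logit per state-action pair, and a lookup table for $V_w$ ; the environment, the hyperparameters, and all function names are illustrative choices rather than anything prescribed by the chapter. Terminal transitions use a next-state value of zero.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def step(state, action):
    """Hypothetical chain: returns (reward, next_state, done)."""
    if state == 0:
        if action == 1:
            return 0.0, 1, False     # move on to state 1, no reward yet
        return 0.0, None, True       # quit early with nothing
    return 1.0, None, True           # state 1 always pays +1 and ends

def train(num_episodes=3000, alpha_pi=0.1, alpha_v=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0], [0.0, 0.0]]  # actor logits theta[state][action]
    v = [0.0, 0.0]                    # tabular critic V_w(s)
    for _ in range(num_episodes):
        s, done = 0, False
        while not done:
            probs = softmax(theta[s])
            a = 0 if rng.random() < probs[0] else 1
            r, s_next, done = step(s, a)
            v_next = 0.0 if done else v[s_next]
            delta = r + gamma * v_next - v[s]          # TD error (4)
            # Actor update (6): for softmax logits,
            # d log pi(a|s) / d theta[s][b] = 1[b == a] - pi(b|s).
            for b in range(2):
                grad_log = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += alpha_pi * delta * grad_log
            # Critic update (7): the gradient of a table entry is 1 at s.
            v[s] += alpha_v * delta
            s = s_next
    return theta, v

theta, v = train()
print("pi(.|s=0) =", softmax(theta[0]))
print("V =", v)
```

In this chain the only way to earn reward from state 0 is to continue to state 1, so the actor has to learn that through the bootstrapped term $\gamma V_w(s_{t+1})$ : the reward itself is never observed at state 0.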

A spectrum of advantage estimators

The one-step TD error and the full Monte-Carlo return are the extremes of a continuum. Between them sit the n-step returns

$$
G^{(n)}_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_w(s_{t+n}), \tag{8}
$$

and generalized advantage estimation (GAE) (Schulman et al., 2016), which takes an exponentially weighted average of all $n$ -step advantage estimates, with the weights controlled by a parameter $\lambda \in [0,1]$ . As $n$ (or $\lambda$ ) increases, the estimator uses more real reward and less bootstrap, trading bias down and variance up: $\lambda = 0$ recovers the one-step TD error and $\lambda = 1$ the Monte-Carlo advantage. Many modern implementations of A2C and PPO use GAE.
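The two estimators can be sketched in a few lines of plain Python. The function names and the list-based interface are assumptions made for illustration, and the GAE routine handles a single rollout segment with no terminal state inside it (a real implementation would also mask at episode boundaries).

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Equation (8): rewards = [r_{t+1}, ..., r_{t+n}], bootstrap = V_w(s_{t+n})."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def gae_advantages(rewards, values, last_value, gamma, lam):
    """GAE over one segment without terminations.
    rewards[t] is r_{t+1}, values[t] is V_w(s_t), last_value is V_w(s_T)."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        running = delta + gamma * lam * running               # discounted sum of deltas
        advantages[t] = running
        next_value = values[t]
    return advantages

rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]
td_only = gae_advantages(rewards, values, last_value=0.2, gamma=0.9, lam=0.0)
full = gae_advantages(rewards, values, last_value=0.2, gamma=0.9, lam=1.0)
print(td_only)
print(full)
```

With `lam=0.0` every entry is the one-step TD error; with `lam=1.0` the first entry equals the 3-step return of (8) minus `values[0]`, which is the other end of the spectrum for this segment.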

A2C: Advantage Actor-Critic

The most pedagogically clean deep actor-critic algorithm is A2C, the synchronous variant of A3C (“asynchronous advantage actor-critic”; Mnih et al., 2016). A3C ran many independent worker threads that asynchronously applied gradients from their own rollouts to a shared set of parameters. A2C, popularized by OpenAI’s baselines release (OpenAI, 2017) and used as the policy-gradient backbone of ACKTR (Wu et al., 2017), instead runs $N$ environments in lockstep and aggregates their data into one synchronous gradient step per round. OpenAI (2017) report that, empirically, the synchronous variant matches or exceeds A3C while making far better use of GPU batching.

The algorithm

Each iteration:

1. Reset (if needed) $N$ parallel environments. Run the current policy $\pi_\theta$ in all of them for $n$ steps, producing a batch of $N \times n$ transitions.
2. For each step in the batch, compute the bootstrapped $n$ -step return target and advantage estimate using the current critic $V_w$ :
  $$\hat R_t = \sum_{k=0}^{n-1}\gamma^k r_{t+k+1} + \gamma^n V_w(s_{t+n}), \qquad \hat A_t = \hat R_t - V_w(s_t), \tag{9}$$
  or the corresponding GAE estimate (Schulman et al., 2016).
3. Compute the combined loss, summed across the batch:
  $$\mathcal{L}(\theta, w) = -\underbrace{\sum_t \log \pi_\theta(a_t\mid s_t)\,\hat A_t}_{\text{policy loss}} + c_V \underbrace{\sum_t \bigl(\hat R_t - V_w(s_t)\bigr)^2}_{\text{value loss}} - c_H \underbrace{\sum_t H\bigl(\pi_\theta(\cdot\mid s_t)\bigr)}_{\text{entropy bonus}}, \tag{10}$$
  where $H(\pi(\cdot\mid s)) = -\sum_a \pi(a\mid s)\log\pi(a\mid s)$ is the entropy of the action distribution (the differential entropy in the continuous case).
4. Take one gradient step on $\mathcal{L}$ with respect to $(\theta, w)$ . The same updated parameters are then used for the next round of rollouts in all $N$ environments.

The coefficients $c_V, c_H > 0$ are hyperparameters; common defaults inherited from A3C and the OpenAI baselines implementations are $c_V \approx 0.5$ and $c_H \approx 0.01$ (Mnih et al., 2016; OpenAI, 2017).
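To see how the three terms of the combined loss fit together, the sketch below evaluates its scalar value for a small batch of discrete-action samples. The function name `a2c_loss` and its list-based inputs are hypothetical. In an autodiff framework the advantage would additionally be treated as a constant (no gradient through it) in the policy term, which a plain numeric evaluation like this cannot show.

```python
import math

def a2c_loss(probs, actions, returns, values, c_v=0.5, c_h=0.01):
    """Scalar value of the combined A2C loss for a batch.
    probs[t]   : list of action probabilities pi_theta(.|s_t)
    actions[t] : index of the sampled action a_t
    returns[t] : bootstrapped return target R_hat_t
    values[t]  : critic prediction V_w(s_t)"""
    policy_term = 0.0
    value_term = 0.0
    entropy_term = 0.0
    for p, a, ret, v in zip(probs, actions, returns, values):
        advantage = ret - v                         # A_hat_t
        policy_term += math.log(p[a]) * advantage
        value_term += (ret - v) ** 2
        entropy_term += -sum(q * math.log(q) for q in p if q > 0.0)
    return -policy_term + c_v * value_term - c_h * entropy_term

batch_loss = a2c_loss(
    probs=[[0.5, 0.5], [0.9, 0.1]],
    actions=[0, 1],
    returns=[1.0, 0.0],
    values=[0.0, 0.5],
)
print(batch_loss)
```

Holding everything else fixed, a more spread-out action distribution gives a lower loss, which is the sense in which the entropy term acts as a bonus.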

Figure 4: A2C runs $N$ environments synchronously, each taking $n$ steps with the current policy. The combined batch of $N \times n$ transitions yields one gradient update; the updated shared policy is then used by all environments for the next round. Parallelism alone provides the sample decorrelation that DQN gets from a replay buffer.

Why parallel environments?

A2C is on-policy: the gradient of the policy term in (10) is only a valid policy-gradient estimate if the data was collected under the current $\pi_\theta$ . That rules out a DQN-style replay buffer of stale transitions. But on-policy training from a single environment leaves us with the same problem REINFORCE had: every gradient step depends on a small number of highly correlated trajectories. Running $N$ environments in parallel gives us a near-i.i.d.-looking batch at the current policy, which is enough to make stochastic gradient descent behave well (Mnih et al., 2016). It is also a particularly clean fit for modern accelerators: $N$ parallel rollouts can be batched into a single GPU forward pass.

The entropy bonus

The $-c_H H(\pi_\theta(\cdot\mid s_t))$ term in (10) is a small positive reward for keeping the policy distribution spread out. Without it, an actor-critic agent that finds even a moderately good action tends to collapse to a near-deterministic policy too quickly, before the critic has had a chance to evaluate alternatives. The entropy bonus is a soft exploration mechanism; the same idea reappears prominently in PPO and SAC.

Continuous Control: The Deterministic Policy Gradient Theorem

A2C handles continuous actions by sampling from a Gaussian policy, just like the discrete case but with a different output head — and this works well in practice (PPO with Gaussian heads remains a strong baseline for continuous control). A different idea, going back to Silver et al. (2014), is to abandon stochastic policies for control and learn a deterministic mapping $a = \mu_\theta(s)$ directly, training it by gradient flow through a critic that knows how the value changes with the action. This trades the variance from action sampling for a different, often smaller, source of variance — an estimate of $\nabla_a Q$ — and, as we will see, makes off-policy training natural.

A deterministic actor also sidesteps what made DQN impractical for continuous actions: solving $\arg\max_a Q_\theta(s,a)$ at every step. The actor $\mu_\theta(s)$ stands in for that maximization, and the deterministic policy gradient theorem makes the substitution work because the actor is differentiable, not because the argmax becomes any easier.

The theorem

Let $\mu_\theta : \mathcal{S} \to \mathcal{A}$ be a deterministic policy with parameters $\theta$ , $Q^\mu(s,a)$ the action-value function under $\mu_\theta$ , and $\rho^\mu(s)$ the discounted state-visitation distribution under $\mu_\theta$ — the same object as $d^{\pi_\theta}$ in (2), written $\rho^\mu$ here to follow the notation of Silver et al. (2014). Under standard regularity conditions, Silver et al. (2014) show that

$$
\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s\sim \rho^\mu}\left[ \nabla_\theta\, \mu_\theta(s)\, \nabla_a Q^\mu(s,a)\big|_{a = \mu_\theta(s)} \right]. \tag{11}
$$

Read (11) as a chain rule. Define a scalar objective $J(\theta) = \mathbb{E}_{s\sim\rho^\mu}\bigl[Q^\mu(s, \mu_\theta(s))\bigr]$ . Differentiating through the actor’s output gives, pointwise in $s$ ,

$$
\nabla_\theta\, Q^\mu(s, \mu_\theta(s)) = \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a=\mu_\theta(s)}, \tag{12}
$$

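which is exactly the integrand on the right-hand side of (11). This reading treats $Q^\mu$ and $\rho^\mu$ as fixed; the content of the theorem is that their own dependence on $\theta$ adds no further terms. The “policy gradient” then reduces to: for each state, find the direction in action space that increases $Q$ , then ask the actor’s parameters to move the action that way.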
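The chain-rule reading can be checked numerically on a toy problem. The sketch below assumes a scalar linear actor $\mu_\theta(s) = \theta s$ and a hypothetical fixed critic $Q(s,a) = -(a - 2s)^2$ whose best action is $a = 2s$ ; neither comes from the chapter, and the critic is held fixed rather than learned. It compares the DPG expression with a finite-difference derivative and then runs a few ascent steps.

```python
def mu(theta, s):
    return theta * s                      # toy linear actor

def q(s, a):
    return -(a - 2.0 * s) ** 2            # hypothetical critic, maximized at a = 2s

def dq_da(s, a):
    return -2.0 * (a - 2.0 * s)           # analytic action-gradient of q

def dpg_gradient(theta, states):
    """Sample average of grad_theta mu(s) * grad_a Q(s, a) at a = mu(s)."""
    total = 0.0
    for s in states:
        dmu_dtheta = s                    # derivative of theta * s w.r.t. theta
        total += dmu_dtheta * dq_da(s, mu(theta, s))
    return total / len(states)

def finite_difference(theta, states, eps=1e-6):
    def objective(th):
        return sum(q(s, mu(th, s)) for s in states) / len(states)
    return (objective(theta + eps) - objective(theta - eps)) / (2.0 * eps)

def ascend(theta, states, lr=0.1, steps=200):
    for _ in range(steps):
        theta += lr * dpg_gradient(theta, states)
    return theta

states = [-1.0, 0.5, 2.0]
print(dpg_gradient(0.5, states), finite_difference(0.5, states))
print(ascend(0.5, states))
```

The two gradient values agree up to finite-difference error, and gradient ascent moves $\theta$ toward 2, the parameter at which the actor outputs the critic's preferred action in every state.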

Off-policy is free

A second consequence: because $\mu_\theta$ is deterministic, the action that the behavior policy actually took does not appear in the gradient. Only the state $s$ appears, and the actor’s own current output $\mu_\theta(s)$ is plugged into $Q$ . We can therefore evaluate the integrand on data collected by any behavior policy without an importance-sampling correction over actions. The states are then drawn from the behavior policy’s visitation distribution rather than $\rho^\mu$ , an approximation that the off-policy variant of Silver et al. (2014) accepts. That is the key to making the algorithm off-policy and replay-buffer-friendly, which is exactly what we want for sample efficiency in continuous control.

DDPG: Deep Deterministic Policy Gradient

DDPG is what you get when you combine the deterministic policy gradient theorem with the two stabilizing tricks from DQN (Lillicrap et al., 2016):

A replay buffer of transitions, sampled uniformly at random for each update — possible because the DPG-based actor gradient is off-policy.
Target networks for both actor and critic, updated by Polyak averaging (Polyak & Juditsky, 1992; Lillicrap et al., 2016) rather than DQN’s hard copy:
$$\theta' \leftarrow \tau\theta + (1-\tau)\theta', \qquad w' \leftarrow \tau w + (1-\tau) w', \tag{13}$$
with small $\tau$ (e.g. 0.005) applied every gradient step. This produces a slowly tracking target whose updates are smoother than DQN’s periodic hard sync. (We re-use the symbol $\tau$ , distinct from the trajectory $\tau$ of the previous chapter; context disambiguates.)
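The soft update is a one-liner per parameter. A minimal sketch, assuming parameters are stored as flat Python lists of floats (in a deep-learning framework the same loop would run over parameter tensors); `soft_update` is an illustrative name.

```python
def soft_update(target_params, online_params, tau=0.005):
    """In-place Polyak update: target <- tau * online + (1 - tau) * target."""
    for i in range(len(target_params)):
        target_params[i] = tau * online_params[i] + (1.0 - tau) * target_params[i]
    return target_params

target = [0.0, 1.0]
online = [1.0, 1.0]
for _ in range(1000):
    soft_update(target, online)
print(target)
```

With the online parameters held fixed, the gap between target and online shrinks by a factor of $(1-\tau)$ per update, so with $\tau = 0.005$ the target needs on the order of hundreds of steps to catch up. That lag is what keeps the bootstrap target stable.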

A third practical ingredient is needed because the policy is now deterministic and provides no exploration on its own: the agent acts in the environment with additive exploration noise,

$$
a_t = \mu_\theta(s_t) + \epsilon_t, \tag{14}
$$

where $\epsilon_t$ is drawn from a Gaussian or a temporally-correlated Ornstein–Uhlenbeck process (Lillicrap et al., 2016). The resulting noisy actions are stored in the replay buffer; the clean deterministic action $\mu_\theta(s)$ is used inside the gradient.
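A temporally correlated noise source of this kind can be sketched as a discretized Ornstein–Uhlenbeck process. The class name, the default parameter values, the scalar action, and the clipping of the noisy action to $[-1, 1]$ are all illustrative assumptions; note that the process's mean-reversion rate is called `theta` below by convention and is unrelated to the actor parameters $\theta$ .

```python
import math
import random

class OUNoise:
    """Discretized Ornstein-Uhlenbeck process for a scalar action:
    x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.x = mu

    def reset(self):
        """Call at the start of each episode."""
        self.x = self.mu

    def sample(self):
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x

def noisy_action(mu_action, noise, low=-1.0, high=1.0):
    """Exploration action: deterministic output plus noise, clipped to bounds."""
    return max(low, min(high, mu_action + noise.sample()))

noise = OUNoise(seed=0)
actions = [noisy_action(0.3, noise) for _ in range(5)]
print(actions)
```

Because each sample starts from the previous one, consecutive perturbations are correlated, which tends to produce smoother exploratory motion than independent Gaussian noise of the same scale.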

Figure 5: DDPG combines a deterministic actor $\mu_\theta(s)$ with a Q-critic, trained off-policy from a replay buffer with target-network smoothing. The actor is updated by passing the critic’s action-gradient $\nabla_a Q_w(s,a)$ back through the chain rule (DPG theorem); the critic is trained on TD targets formed with the *target* actor and critic. Exploration is added as noise to the deterministic action at environment-interaction time.

The algorithm

Putting all the pieces together gives the DDPG algorithm of Lillicrap et al. (2016):

Initialize actor $\mu_\theta$ and critic $Q_w$ with random parameters; copy them to targets $\theta' \leftarrow \theta$ , $w' \leftarrow w$ . Initialize an empty replay buffer $\mathcal{D}$ .
For each environment step:
1. Observe state $s$ . Take action $a = \mu_\theta(s) + \epsilon$ with exploration noise $\epsilon$ .
2. Execute $a$ ; observe reward $r$ , next state $s'$ , and termination flag $\text{done}$ . Store $(s,a,r,s',\text{done})$ in $\mathcal{D}$ .
3. Sample a random mini-batch $\mathcal{B}$ of transitions from $\mathcal{D}$ .
4. Compute the TD target for each transition:
  $$y = \begin{cases} r & \text{if done},\\ r + \gamma\, Q_{w'}\!\bigl(s',\, \mu_{\theta'}(s')\bigr) & \text{otherwise.}\end{cases} \tag{15}$$
5. Take a gradient step on the critic loss
  $$\mathcal{L}(w) = \frac{1}{|\mathcal{B}|}\sum_{\mathcal{B}}\bigl(y - Q_w(s,a)\bigr)^2. \tag{16}$$
6. Take a gradient ascent step on the actor with the DPG estimator
  $$\nabla_\theta J(\theta) \approx \frac{1}{|\mathcal{B}|}\sum_{\mathcal{B}} \nabla_\theta\, \mu_\theta(s)\, \nabla_a Q_w(s,a)\big|_{a=\mu_\theta(s)}. \tag{17}$$
7. Soft-update both targets: $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$ , $w' \leftarrow \tau w + (1-\tau)w'$ .
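The data-handling half of the loop above (storing transitions, sampling a mini-batch, forming TD targets, and evaluating the critic loss) can be sketched without any deep-learning framework. The class and function names are hypothetical, and the actor and critic are passed in as plain callables standing in for the networks; the gradient steps themselves are omitted.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer of (s, a, r, s_next, done) tuples."""

    def __init__(self, capacity, seed=None):
        self.data = deque(maxlen=capacity)   # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return self.rng.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)

def td_targets(batch, target_actor, target_critic, gamma=0.99):
    """TD target per transition, using the *target* actor and critic."""
    targets = []
    for s, a, r, s_next, done in batch:
        if done:
            targets.append(r)
        else:
            targets.append(r + gamma * target_critic(s_next, target_actor(s_next)))
    return targets

def critic_loss(batch, targets, critic):
    """Mean squared error between the targets and Q_w(s, a) on stored actions."""
    errors = [(y - critic(s, a)) ** 2 for (s, a, _, _, _), y in zip(batch, targets)]
    return sum(errors) / len(batch)

buffer = ReplayBuffer(capacity=2, seed=0)
buffer.add(0.0, 0.0, 1.0, 1.0, False)
buffer.add(1.0, 0.0, 2.0, 0.0, True)
buffer.add(2.0, 0.5, 0.0, 1.0, False)        # evicts the oldest transition
batch = buffer.sample(2)
ys = td_targets(batch, lambda s: 2.0 * s, lambda s, a: s + a, gamma=0.5)
print(critic_loss(batch, ys, lambda s, a: 0.0))
```

Note that the critic loss is evaluated on the stored (noisy) action `a`, while the target uses the target actor's clean output at `s_next`, mirroring the distinction made in the steps above.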

What DDPG inherited and what it changed

| From DQN (Mnih et al., 2015) | From DPG (Silver et al., 2014) |
| --- | --- |
| Replay buffer for sample efficiency and decorrelation | Deterministic actor $\mu_\theta(s)$ |
| Target networks for stable bootstrapping | Off-policy, IS-correction-free actor gradient |
| Mini-batch SGD on a TD-style critic loss | Critic gradient $\nabla_a Q$ flows back into actor parameters |

DDPG was the algorithm that made deep RL competitive on continuous-control benchmarks (MuJoCo, simulated robotics) for the first time (Lillicrap et al., 2016), and it remains the conceptual backbone of modern off-policy continuous-control actor-critics. TD3 (Fujimoto et al., 2018) addresses its overestimation bias and instability with clipped double-Q targets, delayed policy updates, and target-policy smoothing; SAC (Haarnoja et al., 2018) swaps the deterministic actor for a maximum-entropy stochastic one. In both, the off-policy actor-critic-with-replay structure is the same.

Summary

Actor-critic methods combine value estimation with direct policy optimization through gradients:

The policy gradient theorem lets us replace the Monte-Carlo return $G_t$ with any unbiased estimate of $Q^{\pi_\theta}(s,a)$ without biasing the gradient; a learned critic is a practical approximation that trades some bias for lower variance (Sutton et al., 2000).
The advantage $A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s)$ is the canonical low-variance multiplier for the policy gradient; the one-step TD error $\delta_t$ is the simplest practical estimate of it.
A2C (Mnih et al., 2016) is the synchronous deep instantiation: parallel on-policy rollouts, n-step or GAE (Schulman et al., 2016) advantages, a combined policy/value/entropy loss. It is the natural deep extension of REINFORCE-with-a-baseline.
The deterministic policy gradient theorem (Silver et al., 2014) turns the actor’s gradient into a chain rule through the critic: no log-probability, no importance sampling, off-policy by construction.
DDPG (Lillicrap et al., 2016) applies the DPG theorem with DQN’s replay buffer and target networks to give the first stable, sample-efficient continuous-control deep RL algorithm.

In the next chapter we will look at PPO (Schulman et al., 2017), the on-policy stochastic actor-critic that has become the de facto default in deep RL and the algorithm behind everything from robot manipulation to instruction-tuning of large language models (Ouyang et al., 2022), and see how its clipped surrogate objective addresses the step-size problem that A2C does not.

References

Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3–4), 229–256. https://doi.org/10.1007/BF00992696
Konda, V. R., & Tsitsiklis, J. N. (1999). Actor-Critic Algorithms. Advances in Neural Information Processing Systems, 12, 1008–1014. https://papers.nips.cc/paper_files/paper/1999/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
Hugging Face. (2024). Hugging Face Deep Reinforcement Learning Course. Online course. https://huggingface.co/learn/deep-rl-course/en/unit0/introduction
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous Methods for Deep Reinforcement Learning. Proceedings of the 33rd International Conference on Machine Learning (ICML), 48, 1928–1937. https://arxiv.org/abs/1602.01783
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous Control with Deep Reinforcement Learning. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1509.02971
Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic Policy Gradient Algorithms. Proceedings of the 31st International Conference on Machine Learning (ICML), 32, 387–395. https://proceedings.mlr.press/v32/silver14.html
Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning. Journal of Machine Learning Research, 5, 1471–1530. https://www.jmlr.org/papers/v5/greensmith04a.html
Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1506.02438
Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. Advances in Neural Information Processing Systems, 12, 1057–1063. https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-a
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
OpenAI. (2017). OpenAI Baselines: ACKTR & A2C. OpenAI Blog. https://openai.com/index/openai-baselines-acktr-a2c/
Wu, Y., Mansimov, E., Liao, S., Grosse, R., & Ba, J. (2017). Scalable Trust-Region Method for Deep Reinforcement Learning Using Kronecker-Factored Approximation. Advances in Neural Information Processing Systems 30 (NeurIPS). https://arxiv.org/abs/1708.05144
Polyak, B. T., & Juditsky, A. B. (1992). Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 30(4), 838–855. https://doi.org/10.1137/0330046
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-Level Control through Deep Reinforcement Learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236