Actor-Critic Methods

Policy Gradients with a Learned Critic — A2C and DDPG

The previous chapter showed that we can optimize a parameterized policy $\pi_\theta(a\mid s)$ directly by gradient ascent on expected return, using the score-function estimator (Williams, 1992). REINFORCE is unbiased, conceptually clean, and its gradient can be computed without differentiating through the environment. It also taught us, by example, why REINFORCE is rarely used on its own in practice: a single Monte-Carlo return $G_t$ is a very noisy estimate of how good action $a_t$ was, and that noise propagates into the gradient. We saw at the end of the last chapter that subtracting a state-only baseline $b(s_t)$ leaves the gradient unbiased while reducing variance, and that the natural baseline is the state-value function $V^{\pi}(s_t)$. With that choice the multiplier $G_t$ becomes $G_t - V^{\pi}(s_t)$, a single-sample estimate of the advantage $A^{\pi}(s_t,a_t)$.

This chapter takes the next step: replace the Monte-Carlo return with a learned critic. The result is the actor-critic family, which sits at the heart of modern deep RL (Konda & Tsitsiklis, 1999; Sutton & Barto, 2018; Hugging Face, 2024). We will state the policy gradient theorem that makes this substitution legal in expectation, derive the simplest one-step actor-critic update, and then look at two of the most important deep instantiations: A2C (Mnih et al., 2016) for discrete and continuous control with stochastic policies, and DDPG (Lillicrap et al., 2016), built on the deterministic policy gradient theorem (Silver et al., 2014), for continuous control with a deterministic actor.

The Variance Problem in REINFORCE

Recall the REINFORCE-with-baseline gradient estimator:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,\bigl(G_t - V^{\pi}(s_t)\bigr)\right]. \tag{1}$$

The estimator is unbiased, but $G_t$ is a whole-episode sum of stochastic rewards. Two distinct sources of randomness pile up inside it: the environment’s transitions and rewards, and the policy’s own action sampling. Two trajectories that started in the same state $s_t$ and took the same action $a_t$ can produce wildly different returns, just because the rest of the episode unfolded differently. This is the well-documented variance problem of Monte-Carlo policy gradients (Williams, 1992; Greensmith et al., 2004; Sutton & Barto, 2018): the estimator is unbiased, but driving the noise down requires many trajectories per gradient step, and that is precisely the sample inefficiency that motivates everything in this chapter.

A learned baseline $V_w(s_t)$ helps, but only by removing the state-dependent part of the variance, the part of $G_t$ that is “expected before the action is taken.” It does nothing about the variance contributed by the rest of the trajectory after $a_t$. The natural next step is to replace the entire $G_t$ with a learned estimate.

The Policy Gradient Theorem

The REINFORCE gradient is one form of $\nabla_\theta J(\theta)$: the one that uses a sampled return. There is a more general statement, the policy gradient theorem (Sutton et al., 2000), that holds for any sufficiently nice MDP and any differentiable parameterization $\pi_\theta$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s\sim d^{\pi_\theta},\, a\sim \pi_\theta(\cdot\mid s)}\left[ \nabla_\theta \log \pi_\theta(a\mid s)\, Q^{\pi_\theta}(s,a) \right], \tag{2}$$

where $d^{\pi_\theta}(s)$ is the (discounted) state-visitation distribution induced by $\pi_\theta$ and $Q^{\pi_\theta}(s,a)$ is the action-value function under $\pi_\theta$. The Monte-Carlo estimator (1) is recovered by substituting $G_t$ as an unbiased sample of $Q^{\pi_\theta}(s_t,a_t)$.

The theorem matters because the multiplier $Q^{\pi_\theta}(s,a)$ is a function of $(s,a)$ alone: it does not depend on the rest of the trajectory. If we had access to $Q^{\pi_\theta}$ exactly, we would get a much lower-variance estimator than $G_t$, at no cost in bias. We don’t, but we can learn an approximation.

Subtracting a baseline gives the advantage form

Subtracting any function $b(s)$ that depends only on the state from the multiplier in (2) leaves the expectation unchanged, by exactly the same argument used for REINFORCE in the previous chapter. A natural and effective choice is $b(s) = V^{\pi_\theta}(s)$, which captures the state-dependent component of the return that every action shares, and which yields the advantage form:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s,a}\left[\nabla_\theta \log \pi_\theta(a\mid s)\, A^{\pi_\theta}(s,a)\right], \qquad A^{\pi_\theta}(s,a) = Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s). \tag{3}$$

The advantage answers a more focused question than $Q^{\pi_\theta}$ alone: how much better is action $a$ than what the policy would do on average from state $s$? Centering by $V^{\pi_\theta}$ removes the part of $Q^{\pi_\theta}$ that is the same for every action and thus carries no useful gradient signal. $V^{\pi_\theta}$ is the de facto standard baseline in modern actor-critic methods, though it is not strictly variance-minimizing: Greensmith et al. (2004) show that the optimal scalar baseline is a score-norm-weighted average of $Q^{\pi_\theta}$, and that even using the true value function as the critic can be suboptimal. In practice the simplification to $V^{\pi_\theta}$ is nearly universal because it ties cleanly into the value-function machinery we already need.
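The claim that a state-only baseline leaves the expected gradient unchanged can be checked exactly for a small softmax policy. The sketch below is illustrative only: the logits and Q-values are made-up numbers for a single state, and `expected_gradient` is a hypothetical helper that sums over all actions instead of sampling.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def expected_gradient(logits, multipliers):
    """Exact E_a[grad_logits log pi(a) * multipliers[a]] for a softmax policy in one state."""
    pi = softmax(logits)
    grad = [0.0] * len(logits)
    for a, p_a in enumerate(pi):
        for b in range(len(logits)):
            # d log pi(a) / d logit_b = 1[a == b] - pi(b)
            score = (1.0 if a == b else 0.0) - pi[b]
            grad[b] += p_a * score * multipliers[a]
    return grad

logits = [0.2, -0.5, 1.0]      # hypothetical actor logits for one state
q_values = [1.0, 3.0, 2.0]     # hypothetical Q(s, a) for the three actions
pi = softmax(logits)
v = sum(p * q for p, q in zip(pi, q_values))   # V(s) = E_a[Q(s, a)]
advantages = [q - v for q in q_values]         # A(s, a) = Q(s, a) - V(s)

grad_q = expected_gradient(logits, q_values)      # multiplier Q, as in the theorem
grad_adv = expected_gradient(logits, advantages)  # multiplier A, the advantage form
```

Under these assumptions `grad_q` and `grad_adv` agree up to floating-point error. The difference between the two forms only shows up in the variance of sampled estimates, not in their mean.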

Actor-Critic: The General Recipe

Actor-critic methods turn the advantage form (3) into an algorithm by parameterizing two networks: an actor $\pi_\theta(a\mid s)$ that selects actions, and a critic $V_w(s)$ that estimates the value of states under the current policy.

Think of the actor as a player and the critic as a friend watching over their shoulder, telling them after each move whether things went better or worse than expected (Hugging Face, 2024). The actor uses that feedback to push the probability of good actions up and bad ones down; the critic, in parallel, fits its predictions to what actually happens.

TD-bootstrapped advantage

We almost never have $A^{\pi_\theta}$ in closed form. The simplest replacement is the one-step temporal-difference (TD) error:

$$\delta_t = r_{t+1} + \gamma\, V_w(s_{t+1}) - V_w(s_t). \tag{4}$$

Two facts make $\delta_t$ a useful surrogate for the advantage. First, conditioning on $(s_t, a_t)$ and substituting the true value function $V^{\pi_\theta}$ for $V_w$,

$$\mathbb{E}\bigl[r_{t+1} + \gamma V^{\pi_\theta}(s_{t+1}) - V^{\pi_\theta}(s_t)\;\big|\;s_t,a_t\bigr] = Q^{\pi_\theta}(s_t,a_t) - V^{\pi_\theta}(s_t) = A^{\pi_\theta}(s_t,a_t), \tag{5}$$

so $\delta_t$ is a one-sample estimate of the advantage. Second, it uses a bootstrap: the right-hand side mixes one real reward with the critic’s own prediction at the next state. Bootstrapping introduces bias when $V_w \neq V^{\pi_\theta}$, but it dramatically reduces variance compared to the full Monte-Carlo return $G_t$, which sums many noisy reward terms. This is the same bias/variance trade-off that distinguishes TD from MC for value-based methods (Sutton & Barto, 2018), now applied to the critic of a policy-gradient method.
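As a small numerical illustration of the TD error, the sketch below evaluates it for one transition. The function name `td_error` and the numbers are hypothetical, and the `done` flag (which drops the bootstrap term at a terminal state) is an assumption added here that does not appear in the formula above.

```python
def td_error(reward, value_s, value_next, gamma=0.99, done=False):
    """One-step TD error: r + gamma * V(s') - V(s), with no bootstrap at terminal states."""
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s

# The critic predicted 2.5 for s, then saw reward 1.0 and predicts 2.0 for s'.
delta = td_error(reward=1.0, value_s=2.5, value_next=2.0, gamma=0.9)
# 1.0 + 0.9 * 2.0 - 2.5 is roughly 0.3: the step went better than expected.

# Same transition, but s' is terminal, so nothing is bootstrapped.
delta_terminal = td_error(reward=1.0, value_s=2.5, value_next=2.0, gamma=0.9, done=True)
# 1.0 - 2.5 = -1.5: the step went worse than expected.
```

A positive value tells the actor to make the chosen action more likely, a negative value less likely.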

Figure 2: The actor-critic architecture. The actor $\pi_\theta(a\mid s)$ samples actions; the critic $V_w(s)$ scores states. The bootstrap disagreement between the critic’s two predictions and the observed reward, the TD error $\delta_t$, provides both an advantage signal for the actor (blue update) and a regression target for the critic (orange update).

One-step actor-critic update

With $\delta_t$ playing the role of $A^{\pi_\theta}(s_t,a_t)$, each environment step yields two updates:

$$\theta \leftarrow \theta + \alpha_\pi\, \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, \delta_t, \tag{6}$$

$$w \leftarrow w + \alpha_V\, \delta_t\, \nabla_w V_w(s_t). \tag{7}$$

The actor update (6) has the same form as the REINFORCE-with-baseline update, but with $\delta_t$ in place of $G_t - b(s_t)$. The critic update (7) is semi-gradient TD(0): we move $V_w(s_t)$ toward the bootstrapped target $r_{t+1} + \gamma V_w(s_{t+1})$, treating the target as fixed.

Putting these together gives the one-step actor-critic algorithm:

  1. Initialize actor parameters $\theta$ and critic parameters $w$. Choose learning rates $\alpha_\pi, \alpha_V > 0$ and discount $\gamma$.

  2. For each environment step:

    1. Observe state $s_t$. Sample action $a_t \sim \pi_\theta(\cdot\mid s_t)$.

    2. Execute $a_t$; observe reward $r_{t+1}$ and next state $s_{t+1}$.

    3. Compute the TD error $\delta_t = r_{t+1} + \gamma V_w(s_{t+1}) - V_w(s_t)$.

    4. Update the actor: $\theta \leftarrow \theta + \alpha_\pi\, \nabla_\theta \log \pi_\theta(a_t\mid s_t)\, \delta_t$.

    5. Update the critic: $w \leftarrow w + \alpha_V\, \delta_t\, \nabla_w V_w(s_t)$.

Figure 3: The one-step actor-critic loop. Each environment step produces a TD error $\delta_t$ that simultaneously serves as an advantage estimate for the actor (blue) and a regression residual for the critic (orange). Compare with the REINFORCE flow: we no longer wait for the episode to end before updating.

The loop runs entirely online — no episode boundary required, no Monte-Carlo return to compute. This is the structural change that makes actor-critic substantially more sample-efficient than REINFORCE.
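To make the loop concrete, here is a minimal tabular sketch of one-step actor-critic. Everything about the setting is an assumption chosen for illustration: a five-state corridor with terminal states at both ends, reward 1 for reaching the right end, a softmax actor with one logit per (state, action), a lookup-table critic, and hypothetical hyperparameters. It is not a reference implementation.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def train_actor_critic(episodes=3000, alpha_pi=0.1, alpha_v=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    n_states, start = 5, 2                          # states 0 and 4 are terminal
    theta = [[0.0, 0.0] for _ in range(n_states)]   # actor logits per state: [left, right]
    v = [0.0] * n_states                            # critic table V_w(s)
    for _ in range(episodes):
        s = start
        for _ in range(100):                        # step cap as a safety net
            pi = softmax(theta[s])
            a = 0 if rng.random() < pi[0] else 1    # sample a ~ pi_theta(.|s)
            s_next = s - 1 if a == 0 else s + 1
            done = s_next in (0, n_states - 1)
            r = 1.0 if s_next == n_states - 1 else 0.0
            # TD error; terminal states have value zero
            delta = r + (0.0 if done else gamma * v[s_next]) - v[s]
            # Actor step: theta += alpha_pi * grad log pi(a|s) * delta
            for b in range(2):
                theta[s][b] += alpha_pi * ((1.0 if a == b else 0.0) - pi[b]) * delta
            # Critic step: semi-gradient TD(0); the gradient of a table entry is 1
            v[s] += alpha_v * delta
            if done:
                break
            s = s_next
    return theta, v

theta, v = train_actor_critic()
p_right_at_start = softmax(theta[2])[1]   # probability of moving right from the start state
```

Both updates happen inside the step loop, before the episode ends, which is the structural difference from REINFORCE. In this toy problem the actor should end up strongly preferring "right" at the start state.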

A subtle point: this one-step actor-critic is already a complete algorithm. We could run it on a single environment, with single-sample gradients, and it would (slowly) work. The deep-RL twist that we will see next, collecting transitions in parallel from $N$ environments before each gradient step, is a practical fix for the noisy-single-sample problem when training neural networks, not part of the actor-critic recipe itself.

A spectrum of advantage estimators

The one-step TD error and the full Monte-Carlo return are the extremes of a continuum. Between them sit the $n$-step returns

$$G^{(n)}_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_w(s_{t+n}), \tag{8}$$

and generalized advantage estimation (GAE) (Schulman et al., 2016), which takes an exponentially weighted average of all $n$-step advantages, controlled by a parameter $\lambda \in [0,1]$. As $n$ (or $\lambda$) increases, the estimator uses more real reward and less bootstrap, trading bias down and variance up. Many modern implementations of A2C and PPO use GAE.
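The two estimators can be sketched in a few lines. The helper names `n_step_return` and `gae` are hypothetical, the numbers are made up, and the sketch assumes a single trajectory segment with no episode termination inside it. GAE is written in its usual backward-recursive form over one-step TD errors.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """r_1 + gamma * r_2 + ... + gamma^(n-1) * r_n + gamma^n * V(s_n), computed backwards."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def gae(rewards, values, bootstrap_value, gamma, lam):
    """GAE advantages for one segment without terminations.

    values[t] is V_w(s_t) for each step; bootstrap_value is V_w at the state
    reached after the last step.
    """
    next_values = values[1:] + [bootstrap_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_values[t] - values[t]   # one-step TD error
        running = delta + gamma * lam * running                   # discounted sum of TD errors
        advantages[t] = running
    return advantages

rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5]       # hypothetical critic outputs V_w(s_t)
bootstrap = 0.8                # hypothetical V_w at the state after the last step
gamma = 0.9

adv_td = gae(rewards, values, bootstrap, gamma, lam=0.0)   # reduces to one-step TD errors
adv_mc = gae(rewards, values, bootstrap, gamma, lam=1.0)   # reduces to n-step return minus V
```

With `lam=0.0` each entry is the one-step TD error, and with `lam=1.0` the first entry equals `n_step_return(rewards, bootstrap, gamma) - values[0]`, which matches the two ends of the spectrum described above.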

A2C — Advantage Actor-Critic

The most pedagogically clean deep actor-critic algorithm is A2C, the synchronous variant of A3C (“asynchronous advantage actor-critic”) from Mnih et al. (2016). A3C used many independent worker threads that asynchronously applied gradients from their own rollouts to a shared set of parameters. A2C, popularized by OpenAI’s baselines release (OpenAI, 2017) and used as the policy-gradient backbone of ACKTR (Wu et al., 2017), instead runs $N$ environments in lockstep and aggregates their data into one synchronous gradient step per round. OpenAI (2017) reports that the synchronous variant performs at least as well as A3C in their experiments while making far better use of GPU batching.

The algorithm

Each iteration:

  1. Reset (if needed) $N$ parallel environments. Run the current policy $\pi_\theta$ in all of them for $n$ steps, producing a batch of $N \times n$ transitions.

  2. For each step in the batch, compute the bootstrapped $n$-step return target and advantage estimate using the current critic $V_w$:

    $$\hat R_t = \sum_{k=0}^{n-1}\gamma^k r_{t+k+1} + \gamma^n V_w(s_{t+n}), \qquad \hat A_t = \hat R_t - V_w(s_t), \tag{9}$$

    or the corresponding GAE estimate (Schulman et al., 2016).

  3. Compute the combined loss, summed across the batch:

    $$\mathcal{L}(\theta, w) = -\underbrace{\sum_t \log \pi_\theta(a_t\mid s_t)\,\hat A_t}_{\text{policy loss}} + c_V \underbrace{\sum_t \bigl(\hat R_t - V_w(s_t)\bigr)^2}_{\text{value loss}} - c_H \underbrace{\sum_t H\bigl(\pi_\theta(\cdot\mid s_t)\bigr)}_{\text{entropy bonus}}, \tag{10}$$

    where $H(\pi(\cdot\mid s)) = -\sum_a \pi(a\mid s)\log\pi(a\mid s)$ is the entropy of the action distribution (the differential entropy in the continuous case). The estimates $\hat A_t$ and $\hat R_t$ are treated as constants when differentiating.

  4. Take one gradient step on $\mathcal{L}$ with respect to $(\theta, w)$. The same updated parameters are then used for the next round of rollouts in all $N$ environments.

The coefficients $c_V, c_H > 0$ are hyperparameters; common defaults inherited from A3C and the OpenAI baselines implementations are $c_V \approx 0.5$ and $c_H \approx 0.01$ (Mnih et al., 2016; OpenAI, 2017).
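The combined loss can be evaluated by hand for a tiny batch. The sketch below assumes a discrete softmax policy given by raw logits; `a2c_loss` and its argument names are hypothetical. It only computes the scalar value of the loss. In an autodiff framework the advantages and return targets would additionally need to be detached so that no gradient flows through them.

```python
import math

def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def a2c_loss(logits_batch, actions, advantages, returns, values, c_v=0.5, c_h=0.01):
    """Scalar A2C loss: -sum(log pi * A) + c_v * sum((R - V)^2) - c_h * sum(entropy)."""
    policy_term = 0.0
    value_term = 0.0
    entropy_term = 0.0
    for logits, a, adv, ret, val in zip(logits_batch, actions, advantages, returns, values):
        logp = log_softmax(logits)
        policy_term += logp[a] * adv                             # log pi(a_t|s_t) * A_t
        value_term += (ret - val) ** 2                           # squared critic error
        entropy_term += -sum(math.exp(lp) * lp for lp in logp)   # H(pi(.|s_t))
    return -policy_term + c_v * value_term - c_h * entropy_term

# Two transitions, uniform policy over two actions, critic already on target.
loss = a2c_loss(
    logits_batch=[[0.0, 0.0], [0.0, 0.0]],
    actions=[0, 1],
    advantages=[1.0, -1.0],
    returns=[1.0, 0.5],
    values=[1.0, 0.5],
)
```

In this example the policy and value terms cancel to zero, so the loss is just the negated entropy bonus, `-0.01 * 2 * log(2)`. Minimizing it therefore rewards keeping the distribution spread out.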

Figure 4: A2C runs $N$ environments synchronously, each taking $n$ steps with the current policy. The combined batch of $N \times n$ transitions yields one gradient update; the updated shared policy is then used by all environments for the next round. Parallelism alone provides the sample decorrelation that DQN gets from a replay buffer.

Why parallel environments?

A2C is on-policy: the gradient of the loss (10) is only a valid policy-gradient estimate if the data was collected under the current $\pi_\theta$. That rules out a DQN-style replay buffer of stale transitions. But on-policy training from a single environment leaves us with the same problem REINFORCE had: every gradient step depends on a small number of highly correlated trajectories. Running $N$ environments in parallel gives us a batch at the current policy that is much closer to i.i.d., which is enough to make stochastic gradient descent behave well (Mnih et al., 2016). It is also a particularly clean fit for modern accelerators: the policy evaluations for $N$ parallel rollouts can be batched into a single GPU forward pass.

The entropy bonus

The $-c_H\, H(\pi_\theta(\cdot\mid s_t))$ term in (10) is a small positive reward for keeping the policy distribution spread out. Without it, an actor-critic agent that finds even a moderately good action tends to collapse to a near-deterministic policy too quickly, before the critic has had a chance to evaluate alternatives. The entropy bonus is a soft exploration mechanism; the same idea reappears prominently in PPO and SAC.

Continuous Control: The Deterministic Policy Gradient Theorem

A2C handles continuous actions by sampling from a Gaussian policy, just like the discrete case but with a different output head, and this works well in practice (PPO with Gaussian heads remains a strong baseline for continuous control). A different idea, going back to Silver et al. (2014), is to abandon stochastic policies for control and learn a deterministic mapping $a = \mu_\theta(s)$ directly, training it by gradient flow through a critic that knows how the value changes with the action. This trades the variance from action sampling for a different, often smaller, source of variance (an estimate of $\nabla_a Q$) and, as we will see, makes off-policy training natural.

The deterministic actor stands in for the step that is hard to carry over from DQN to continuous actions: solving $\arg\max_a Q_\theta(s,a)$ at every step. The deterministic policy gradient theorem makes the substitution work because the actor is differentiable, not because the argmax becomes any easier.

The theorem

Let $\mu_\theta : \mathcal{S} \to \mathcal{A}$ be a deterministic policy with parameters $\theta$, $Q^\mu(s,a)$ the action-value function under $\mu_\theta$, and $\rho^\mu(s)$ the discounted state-visitation distribution under $\mu_\theta$. This is the same object as $d^{\pi_\theta}$ in (2), written $\rho^\mu$ here to follow the notation of Silver et al. (2014). Under standard regularity conditions, Silver et al. (2014) show that

$$\nabla_\theta J(\mu_\theta) = \mathbb{E}_{s\sim \rho^\mu}\left[ \nabla_\theta\, \mu_\theta(s)\, \nabla_a Q^\mu(s,a)\big|_{a = \mu_\theta(s)} \right]. \tag{11}$$

Read (11) as a chain rule. Define a scalar objective $J(\theta) = \mathbb{E}_{s\sim\rho^\mu}\bigl[Q^\mu(s, \mu_\theta(s))\bigr]$. Differentiating through the actor’s output while holding $Q^\mu$ fixed as a function of $(s,a)$ gives, pointwise in $s$,

$$\nabla_\theta\, Q^\mu(s, \mu_\theta(s)) = \nabla_\theta \mu_\theta(s)\, \nabla_a Q^\mu(s, a)\big|_{a=\mu_\theta(s)}, \tag{12}$$

which is exactly the integrand on the right-hand side of (11). The theorem guarantees that no further terms are needed to account for the dependence of $\rho^\mu$ and $Q^\mu$ on $\theta$. The “policy gradient” reduces to: for each state, find the direction in action space that increases $Q$, then ask the actor’s parameters to move the action that way.
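The chain-rule reading can be tried on a toy problem with scalar states and actions. Everything in the sketch is an assumption made for illustration: a linear actor `theta * s`, a fixed quadratic stand-in for the critic that is maximized at `a = 2s`, and a hand-picked batch of states. A real critic would be learned and would itself change with the policy.

```python
def actor(theta, s):
    return theta * s                       # mu_theta(s), a hypothetical linear actor

def critic(s, a):
    return -(a - 2.0 * s) ** 2             # hypothetical fixed Q(s, a), best action is a = 2s

def dq_da(s, a):
    return -2.0 * (a - 2.0 * s)            # analytic grad_a Q(s, a)

def dpg_gradient(theta, states):
    """Batch average of grad_theta mu_theta(s) * grad_a Q(s, a) at a = mu_theta(s)."""
    total = 0.0
    for s in states:
        dmu_dtheta = s                     # grad_theta of theta * s
        total += dmu_dtheta * dq_da(s, actor(theta, s))
    return total / len(states)

states = [-1.0, 0.5, 2.0]
theta = 0.0
for _ in range(200):
    theta += 0.05 * dpg_gradient(theta, states)   # gradient ascent on Q(s, mu_theta(s))
```

No action is sampled anywhere in the update: the gradient comes entirely from how the critic's output changes with the action. Under these assumptions `theta` converges to 2, the slope of the critic's best action.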

Off-policy is free

A second consequence: because $\mu_\theta$ is deterministic, the action that the behavior policy actually took does not appear in the gradient. Only the state $s$ appears, and the actor’s own current output $\mu_\theta(s)$ is plugged into $Q$. We can therefore evaluate the integrand on data collected by any behavior policy without an importance-sampling correction over actions. Strictly, the states are then drawn from the behavior policy’s visitation distribution rather than from $\rho^\mu$, so the off-policy gradient is an approximation, but it is the one used in practice (Silver et al., 2014). That is the key to making the algorithm off-policy and replay-buffer-friendly, which is exactly what we want for sample efficiency in continuous control.

DDPG — Deep Deterministic Policy Gradient

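DDPG is what you get when you combine the deterministic policy gradient theorem with the two stabilizing tricks from DQN (Lillicrap et al., 2016): a replay buffer, which stores transitions and serves decorrelated mini-batches for off-policy updates, and target networks, slowly updated copies of the actor and critic that keep the bootstrapped TD targets stable.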

A third practical ingredient is needed because the policy is now deterministic and provides no exploration on its own: the agent acts in the environment with additive exploration noise,

$$a_t = \mu_\theta(s_t) + \epsilon_t,$$

where $\epsilon_t$ is drawn from a Gaussian or a temporally-correlated Ornstein–Uhlenbeck process (Lillicrap et al., 2016). The resulting noisy actions are stored in the replay buffer; the clean deterministic action $\mu_\theta(s)$ is used inside the gradient.
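A discretized Ornstein-Uhlenbeck process is short enough to sketch directly. The class and function names below are hypothetical. The defaults `theta=0.15` and `sigma=0.2` follow the values reported by Lillicrap et al. (2016), while the step size `dt`, the scalar action, and the clipping range are assumptions. Note that `theta` here is the mean-reversion rate of the noise process, not the actor's parameter vector.

```python
import math
import random

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: each sample drifts back toward mu and adds a Gaussian kick."""

    def __init__(self, mu=0.0, theta=0.15, sigma=0.2, dt=1.0, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.x = mu

    def reset(self):
        self.x = self.mu                    # typically called at the start of each episode

    def sample(self):
        # Euler-Maruyama step: mean reversion toward mu plus scaled Gaussian noise
        drift = self.theta * (self.mu - self.x) * self.dt
        diffusion = self.sigma * math.sqrt(self.dt) * self.rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x

def noisy_action(mu_action, noise, low=-1.0, high=1.0):
    """a_t = mu_theta(s_t) + eps_t, clipped to an assumed valid action range."""
    return min(high, max(low, mu_action + noise.sample()))

noise = OrnsteinUhlenbeckNoise(seed=0)
actions = [noisy_action(0.3, noise) for _ in range(5)]   # 0.3 stands in for mu_theta(s_t)
```

Because each sample depends on the previous one, consecutive actions are perturbed in similar directions, which can explore more consistently than independent Gaussian noise in systems with momentum.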

Figure 5: DDPG combines a deterministic actor $\mu_\theta(s)$ with a Q-critic, trained off-policy from a replay buffer with target-network smoothing. The actor is updated by passing the critic’s action-gradient $\nabla_a Q_w(s,a)$ back through the chain rule (DPG theorem); the critic is trained on TD targets formed with the target actor and critic. Exploration is added as noise to the deterministic action at environment-interaction time.

The algorithm

Putting all the pieces together gives the DDPG algorithm of Lillicrap et al. (2016):

  1. Initialize actor $\mu_\theta$ and critic $Q_w$ with random parameters; copy them to targets $\theta' \leftarrow \theta$, $w' \leftarrow w$. Initialize an empty replay buffer $\mathcal{D}$.

  2. For each environment step:

    1. Observe state $s$. Take action $a = \mu_\theta(s) + \epsilon$ with exploration noise $\epsilon$.

    2. Execute $a$; observe reward $r$, next state $s'$, and termination flag $\text{done}$. Store $(s,a,r,s',\text{done})$ in $\mathcal{D}$.

    3. Sample a random mini-batch $\mathcal{B}$ of transitions from $\mathcal{D}$.

    4. Compute the TD target for each transition:

      $$y = \begin{cases} r & \text{if done},\\ r + \gamma\, Q_{w'}\!\bigl(s',\, \mu_{\theta'}(s')\bigr) & \text{otherwise.}\end{cases}$$

    5. Take a gradient step on the critic loss

      $$\mathcal{L}(w) = \frac{1}{|\mathcal{B}|}\sum_{\mathcal{B}}\bigl(y - Q_w(s,a)\bigr)^2.$$

    6. Take a gradient step on the actor with the DPG estimator

      $$\nabla_\theta J(\theta) \approx \frac{1}{|\mathcal{B}|}\sum_{\mathcal{B}} \nabla_\theta\, \mu_\theta(s)\, \nabla_a Q_w(s,a)\big|_{a=\mu_\theta(s)}.$$

    7. Soft-update both targets: $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$, $w' \leftarrow \tau w + (1-\tau)w'$.
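The bookkeeping pieces of the algorithm above (replay buffer, TD target, soft target update) can be sketched without any neural-network library. The names `ReplayBuffer`, `td_target`, and `soft_update` are hypothetical, parameters are represented as flat lists of floats purely for illustration, and the defaults `gamma=0.99` and `tau=0.001` follow the values reported by Lillicrap et al. (2016).

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s_next, done) transitions with uniform sampling."""

    def __init__(self, capacity, seed=None):
        self.data = deque(maxlen=capacity)   # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.data.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return self.rng.sample(list(self.data), batch_size)

def td_target(r, done, q_target_next, gamma=0.99):
    """y = r if done, else r + gamma * Q_w'(s', mu_theta'(s'))."""
    return r if done else r + gamma * q_target_next

def soft_update(target_params, online_params, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta', elementwise over flat parameter lists."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, online_params)]

buffer = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buffer.add(s=float(i), a=0.0, r=1.0, s_next=float(i + 1), done=(i == 4))
batch = buffer.sample(2)                                   # two of the three newest transitions

y = td_target(r=1.0, done=False, q_target_next=2.0, gamma=0.5)   # 1.0 + 0.5 * 2.0 = 2.0
new_target = soft_update([0.0, 0.0], [1.0, 2.0], tau=0.1)        # roughly [0.1, 0.2]
```

With a small `tau` the target parameters trail the online ones slowly, which keeps the regression target `y` from shifting too quickly while the critic is being fit to it.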

What DDPG inherited and what it changed

| From DQN (Mnih et al., 2015) | From DPG (Silver et al., 2014) |
| --- | --- |
| Replay buffer for sample efficiency and decorrelation | Deterministic actor $\mu_\theta(s)$ |
| Target networks for stable bootstrapping | Off-policy, IS-correction-free actor gradient |
| Mini-batch SGD on a TD-style critic loss | Critic gradient $\nabla_a Q$ flows back into actor parameters |

DDPG was among the first algorithms to make deep RL work across a broad set of continuous-control benchmarks (MuJoCo, simulated robotics) (Lillicrap et al., 2016), and it remains the conceptual backbone of modern off-policy continuous-control actor-critics. TD3 (Fujimoto et al., 2018) addresses its overestimation bias with clipped double-Q targets, delayed policy updates, and target-policy smoothing; SAC (Haarnoja et al., 2018) swaps the deterministic actor for a maximum-entropy stochastic one. In both, the off-policy actor-critic-with-replay structure is the same.

Summary

Actor-critic methods combine value estimation with direct policy optimization through gradients:

In the next chapter we will look at PPO (Schulman et al., 2017), the on-policy stochastic actor-critic that has become the de facto default in deep RL, with applications ranging from robot manipulation to fine-tuning large language models to follow instructions (Ouyang et al., 2022), and see how its clipped surrogate objective addresses the step-size problem that A2C does not.

References
  1. Williams, R. J. (1992). Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8(3–4), 229–256. https://doi.org/10.1007/BF00992696
  2. Konda, V. R., & Tsitsiklis, J. N. (1999). Actor-Critic Algorithms. Advances in Neural Information Processing Systems, 12, 1008–1014. https://papers.nips.cc/paper_files/paper/1999/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html
  3. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
  4. Hugging Face. (2024). Hugging Face Deep Reinforcement Learning Course. Online course. https://huggingface.co/learn/deep-rl-course/en/unit0/introduction
  5. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T. P., Harley, T., Silver, D., & Kavukcuoglu, K. (2016). Asynchronous Methods for Deep Reinforcement Learning. Proceedings of the 33rd International Conference on Machine Learning (ICML), 48, 1928–1937. https://arxiv.org/abs/1602.01783
  6. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous Control with Deep Reinforcement Learning. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1509.02971
  7. Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., & Riedmiller, M. (2014). Deterministic Policy Gradient Algorithms. Proceedings of the 31st International Conference on Machine Learning (ICML), 32, 387–395. https://proceedings.mlr.press/v32/silver14.html
  8. Greensmith, E., Bartlett, P. L., & Baxter, J. (2004). Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning. Journal of Machine Learning Research, 5, 1471–1530. https://www.jmlr.org/papers/v5/greensmith04a.html
  9. Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1506.02438
  10. Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy Gradient Methods for Reinforcement Learning with Function Approximation. Advances in Neural Information Processing Systems, 12, 1057–1063. https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-a
  11. Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
  12. OpenAI. (2017). OpenAI Baselines: ACKTR & A2C. OpenAI Blog. https://openai.com/index/openai-baselines-acktr-a2c/
  13. Wu, Y., Mansimov, E., Liao, S., Grosse, R., & Ba, J. (2017). Scalable Trust-Region Method for Deep Reinforcement Learning Using Kronecker-Factored Approximation. Advances in Neural Information Processing Systems 30 (NeurIPS). https://arxiv.org/abs/1708.05144
  14. Polyak, B. T., & Juditsky, A. B. (1992). Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 30(4), 838–855. https://doi.org/10.1137/0330046
  15. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-Level Control through Deep Reinforcement Learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236