Proximal Policy Optimization

A First-Order Trust Region for On-Policy Actor-Critics

The previous chapter closed with a forward pointer: A2C is on-policy, has no built-in safety net on how far one gradient step can move the policy, and so it needs careful learning-rate tuning to keep training stable. This chapter takes that observation seriously and builds the algorithm that fixed it: Proximal Policy Optimization (PPO) (Schulman et al., 2017). PPO replaces the implicit “small step in parameter space” of vanilla policy gradients with an explicit bound on policy change in distribution space, and does so with a clipped surrogate objective that costs nothing more than first-order stochastic gradient descent (SGD). The result is the on-policy default in deep RL: the algorithm behind robotic manipulation, video-game agents, and the Reinforcement Learning from Human Feedback (RLHF) stage of large language model training (Ouyang et al., 2022; Hugging Face, 2024).

The Step-Size Problem in Policy Gradients

The actor-critic update (6) takes one SGD step on the policy parameters $\theta$ in the direction $\nabla_\theta \log\pi_\theta(a_t\mid s_t)\, \hat A_t$. Whether that step is good depends not on its norm in parameter space, but on whether the resulting policy distribution $\pi_{\theta+\Delta\theta}$ is still close to the policy $\pi_\theta$ that generated the data. Two facts conspire to make this delicate:

The consequence is the classic on-policy failure mode: one update pushes $\pi_\theta$ outside the region where the surrogate is trustworthy, the next batch of data is collected under that degraded policy, the gradient computed on that data points somewhere unhelpful, and recovery is slow or impossible. Vanilla on-policy methods cope with this only through small learning rates and careful schedules; A2C’s “tune $\alpha$ per environment” workflow is exactly the symptom.
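The gap between parameter-space distance and distribution-space distance can be made concrete with a toy example. The sketch below is illustrative only: the two-action softmax policy and the `scale` factor (a stand-in for how sensitive the logits are to the parameters) are assumptions, not part of the chapter's setup. The same parameter step produces a very different KL divergence depending on that sensitivity.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def policy(theta, scale):
    # Hypothetical parameterisation: logits are theta times a fixed scale.
    return softmax([scale * t for t in theta])

theta = [0.0, 0.0]
step = [0.1, -0.1]                      # identical step in parameter space
theta_new = [t + d for t, d in zip(theta, step)]

# Same parameter-space step, two different sensitivities of the policy.
kl_small = kl(policy(theta, 1.0), policy(theta_new, 1.0))
kl_large = kl(policy(theta, 20.0), policy(theta_new, 20.0))

print(f"KL with scale 1:  {kl_small:.4f}")
print(f"KL with scale 20: {kl_large:.4f}")
```

The step has the same norm in both cases, yet the second policy moves from uniform to nearly deterministic. A learning rate tuned for one regime says little about the other.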

Trust Regions and the Surrogate Objective

Conservative policy iteration (Kakade & Langford, 2002) motivates a different way of writing the policy-gradient objective. Instead of differentiating the expected return directly, define a surrogate that uses importance sampling to score a candidate policy $\pi_\theta$ on data collected by $\pi_{\theta_\text{old}}$:

$$
L^{\text{CPI}}(\theta) = \mathbb{E}_{(s,a)\sim\pi_{\theta_\text{old}}}\!\left[ \frac{\pi_\theta(a\mid s)}{\pi_{\theta_\text{old}}(a\mid s)}\, \hat A_t \right] = \mathbb{E}_t\!\left[r_t(\theta)\, \hat A_t\right], \tag{1}
$$

where the probability ratio

$$
r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)} \tag{2}
$$

equals 1 when $\theta = \theta_\text{old}$, so $L^{\text{CPI}}$ matches the on-policy expected return to first order at $\theta_\text{old}$: it has the same gradient, and the same value up to an additive constant. Maximising $L^{\text{CPI}}$ over many steps without any bound, however, would happily push $r_t(\theta)$ to enormous values for actions with $\hat A_t > 0$: exactly the step-size problem above, now expressed in ratio form.
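The gradient match at $\theta_\text{old}$ follows from one line of calculus. A short derivation, assuming $\hat A_t$ is treated as a constant with respect to $\theta$:

```latex
\nabla_\theta r_t(\theta)\Big|_{\theta_\text{old}}
  = \frac{\nabla_\theta \pi_\theta(a_t\mid s_t)\big|_{\theta_\text{old}}}
         {\pi_{\theta_\text{old}}(a_t\mid s_t)}
  = \nabla_\theta \log \pi_\theta(a_t\mid s_t)\Big|_{\theta_\text{old}}
\quad\Longrightarrow\quad
\nabla_\theta L^{\text{CPI}}(\theta)\Big|_{\theta_\text{old}}
  = \mathbb{E}_t\!\left[
      \nabla_\theta \log \pi_\theta(a_t\mid s_t)\Big|_{\theta_\text{old}}\, \hat A_t
    \right]
```

The right-hand side is the usual advantage-weighted score-function estimator, so the first SGD step on the surrogate coincides with an ordinary policy-gradient step. The two only diverge once $\theta$ has moved away from $\theta_\text{old}$.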

To prevent the policy from changing too much with each update, it’s helpful to introduce a measure of how different two probability distributions are. The Kullback–Leibler (KL) divergence is commonly used for this purpose: it quantifies the “distance” between the old and new policies by comparing their output distributions on the same states.

Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) incorporates this idea by maximizing $L^{\text{CPI}}$ while enforcing an explicit constraint on the average KL divergence between the old policy $\pi_{\theta_\text{old}}$ and the new policy $\pi_\theta$:

$$
\max_{\theta}\; L^{\text{CPI}}(\theta) \quad \text{subject to} \quad \mathbb{E}_{s\sim\pi_{\theta_\text{old}}}\!\left[\mathrm{KL}\bigl(\pi_{\theta_\text{old}}(\cdot\mid s)\,\|\,\pi_\theta(\cdot\mid s)\bigr)\right] \le \delta. \tag{3}
$$

The KL ball $\{\theta : \mathbb{E}[\mathrm{KL}] \le \delta\}$ is the trust region: outside it, the linearised surrogate may no longer reflect the true objective, so the constraint forbids the optimiser from going there.
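For discrete action spaces the constraint can be checked directly. The following is a minimal sketch, assuming categorical policies represented as plain probability lists, one per visited state. The helper names and the value of `delta` are illustrative and not taken from TRPO's reference implementation.

```python
import math

def categorical_kl(p_old, p_new):
    """KL(p_old || p_new) for two discrete action distributions."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)

def mean_kl(old_dists, new_dists):
    """Average KL over a batch of states visited by the old policy."""
    kls = [categorical_kl(po, pn) for po, pn in zip(old_dists, new_dists)]
    return sum(kls) / len(kls)

def inside_trust_region(old_dists, new_dists, delta):
    """True if the candidate policy satisfies the average-KL constraint."""
    return mean_kl(old_dists, new_dists) <= delta

# Action distributions of the old policy on two sampled states.
old = [[0.5, 0.5], [0.9, 0.1]]
small_move = [[0.52, 0.48], [0.88, 0.12]]   # mild change on both states
large_move = [[0.9, 0.1], [0.5, 0.5]]       # drastic change on both states
delta = 0.01                                 # illustrative trust-region radius

print(inside_trust_region(old, small_move, delta))
print(inside_trust_region(old, large_move, delta))
```

Note the argument order: the expectation and the first KL argument both use the old policy, matching the constraint in the text.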

PPO-Clip — The Central Idea

Enforcing (3) exactly requires second-order machinery: conjugate gradients, Fisher approximations, and a line search per update. That leaves a practical hole: trust-region logic is invaluable, yet the implementation fights the grain of mainstream autodiff. Each update computes a natural policy gradient direction: not a naive step in Euclidean $\theta$-space but one weighted by the curvature of $\log\pi_\theta$ with respect to the parameters (usually implemented via matrix–vector products with the Fisher information matrix). The optimiser must then run a line search: shorten the attempted step repeatedly and re-evaluate the KL divergence until the update truly lies inside the trust region. Inner linear solves and KL checks woven around plain backprop clash with how neural-network code is ordinarily structured. PPO-Clip is motivated as a way to approximate the same pessimism toward large swings in $\pi_\theta$ using only an ordinary differentiable loss.
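To give a feel for the line-search step, here is a simplified sketch on a state-free two-action softmax policy. It assumes the search direction is already given and only checks the KL constraint. A full TRPO line search also checks that the surrogate improves, and the function name, `shrink` factor, and `max_tries` are illustrative choices.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def backtracking_step(theta, direction, max_kl, shrink=0.5, max_tries=10):
    """Shrink the step until KL(old || candidate) fits inside the trust region."""
    p_old = softmax(theta)
    step_size = 1.0
    for _ in range(max_tries):
        candidate = [t + step_size * d for t, d in zip(theta, direction)]
        if kl(p_old, softmax(candidate)) <= max_kl:
            return candidate, step_size      # accepted
        step_size *= shrink                  # too far: halve and retry
    return list(theta), 0.0                  # reject the update entirely

theta = [0.0, 0.0]
direction = [2.0, -2.0]                      # assumed search direction
new_theta, accepted_step = backtracking_step(theta, direction, max_kl=0.01)
print(accepted_step, new_theta)
```

Every retry re-evaluates the policy and the KL divergence outside the usual forward/backward pass, which is the structural awkwardness the paragraph above describes.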

PPO replaces TRPO’s hard KL constraint with a soft, first-order bound baked directly into the loss. The clipped surrogate is

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\Bigl(\, r_t(\theta)\, \hat A_t,\; \text{clip}\bigl(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\,\hat A_t \Bigr) \right], \tag{4}
$$

where $\epsilon$ is a small constant (typically 0.1 to 0.3; Schulman et al. (2017) recommend 0.2). Two pieces are at play:

  1. The $\text{clip}(r, 1-\epsilon, 1+\epsilon)$ prevents the ratio from contributing a gradient if it has already moved outside the band $[1-\epsilon,\, 1+\epsilon]$, that is, if the new policy has already changed too much on this $(s,a)$.

  2. The pessimistic $\min$ takes the smaller of the unclipped and clipped surrogate values. Crucially, this means the clip only activates in the direction that improves the unclipped objective. Moves that worsen the unclipped objective stay unclipped, so the optimiser never gets to “save” a bad step by clipping it away.

Figure 2: The PPO clipped surrogate $L^{\text{CLIP}}$ as a function of the probability ratio $r_t(\theta)$, for the two signs of the advantage. The trust band $r_t \in [1-\epsilon, 1+\epsilon]$ is shaded blue; the orange dot marks the starting point $r_t = 1$ where $\theta = \theta_\text{old}$. The unclipped surrogate $r_t \hat A_t$ is the dashed grey line. The active branch of the $\min$ (the value the gradient sees) is the thick green line. The clip caps the gain only in the direction that would push the policy further from $\pi_{\theta_\text{old}}$: upward when $\hat A_t > 0$, downward when $\hat A_t < 0$. Adapted from Schulman et al. (2017).

It is worth walking through the six cases in Figure 2 explicitly (three ratio regimes for each sign of the advantage), because the asymmetry is the entire trick:

| Case | $r_t$ vs band | Active branch | Effect |
|---|---|---|---|
| $\hat A_t > 0$, $r_t \in [1-\epsilon, 1+\epsilon]$ | inside | $r_t \hat A_t$ | normal policy-gradient signal |
| $\hat A_t > 0$, $r_t > 1+\epsilon$ | already too high | $(1+\epsilon)\hat A_t$ (constant) | gradient is zero: stop pushing up |
| $\hat A_t > 0$, $r_t < 1-\epsilon$ | “safe” direction | $r_t \hat A_t$ | clip not active: moving back toward $r_t=1$ |
| $\hat A_t < 0$, $r_t \in [1-\epsilon, 1+\epsilon]$ | inside | $r_t \hat A_t$ | normal policy-gradient signal |
| $\hat A_t < 0$, $r_t < 1-\epsilon$ | already too low | $(1-\epsilon)\hat A_t$ (constant) | gradient is zero: stop pushing down |
| $\hat A_t < 0$, $r_t > 1+\epsilon$ | “safe” direction | $r_t \hat A_t$ | clip not active: moving back toward $r_t=1$ |

The pattern: when $\hat A_t > 0$ the action is “good” and the optimiser wants $r_t$ above 1, so the clip stops it from going above $1+\epsilon$; when $\hat A_t < 0$ the action is “bad” and the optimiser wants $r_t$ below 1, so the clip stops it from going below $1-\epsilon$. In the opposite direction the clip is silent: moving back toward $r_t = 1$ is always permitted, even when the ratio has wandered far. That asymmetry is what the $\min$ encodes.
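The case analysis above can be checked numerically. The sketch below evaluates the per-sample clipped surrogate and its derivative with respect to the ratio. The function names are illustrative, and the derivative is written out by hand here, whereas a real implementation would let autodiff produce it.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    unclipped = ratio * advantage
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps) * advantage
    return min(unclipped, clipped)

def surrogate_grad_wrt_ratio(ratio, advantage, eps=0.2):
    """Derivative of the per-sample objective with respect to the ratio."""
    if advantage > 0 and ratio > 1.0 + eps:
        return 0.0          # good action, ratio already too high: flat branch
    if advantage < 0 and ratio < 1.0 - eps:
        return 0.0          # bad action, ratio already too low: flat branch
    return advantage        # everywhere else the unclipped branch is active

# One representative ratio per regime, for each sign of the advantage.
for adv in (1.0, -1.0):
    for ratio in (1.0, 1.5, 0.5):
        value = clipped_surrogate(ratio, adv)
        grad = surrogate_grad_wrt_ratio(ratio, adv)
        print(f"A={adv:+.0f}  r={ratio:.1f}  L={value:+.2f}  dL/dr={grad:+.1f}")
```

Only two of the six combinations produce a zero derivative, and both are cases where the ratio has already moved in the direction the advantage rewards.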

The Full PPO Loss

Just like A2C (10), PPO bundles a policy term, a value-regression term, and an entropy bonus into one scalar to differentiate:

$$
\mathcal{L}^{\text{PPO}}(\theta, w) = -\underbrace{L^{\text{CLIP}}(\theta)}_{\text{policy term}} + c_V\, \underbrace{\mathbb{E}_t\!\left[(V_w(s_t) - \hat R_t)^2\right]}_{\text{value loss}} - c_H\, \underbrace{\mathbb{E}_t\!\left[H\bigl(\pi_\theta(\cdot\mid s_t)\bigr)\right]}_{\text{entropy bonus}}. \tag{5}
$$

The value loss regresses the critic onto bootstrapped returns $\hat R_t$, computed once at rollout time from the GAE estimator (Schulman et al., 2016) introduced in the previous chapter. The entropy term plays the same exploration-preserving role as in A2C. Common defaults are $c_V \in \{0.5, 1\}$ and $c_H \approx 0.0$ to $0.01$: the original paper used $c_V = 1$ for Atari (Schulman et al., 2017); modern frameworks (SB3, CleanRL) default to $0.5$ (Huang et al., 2022).
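As a rough illustration of how the three terms combine, the sketch below evaluates the scalar loss on a mini-batch held in plain Python lists. It assumes per-sample log-probabilities, value predictions, and entropies have already been computed elsewhere. The function name and argument layout are hypothetical, and a real implementation would compute the same expression on tensors so that autodiff can supply the gradient.

```python
import math

def ppo_loss(logp_new, logp_old, advantages, values, returns, entropies,
             eps=0.2, c_v=0.5, c_h=0.01):
    """Scalar PPO loss: -L_CLIP + c_v * value_loss - c_h * entropy."""
    n = len(advantages)

    # Policy term: mean of the per-sample clipped surrogate.
    policy_term = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                      # r_t(theta)
        clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
        policy_term += min(ratio * adv, clipped_ratio * adv)
    policy_term /= n

    # Value term: mean squared error between predictions and return targets.
    value_loss = sum((v - r) ** 2 for v, r in zip(values, returns)) / n

    # Entropy term: mean policy entropy over the mini-batch.
    entropy = sum(entropies) / n

    return -policy_term + c_v * value_loss - c_h * entropy

logp_old = [math.log(0.5), math.log(0.5)]
loss = ppo_loss(
    logp_new=logp_old,            # first step after a sync: ratio is 1
    logp_old=logp_old,
    advantages=[1.0, -1.0],
    values=[0.0, 0.0],
    returns=[1.0, -1.0],
    entropies=[0.5, 0.5],
)
print(loss)
```

Working with log-probabilities and exponentiating their difference is a common way to form the ratio, since it avoids dividing two small probabilities.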

The structural difference from A2C is not visible in (5). It is in how this loss is used.

The PPO Algorithm

A2C takes one gradient step per round of rollouts: collect $N \times n$ transitions, compute one set of advantages, take one gradient step, throw the data away. PPO instead does $K$ epochs of mini-batch SGD on the same rollout buffer before discarding it. The clip is precisely what makes that data reuse safe: without it, repeated SGD steps on the surrogate $L^{\text{CPI}}$ would happily push $r_t$ to extreme values.

Each PPO iteration:

  1. Roll out. Run the current policy $\pi_{\theta_\text{old}}$ in $N$ parallel environments for $T$ steps each. Store transitions $(s_t, a_t, \log\pi_{\theta_\text{old}}(a_t\mid s_t), r_{t+1}, s_{t+1})$ along with the value predictions $V_w(s_t)$.

  2. Compute advantages. Form GAE advantages $\hat A_t$ and value targets $\hat R_t = \hat A_t + V_w(s_t)$ for the whole buffer (Schulman et al., 2016). The buffer is now frozen: nothing in steps 3–4 will recompute these.

  3. Optimise. For $K$ epochs, shuffle the buffer into mini-batches and, for each mini-batch, take one gradient step on $\mathcal{L}^{\text{PPO}}(\theta, w)$. The ratio $r_t(\theta)$ is recomputed every step: the stored $\log\pi_{\theta_\text{old}}(a_t\mid s_t)$ that defines its denominator stays fixed; only the numerator depends on the current $\theta$.

  4. Sync. Set $\theta_\text{old} \leftarrow \theta$. Discard the buffer and return to step 1.
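Step 2 of the list above can be sketched for a single environment as follows. This is a minimal version assuming the standard backward GAE recursion, with `rewards[t]` holding $r_{t+1}$ and `dones[t]` marking that the episode ended after transition $t$. The function name and the `last_value` bootstrap argument are illustrative.

```python
def gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Return (advantages, value_targets) for one rollout of length T."""
    advantages = [0.0] * len(rewards)
    next_value = last_value     # V(s_T), used to bootstrap the final step
    running = 0.0               # discounted sum of future TD errors
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; no bootstrapping across an episode boundary.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
        next_value = values[t]
    # Value targets as defined in step 2: R_t = A_t + V(s_t).
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns

advantages, returns = gae(
    rewards=[1.0, 1.0],
    values=[0.0, 0.0],
    last_value=0.0,
    dones=[False, True],
    gamma=1.0,
    lam=1.0,
)
print(advantages, returns)
```

Both outputs are computed once per rollout and then held fixed for all $K$ epochs, which is what "frozen" means in step 2.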

Figure 3: The PPO loop. Each iteration produces one frozen rollout buffer (with GAE advantages), then runs $K$ epochs of mini-batch SGD on the same buffer (the dashed blue inner loop). Because $\pi_{\theta_\text{old}}$ is fixed for the duration of those epochs, the ratio $r_t(\theta)$ has a stable denominator and the clip in (4) measures exactly how far the policy has drifted on each $(s,a)$. Compare with the actor-critic flow: the new structural element is the inner $K$-epoch loop.

Typical hyperparameters from the original paper and the CleanRL reproductions (Schulman et al., 2017; Huang et al., 2022): $N \in [8, 128]$ environments, $T \in [128, 2048]$ steps per rollout, $K \in [3, 10]$ epochs, mini-batches of size 32 to 512, clip $\epsilon \approx 0.2$, GAE $\lambda \approx 0.95$, discount $\gamma \approx 0.99$.
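These numbers determine how many gradient steps each rollout buys. The sketch below generates the shuffled mini-batch index schedule for one PPO iteration. The specific values (8 environments, 128 steps, 4 epochs, mini-batches of 256) are one illustrative point inside the ranges above, and the generator name is hypothetical.

```python
import random

def minibatch_indices(buffer_size, minibatch_size, epochs, seed=0):
    """Yield index lists: a fresh shuffle of the whole buffer every epoch."""
    rng = random.Random(seed)
    indices = list(range(buffer_size))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, buffer_size, minibatch_size):
            yield indices[start:start + minibatch_size]

n_envs, n_steps, epochs, minibatch_size = 8, 128, 4, 256
buffer_size = n_envs * n_steps            # 1024 transitions per rollout

batches = list(minibatch_indices(buffer_size, minibatch_size, epochs))
print(f"{buffer_size} transitions -> {len(batches)} gradient steps")
```

With these settings every transition is used in four gradient steps before the buffer is discarded, compared with a single use in A2C.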

Why Clipping Works — A First-Order Trust Region

The clip is best understood as a first-order, separable approximation of TRPO’s KL constraint. TRPO’s trust region is a single KL ball in policy-distribution space, enforced by a constrained optimisation. PPO’s trust region is an axis-aligned box in ratio space, with one constraint per $(s, a)$, enforced by zeroing the gradient at the box boundary. Both bound how far the new policy can move from the old one before further gradient information is ignored; PPO trades the precision of an explicit KL ball for the simplicity of plain SGD.

Figure 4: A schematic of the three update styles in policy parameter space. Vanilla policy gradient (red) takes a step proportional to the gradient with no bound on distributional change and can leave the region where the linearised surrogate is trustworthy. TRPO (purple) caps the step at the boundary of an explicit KL ball $\mathbb E[\mathrm{KL}] \le \delta$ via a constrained second-order solve. PPO-Clip (blue) caps the step at the boundary of an axis-aligned ratio band $r_t \in [1-\epsilon, 1+\epsilon]$ via a first-order clip in the loss: no constrained solver, no Fisher matrix.

Empirically the trade is good. In the original experiments, PPO matches or exceeds TRPO on the standard MuJoCo continuous-control benchmarks and outperforms A2C on Atari, with a small fraction of TRPO’s implementation complexity (Schulman et al., 2017). The first-order formulation also adapts more naturally to architectures and training regimes where a Fisher-matrix approximation is awkward: recurrent policies (where the Fisher is defined over sequence distributions), very large parallel rollouts, and mixed-precision training.
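Because the clip bounds per-sample ratios rather than the KL divergence itself, it can be useful to monitor both during the inner epochs. The sketch below computes two diagnostics from stored and current log-probabilities: the fraction of samples outside the ratio band, and a sample-based estimate of $\mathrm{KL}(\pi_{\theta_\text{old}}\,\|\,\pi_\theta)$ using the non-negative estimator $(r - 1) - \log r$. Neither diagnostic is part of the algorithm as described in this chapter; the function name and the choice of estimator are assumptions made for illustration.

```python
import math

def ratio_diagnostics(logp_new, logp_old, eps=0.2):
    """Return (clip_fraction, approx_kl) for a batch of sampled actions."""
    log_ratios = [ln - lo for ln, lo in zip(logp_new, logp_old)]
    ratios = [math.exp(lr) for lr in log_ratios]
    n = len(ratios)
    # Share of samples whose ratio has left the band [1 - eps, 1 + eps].
    clip_fraction = sum(1 for r in ratios if abs(r - 1.0) > eps) / n
    # (r - 1) - log r is >= 0 per sample; under old-policy samples its mean
    # estimates KL(old || new).
    approx_kl = sum((r - 1.0) - lr for r, lr in zip(ratios, log_ratios)) / n
    return clip_fraction, approx_kl

logp_old = [math.log(0.5), math.log(0.5)]
logp_new = [math.log(0.75), math.log(0.5)]     # first action became 1.5x likelier

clip_fraction, approx_kl = ratio_diagnostics(logp_new, logp_old)
print(clip_fraction, approx_kl)
```

A clip fraction that climbs toward 1 or an estimated KL that keeps growing across epochs suggests that further passes over the same buffer are contributing little trustworthy signal.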

Practical Details That Matter

PPO’s pseudocode is short, but a faithful re-implementation needs a handful of conventions that are not in the paper but matter substantially in practice. Engstrom et al. (2020) and Andrychowicz et al. (2021) showed in independent large-scale studies that several of these “code-level optimisations” account for as much of PPO’s measured advantage over TRPO as the clipped objective itself does, and Huang et al. (2022) enumerate 37 of them in the CleanRL-aligned blog post that has become the de-facto reference. The most impactful, kept here at the level of pointers:

These are not pedagogical fluff. Engstrom et al. (2020) report that an “honest” PPO with the clipped objective but without the code-level optimisations performs comparably to vanilla TRPO; the gap PPO is famous for largely closes when the implementation tricks are equalised. The takeaway is the same one that runs through this course: in deep RL the algorithm and the engineering are inseparable, and any benchmark comparison that ignores the latter is suspect (Henderson et al., 2018).

Empirical Evidence

PPO’s reputation as the on-policy default rests on its robustness across very different domains with one near-universal hyperparameter setting. The CleanRL benchmarks (Huang et al., 2022) make this concrete: with a single PPO config (clip $0.2$, $\lambda = 0.95$, $K=4$ epochs, $T=128$ steps, learning rate $2.5\times10^{-4}$ annealed linearly), the same algorithm matches or beats published baselines on classic-control, MuJoCo continuous control, Atari pixel observations, and procedurally generated environments; see the per-environment learning curves at cleanRL’s WandB. That breadth is what made PPO the default backbone for downstream applications: the InstructGPT/RLHF pipeline used PPO with a KL penalty toward a fixed reference policy (Ouyang et al., 2022), an approach that remains common in production systems, even as newer methods such as GRPO and DPO have emerged.

Summary

PPO is the algorithm that made on-policy actor-critics reliable, sample-efficient, and easy to tune.

PPO is the conceptual end-point of this course’s tour through on-policy methods: it is what you reach for when sample efficiency comes second to robustness and ease of tuning, when you have a fast simulator and many parallel environments, or when you want a known-quantity algorithm for downstream work like LLM fine-tuning. Next we will focus on improving the sample efficiency using a learned model of the environment.

References
  1. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv Preprint arXiv:1707.06347.
  2. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., & Lowe, R. (2022). Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems (NeurIPS), 35, 27730–27744.
  3. Hugging Face. (2024). Hugging Face Deep Reinforcement Learning Course. Online course. https://huggingface.co/learn/deep-rl-course/en/unit0/introduction
  4. Kakade, S. M., & Langford, J. (2002). Approximately Optimal Approximate Reinforcement Learning. Proceedings of the 19th International Conference on Machine Learning (ICML), 267–274.
  5. Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., & Moritz, P. (2015). Trust Region Policy Optimization. Proceedings of the 32nd International Conference on Machine Learning (ICML), 37, 1889–1897.
  6. Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1506.02438
  7. Huang, S., Dossa, R. F. J., Raffin, A., Kanervisto, A., & Wang, W. (2022). The 37 Implementation Details of Proximal Policy Optimization. ICLR Blog Track. https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
  8. Huang, S., Dossa, R. F. J., Ye, C., Braga, J., Chakraborty, D., Mehta, K., & Raffin, A. (2022). CleanRL: High-Quality Single-File Implementations of Deep Reinforcement Learning Algorithms. Journal of Machine Learning Research, 23(274), 1–18.
  9. Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., & Madry, A. (2020). Implementation Matters in Deep RL: A Case Study on PPO and TRPO. Proceedings of the 8th International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=r1etN1rtPB
  10. Andrychowicz, M., Raichuk, A., Stańczyk, P., Orsini, M., Girgin, S., Marinier, R., Hussenot, L., Geist, M., Pietquin, O., Michalski, M., Gelly, S., & Bachem, O. (2021). What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study. Proceedings of the 9th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2006.05990
  11. Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2018). Deep Reinforcement Learning That Matters. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). https://doi.org/10.1609/aaai.v32i1.11694