Proximal Policy Optimization
A First-Order Trust Region for On-Policy Actor-Critics
The previous chapter closed with a forward pointer: A2C is on-policy, has no built-in safety net on how far one gradient step can move the policy, and so needs careful learning-rate tuning to keep training stable. This chapter takes that observation seriously and builds the algorithm that fixed it: Proximal Policy Optimization (PPO) Schulman et al., 2017. PPO replaces the implicit “small step in parameter space” of vanilla policy gradients with an explicit limit on policy change in distribution space, and does so with a clipped surrogate objective that costs nothing more than first-order stochastic gradient descent (SGD). The result is the on-policy default in deep RL: the algorithm behind robotic manipulation, video-game agents, and the Reinforcement Learning from Human Feedback (RLHF) stage of large language model training Ouyang et al., 2022; Hugging Face, 2024.
The Step-Size Problem in Policy Gradients
The actor-critic update (6) takes one SGD step on the policy parameters $\theta$ in the direction of the estimated policy gradient $\hat{A}_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)$. Whether that step is good depends not on its norm in parameter space, but on whether the resulting policy distribution $\pi_\theta$ is still close to the policy $\pi_{\theta_\text{old}}$ that generated the data. Two facts conspire to make this delicate:
- The gradient is computed under $\pi_{\theta_\text{old}}$. The local linearisation of the expected return (the surrogate that policy gradient SGD is implicitly maximising) is only trustworthy in a neighbourhood where $\pi_\theta$ has not yet moved much from $\pi_{\theta_\text{old}}$. Beyond that neighbourhood, the linearisation is just a guess Kakade & Langford, 2002.
- A single gradient step does not bound distributional change. A small parameter step can correspond to a large change in $\pi_\theta$: for example, a logit jumping from 1.0 to 4.0 collapses a near-uniform softmax to a near-deterministic one.
The consequence is the classic on-policy failure mode: one update pushes $\pi_\theta$ outside the region where the surrogate is trustworthy, the next batch of data is collected under that degraded policy, the gradient computed on that data points somewhere unhelpful, and recovery is slow or impossible. Vanilla on-policy methods cope with this only through small learning rates and careful schedules; A2C’s “tune per environment” workflow is exactly the symptom.
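The logit example can be checked numerically. The sketch below assumes a four-action softmax policy whose logits all start at 1.0 (the number of actions is an illustrative choice, not something fixed by the text) and moves a single logit to 4.0.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def kl_divergence(p, q):
    """KL(p || q) for two strictly positive categorical distributions."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Assumed setup: four actions with identical logits, i.e. a uniform policy.
old_logits = np.array([1.0, 1.0, 1.0, 1.0])
# Only one logit changes, from 1.0 to 4.0.
new_logits = np.array([4.0, 1.0, 1.0, 1.0])

pi_old = softmax(old_logits)
pi_new = softmax(new_logits)

print("pi_old:", np.round(pi_old, 3))
print("pi_new:", np.round(pi_new, 3))
print("KL(pi_old || pi_new):", round(kl_divergence(pi_old, pi_new), 3))
print("ratio pi_new / pi_old:", np.round(pi_new / pi_old, 3))
```

Under these assumptions the favoured action ends up with roughly 87% of the probability mass, while each of the other actions drops from 25% to under 5%: a change in one coordinate of the logits reshapes the whole distribution.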
Trust Regions and the Surrogate Objective
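Conservative policy iteration Kakade & Langford, 2002 motivates a different way of writing the policy-gradient objective. Instead of differentiating the expected return directly, define a surrogate that uses importance sampling to score a candidate policy $\pi_\theta$ on data collected by $\pi_{\theta_\text{old}}$:
$$L^{\text{CPI}}(\theta) = \hat{\mathbb{E}}_t\left[ r_t(\theta)\, \hat{A}_t \right] \tag{1}$$
where the probability ratio
$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} \tag{2}$$
equals 1 when $\theta = \theta_\text{old}$, so $L^{\text{CPI}}$ has the same gradient as the on-policy expected return at $\theta_\text{old}$ (and the same value up to an additive constant). Maximising $L^{\text{CPI}}$ over many steps without any bound, however, would happily push $r_t(\theta)$ to enormous values for actions with $\hat{A}_t > 0$: exactly the step-size problem above, now expressed in ratio form.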
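A small numerical sketch of the probability ratio and the unclipped surrogate may help. The function names and the three-transition batch are hypothetical. Policies are represented only by the log-probabilities they assign to the sampled actions, which is how implementations typically store them.

```python
import numpy as np

def probability_ratio(logp_new, logp_old):
    """Ratio pi_theta(a|s) / pi_theta_old(a|s), computed from log-probabilities."""
    return np.exp(logp_new - logp_old)

def unclipped_surrogate(logp_new, logp_old, advantages):
    """Sample estimate of the importance-sampled surrogate: mean of ratio * advantage."""
    return float(np.mean(probability_ratio(logp_new, logp_old) * advantages))

# Hypothetical batch of three transitions collected under the old policy.
logp_old = np.log(np.array([0.2, 0.5, 0.1]))
advantages = np.array([1.0, -0.5, 2.0])

# With theta equal to theta_old every ratio is 1, so the surrogate is mean(advantages).
print(unclipped_surrogate(logp_old, logp_old, advantages))

# A candidate policy that makes the third (high-advantage) action nine times likelier.
logp_new = np.log(np.array([0.2, 0.5, 0.9]))
print(probability_ratio(logp_new, logp_old))
print(unclipped_surrogate(logp_new, logp_old, advantages))
```

Nothing in the unclipped surrogate discourages the ninefold jump on the third transition. The larger the ratio on a positive-advantage sample, the larger the objective.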
To prevent the policy from changing too much with each update, it’s helpful to introduce a measure of how different two probability distributions are. The Kullback–Leibler (KL) divergence is commonly used for this purpose: it quantifies the “distance” between the old and new policies by comparing their output distributions on the same states.
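Trust Region Policy Optimization (TRPO) Schulman et al., 2015 incorporates this idea by maximizing the surrogate while enforcing an explicit constraint: the average KL divergence between the old policy $\pi_{\theta_\text{old}}$ and the new policy $\pi_\theta$ may not exceed a small radius $\delta$.
$$\max_\theta \; \hat{\mathbb{E}}_t\left[ r_t(\theta)\, \hat{A}_t \right] \quad \text{subject to} \quad \hat{\mathbb{E}}_t\left[ D_{\text{KL}}\!\left( \pi_{\theta_\text{old}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right] \le \delta \tag{3}$$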
The KL ball is the trust region: outside it, the linearised surrogate may no longer reflect the true objective, so the constraint forbids the optimiser from going there.
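For discrete action spaces the trust-region test is easy to state in code. The following is a sketch only: the helper names, the two-state example and the radius `delta = 0.01` are illustrative assumptions, and it checks the constraint rather than solving TRPO's constrained problem.

```python
import numpy as np

def mean_kl(probs_old, probs_new):
    """Average over states of KL(pi_old(.|s) || pi_new(.|s)).

    Both arrays have shape (n_states, n_actions) with strictly positive rows.
    """
    per_state = np.sum(probs_old * (np.log(probs_old) - np.log(probs_new)), axis=1)
    return float(per_state.mean())

def inside_trust_region(probs_old, probs_new, delta):
    """True if the candidate policy satisfies the average-KL constraint."""
    return mean_kl(probs_old, probs_new) <= delta

# Hypothetical action distributions in two sampled states.
probs_old = np.array([[0.50, 0.50], [0.90, 0.10]])
small_step = np.array([[0.52, 0.48], [0.88, 0.12]])
large_step = np.array([[0.95, 0.05], [0.30, 0.70]])
delta = 0.01  # assumed trust-region radius, for illustration only

print(mean_kl(probs_old, small_step), inside_trust_region(probs_old, small_step, delta))
print(mean_kl(probs_old, large_step), inside_trust_region(probs_old, large_step, delta))
```

A line search of the kind TRPO uses would shrink a proposed step until a check like `inside_trust_region` passes.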
PPO-Clip: The Central Idea
Enforcing (3) exactly requires heavy machinery: conjugate gradients, Fisher approximations, and a line search per update. Each update computes a natural policy gradient direction: not a naive step in Euclidean $\theta$-space but one weighted by the curvature of $\pi_\theta$ with respect to the parameters (usually implemented via matrix–vector products with the Fisher information matrix). The optimiser must then line search: shorten the attempted step repeatedly and re-evaluate the KL until the update truly lies inside the trust region. That leaves a practical hole: trust-region logic is invaluable, yet inner linear solves and KL checks woven around plain backprop clash with how neural-network code is ordinarily structured. PPO-Clip is motivated as a way to approximate the same pessimism toward large swings in $\pi_\theta$ using only an ordinary differentiable loss.
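PPO replaces TRPO’s hard KL constraint with a soft, first-order bound baked directly into the loss. The clipped surrogate is
$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[ \min\!\left( r_t(\theta)\, \hat{A}_t,\; \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_t \right) \right] \tag{4}$$
where $r_t(\theta)$ is the probability ratio, $\hat{A}_t$ is the advantage estimate, and $\epsilon$ is a small constant (typically 0.1–0.3; the original paper recommends 0.2 Schulman et al., 2017). Two pieces are at play:
- The $\operatorname{clip}(r_t, 1-\epsilon, 1+\epsilon)$ term prevents the ratio from contributing a gradient once it has moved outside the band $[1-\epsilon, 1+\epsilon]$, that is, once the new policy has already changed too much on this $(s_t, a_t)$.
- The pessimistic $\min$ takes the smaller of the unclipped and clipped surrogate values. Crucially, this means the clip only activates in the direction that improves the unclipped objective. Moves that worsen the unclipped objective stay unclipped, so the optimiser never gets to “save” a bad step by clipping it away.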
Figure 2: The PPO clipped surrogate $L^{\text{CLIP}}$ as a function of the probability ratio $r_t(\theta)$, for the two signs of the advantage $\hat{A}_t$. The trust band $[1-\epsilon, 1+\epsilon]$ is shaded blue; the orange dot marks the starting point where $r_t = 1$. The unclipped surrogate $r_t \hat{A}_t$ is the dashed grey line. The active branch of the $\min$ (the value the gradient sees) is the thick green line. The clip caps the gain only in the direction that would push the policy further from $\pi_{\theta_\text{old}}$: upward when $\hat{A}_t > 0$, downward when $\hat{A}_t < 0$. Adapted from Schulman et al. (2017).
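It is worth walking through the six cases behind Figure 2 explicitly, because the asymmetry is the entire trick:
| Case | $r_t$ vs band | Active branch | Effect |
|---|---|---|---|
| $\hat{A}_t > 0$, $1-\epsilon \le r_t \le 1+\epsilon$ | inside | $r_t \hat{A}_t$ | normal policy-gradient signal |
| $\hat{A}_t > 0$, $r_t > 1+\epsilon$ | already too high | $(1+\epsilon)\hat{A}_t$ (constant) | gradient is zero: stop pushing up |
| $\hat{A}_t > 0$, $r_t < 1-\epsilon$ | “safe” direction | $r_t \hat{A}_t$ | clip not active: moving back toward $r_t = 1$ |
| $\hat{A}_t < 0$, $1-\epsilon \le r_t \le 1+\epsilon$ | inside | $r_t \hat{A}_t$ | normal policy-gradient signal |
| $\hat{A}_t < 0$, $r_t < 1-\epsilon$ | already too low | $(1-\epsilon)\hat{A}_t$ (constant) | gradient is zero: stop pushing down |
| $\hat{A}_t < 0$, $r_t > 1+\epsilon$ | “safe” direction | $r_t \hat{A}_t$ | clip not active: moving back toward $r_t = 1$ |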
The pattern: when $\hat{A}_t > 0$ the action is “good” and the optimiser wants $r_t$ above 1, so the clip stops it from going above $1+\epsilon$; when $\hat{A}_t < 0$ the action is “bad” and the optimiser wants $r_t$ below 1, so the clip stops it from going below $1-\epsilon$. In the opposite direction the clip is silent: moving back toward $r_t = 1$ is always permitted, even when the ratio has wandered far. That asymmetry is what the $\min$ encodes.
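The case analysis can be verified numerically. This sketch evaluates the per-sample clipped objective and a finite-difference slope with respect to the ratio, using $\epsilon = 0.2$ and unit-magnitude advantages. The specific ratios 0.5, 1.0 and 1.5 are arbitrary illustrative points inside and outside the band.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

def surrogate_slope(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference derivative of the objective w.r.t. the ratio."""
    up = clipped_surrogate(ratio + h, advantage, eps)
    down = clipped_surrogate(ratio - h, advantage, eps)
    return (up - down) / (2.0 * h)

# (advantage, ratio) pairs: inside the band, beyond it, and on the "safe" side.
cases = [(+1.0, 1.0), (+1.0, 1.5), (+1.0, 0.5),
         (-1.0, 1.0), (-1.0, 0.5), (-1.0, 1.5)]

for adv, r in cases:
    value = clipped_surrogate(r, adv)
    slope = surrogate_slope(r, adv)
    print(f"A={adv:+.0f}  r={r:.1f}  objective={value:+.2f}  d/dr={slope:+.2f}")
```

The slope vanishes only in the two "already too far" cases. In the other four the objective still responds to the ratio, including when the ratio sits outside the band on the side that leads back toward 1.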
The Full PPO Loss
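Just like A2C (10), PPO bundles a policy term, a value-regression term, and an entropy bonus into one scalar to differentiate:
$$L^{\text{PPO}}(\theta) = \hat{\mathbb{E}}_t\left[ L^{\text{CLIP}}_t(\theta) - c_1 \left( V_\theta(s_t) - \hat{R}_t \right)^2 + c_2\, \mathcal{H}\!\left[\pi_\theta(\cdot \mid s_t)\right] \right] \tag{5}$$
Here $L^{\text{CLIP}}_t$ is the per-sample clipped surrogate from (4), and the whole expression is maximised (equivalently, its negative is minimised). The value loss regresses the critic $V_\theta$ onto bootstrapped returns $\hat{R}_t$, computed once at rollout time from the GAE estimator Schulman et al., 2016 introduced in the previous chapter. The entropy term $\mathcal{H}$ plays the same exploration-preserving role as in A2C. Common defaults are $c_1 = 0.5$ and $c_2 = 0$–0.01; the original paper used $c_1 = 1$ for Atari Schulman et al., 2017, while modern frameworks (SB3, CleanRL) default to 0.5 Huang et al., 2022.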
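As a rough sketch, the combined objective can be evaluated on a mini-batch as follows. The function name, argument layout and the example numbers are assumptions made for illustration. The value term uses a plain squared error, whereas some codebases include a factor of 1/2. A real implementation would compute this in an autodiff framework and minimise the negative.

```python
import numpy as np

def ppo_objective(logp_new, logp_old, advantages, values, returns, entropies,
                  eps=0.2, c1=0.5, c2=0.01):
    """Mini-batch estimate of the combined PPO objective (to be maximised).

    All array arguments are 1-D with one entry per transition.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    policy_term = np.minimum(unclipped, clipped).mean()  # clipped surrogate
    value_loss = np.mean((values - returns) ** 2)        # critic regression
    entropy_bonus = entropies.mean()                     # exploration bonus
    return float(policy_term - c1 * value_loss + c2 * entropy_bonus)

# Hypothetical mini-batch of three transitions.
logp_old = np.log(np.array([0.25, 0.50, 0.10]))
logp_new = np.log(np.array([0.30, 0.45, 0.20]))
advantages = np.array([1.0, -1.0, 2.0])
values = np.array([0.5, 0.0, 1.0])
returns = np.array([1.0, -0.5, 2.0])
entropies = np.array([1.0, 0.9, 1.1])

objective = ppo_objective(logp_new, logp_old, advantages, values, returns, entropies)
loss = -objective  # what a minimising optimiser would be given
print(objective, loss)
```

In this batch the third transition has a ratio of 2.0 with a positive advantage, so its contribution to the policy term is capped at $(1+\epsilon)\hat{A}_t$ rather than $2\hat{A}_t$.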
The structural difference from A2C is not visible in (5). It is in how this loss is used.
The PPO Algorithm
A2C takes one gradient step per round of rollouts: collect a batch of transitions, compute one set of advantages, take one gradient step, throw the data away. PPO instead does $K$ epochs of mini-batch SGD on the same rollout buffer before discarding it. The clip is precisely what makes that data reuse safe: without it, repeated SGD steps on the surrogate would happily push $r_t(\theta)$ to extreme values.
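Each PPO iteration:
1. Roll out. Run the current policy $\pi_{\theta_\text{old}}$ in $N$ parallel environments for $T$ steps each. Store the transitions (state $s_t$, action $a_t$, reward, and the log-probability $\log \pi_{\theta_\text{old}}(a_t \mid s_t)$) along with the value predictions $V_\theta(s_t)$.
2. Compute advantages. Form GAE advantages $\hat{A}_t$ and value targets $\hat{R}_t$ for the whole buffer Schulman et al., 2016. The buffer is now frozen: nothing in steps 3–4 will recompute these.
3. Optimise. For $K$ epochs, shuffle the buffer into mini-batches and, for each mini-batch, take one gradient step on $L^{\text{PPO}}$. The ratio $r_t(\theta)$ is recomputed every step: its denominator $\pi_{\theta_\text{old}}(a_t \mid s_t)$ stays fixed; only the numerator depends on the current $\theta$.
4. Sync. Set $\theta_\text{old} \leftarrow \theta$. Discard the buffer and return to step 1.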
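The data-reuse pattern in the optimise step can be sketched without any learning code. The generator below is a hypothetical helper: it reshuffles a frozen buffer once per epoch and cuts it into mini-batches. The sizes are small illustrative values, not recommended settings.

```python
import numpy as np

def minibatch_indices(buffer_size, n_epochs, batch_size, rng):
    """Yield (epoch, index_array) pairs over a frozen rollout buffer.

    Each epoch draws a fresh permutation, so every transition is used
    exactly once per epoch and n_epochs times in total.
    """
    for epoch in range(n_epochs):
        order = rng.permutation(buffer_size)
        for start in range(0, buffer_size, batch_size):
            yield epoch, order[start:start + batch_size]

rng = np.random.default_rng(0)
n_envs, n_steps, n_epochs, batch_size = 4, 8, 3, 16  # illustrative sizes only
buffer_size = n_envs * n_steps

visits = np.zeros(buffer_size, dtype=int)
n_updates = 0
for epoch, idx in minibatch_indices(buffer_size, n_epochs, batch_size, rng):
    # A real implementation would, at this point, re-evaluate the current
    # policy on the transitions in idx, form the ratio against the stored
    # old log-probabilities, and take one gradient step on the PPO loss.
    visits[idx] += 1
    n_updates += 1

print(n_updates, visits.min(), visits.max())
```

With these sizes the loop performs six gradient steps on 32 transitions, where a one-step-per-rollout method would perform one.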
Figure 3: The PPO loop. Each iteration produces one frozen rollout buffer (with GAE advantages), then runs $K$ epochs of mini-batch SGD on the same buffer (the dashed blue inner loop). Because $\pi_{\theta_\text{old}}$ is fixed for the duration of those epochs, the ratio $r_t(\theta)$ has a stable denominator and the clip in (4) measures exactly how far the policy has drifted on each $(s_t, a_t)$. Compare with the actor-critic flow: the new structural element is the inner $K$-epoch loop.
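Typical hyperparameters from the original paper and the CleanRL reproductions Schulman et al., 2017; Huang et al., 2022 cover the number of parallel environments $N$, the steps per rollout $T$, the number of epochs $K$, mini-batches of size 32–512, clip $\epsilon = 0.2$, GAE $\lambda = 0.95$, and discount $\gamma = 0.99$.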
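The advantage step of the loop (step 2) can be sketched as a single backward pass over one environment's rollout. The function name and the `dones` convention (1.0 when the episode ended after step `t`) are assumptions for this example. The default `gamma` and `lam` are common choices rather than values prescribed here.

```python
import numpy as np

def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Generalised advantage estimates and value targets for one rollout.

    rewards, values, dones: 1-D arrays of length T for a single environment.
    dones[t] is 1.0 if the episode terminated after step t, else 0.0.
    last_value: critic estimate for the state following the final step.
    """
    n_steps = len(rewards)
    advantages = np.zeros(n_steps)
    gae = 0.0
    for t in reversed(range(n_steps)):
        next_value = last_value if t == n_steps - 1 else values[t + 1]
        not_done = 1.0 - dones[t]
        # One-step TD error; bootstrapping is cut at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future TD errors.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    returns = advantages + values  # regression targets for the critic
    return advantages, returns

# Hypothetical three-step rollout with no episode boundary.
rewards = np.array([1.0, 0.0, 1.0])
values = np.array([0.5, 0.4, 0.6])
dones = np.zeros(3)
advantages, returns = compute_gae(rewards, values, last_value=0.3, dones=dones)
print(advantages, returns)
```

Both outputs are computed once per rollout and then held fixed while the inner epochs run.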
Why Clipping Works: A First-Order Trust Region
The clip is best understood as a first-order, separable approximation of TRPO’s KL constraint. TRPO’s trust region is a single KL ball in policy-distribution space, enforced by a constrained optimisation. PPO’s trust region is an axis-aligned box in ratio space, with one constraint per $(s_t, a_t)$ sample, enforced by zeroing the gradient at the box boundary. Both limit how far the new policy can move from the old one before further gradient information is ignored; PPO trades the precision of an explicit KL ball for the simplicity of plain SGD.
Figure 4: A schematic of the three update styles in policy parameter space. Vanilla policy gradient (red) takes a step proportional to the gradient with no bound on distributional change and can leave the region where the linearised surrogate is trustworthy. TRPO (purple) caps the step at the boundary of an explicit KL ball via a constrained second-order solve. PPO-Clip (blue) caps the step at the boundary of an axis-aligned ratio band via a first-order clip in the loss, with no constrained solver and no Fisher matrix.
Empirically the trade is good. Across the standard MuJoCo and Atari benchmarks, PPO matches or exceeds TRPO’s performance with a small fraction of the implementation complexity Schulman et al., 2017. The first-order formulation also adapts more naturally to architectures and training regimes where a Fisher-matrix approximation is awkward: recurrent policies (where the Fisher is defined over sequence distributions), very large parallel rollouts, and mixed-precision training.
Practical Details That Matter
PPO’s pseudocode is short, but a faithful re-implementation needs a handful of conventions that are not in the paper but matter substantially in practice. Engstrom et al. (2020) and Andrychowicz et al. (2021) showed in independent large-scale studies that several of these “code-level optimisations” account for as much of PPO’s measured advantage over TRPO as the clipped objective itself does, and Huang et al. (2022) enumerate 37 of them in the CleanRL-aligned blog post that has become the de-facto reference. The most impactful, kept here at the level of pointers:
- Advantage normalisation. Standardise $\hat{A}_t$ to mean 0, std 1 per mini-batch (not per full buffer) before forming the loss. Without it, the policy-loss scale drifts wildly across iterations and the chosen learning rate stops being meaningful. Normalising per buffer instead of per mini-batch introduces a mild data-leakage issue (the normalisation statistics see the whole buffer before any gradient step), and in practice per-mini-batch is what widely used implementations such as SB3 and CleanRL do.
- Value-loss clipping. A symmetric clip on the change in value predictions mirrors the policy clip and stabilises the critic when the value head has high capacity. Concretely, the clipped value prediction is $V^{\text{clip}}_\theta(s_t) = V_{\theta_\text{old}}(s_t) + \operatorname{clip}\!\left(V_\theta(s_t) - V_{\theta_\text{old}}(s_t),\, -\epsilon,\, \epsilon\right)$, and the value loss uses $\max\!\left[ \left(V_\theta(s_t) - \hat{R}_t\right)^2,\; \left(V^{\text{clip}}_\theta(s_t) - \hat{R}_t\right)^2 \right]$. Originally from the OpenAI Baselines codebase[1]; documented as detail #9 in Huang et al. (2022).
- Orthogonal initialisation, small final-layer scale. Orthogonal weights with a small gain on the final policy head keep $\pi_\theta$ near uniform at initialisation, so early gradients are well-behaved.
- Reward and observation normalisation. Running statistics of returns (and sometimes observations) are used to scale rewards before training. The effect is essentially a learning-rate normalisation across environments with different reward magnitudes.
- Linear learning-rate annealing to zero over training, and a small entropy coefficient $c_2$ that decays to zero late in training.
- Generalised advantage estimation with $\lambda = 0.95$: the same estimator we already saw in actor-critic, but now used by virtually every PPO implementation.
- KL early-stopping (optional). Some implementations add a guard: if the mean KL between $\pi_{\theta_\text{old}}$ and $\pi_\theta$ exceeds a chosen threshold at any point during the $K$ epochs, the inner loop is terminated early. This makes PPO robust to cases where the clip alone is not sufficient (e.g., very high-advantage transitions with a large mini-batch).
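Three of the details above are small enough to sketch directly. The helper names are hypothetical, and the sample-based KL estimate (the mean of $(r - 1) - \log r$ over actions drawn from the old policy) is one common choice among several. It is assumed here rather than taken from the text.

```python
import numpy as np

def normalise_advantages(advantages, tiny=1e-8):
    """Standardise a mini-batch of advantages to mean 0 and std 1."""
    return (advantages - advantages.mean()) / (advantages.std() + tiny)

def clipped_value_loss(v_new, v_old, returns, eps=0.2):
    """Pessimistic value loss: the larger of the clipped and unclipped squared errors."""
    v_clipped = v_old + np.clip(v_new - v_old, -eps, eps)
    unclipped_err = (v_new - returns) ** 2
    clipped_err = (v_clipped - returns) ** 2
    return float(np.mean(np.maximum(unclipped_err, clipped_err)))

def approx_kl(logp_new, logp_old):
    """Non-negative sample estimate of KL(pi_old || pi_new) from old-policy samples."""
    log_ratio = logp_new - logp_old
    return float(np.mean(np.exp(log_ratio) - 1.0 - log_ratio))

# Hypothetical mini-batch.
advantages = np.array([2.0, -1.0, 0.5, 4.5])
print(normalise_advantages(advantages))

# The critic jumps from 0.0 straight onto the target 1.0; the clipped branch
# only credits a move of eps, so the loss stays positive.
print(clipped_value_loss(np.array([1.0]), np.array([0.0]), np.array([1.0])))

logp_old = np.log(np.array([0.25, 0.50, 0.10]))
logp_new = np.log(np.array([0.30, 0.45, 0.20]))
kl_threshold = 0.015  # assumed value, for illustration only
print(approx_kl(logp_new, logp_old), approx_kl(logp_new, logp_old) > kl_threshold)
```

In a training loop, a check like the last line would typically run after each mini-batch or epoch and break out of the inner loop when it fires.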
These are not pedagogical fluff. Engstrom et al. (2020) report that an “honest” PPO with the clipped objective but without the code-level optimisations performs comparably to vanilla TRPO; the gap PPO is famous for largely closes when the implementation tricks are equalised. The takeaway is the same one that runs through this course: in deep RL the algorithm and the engineering are inseparable, and any benchmark comparison that ignores the latter is suspect Henderson et al., 2018.
Empirical Evidence
PPO’s reputation as the on-policy default rests on its robustness across very different domains with one near-universal hyperparameter setting. The CleanRL benchmarks Huang et al., 2022 make this concrete: with a single PPO config (clip $\epsilon = 0.2$, GAE $\lambda = 0.95$, a fixed number of epochs $K$ and rollout steps $T$, learning rate annealed linearly), the same algorithm matches or beats published baselines on classic-control, MuJoCo continuous control, Atari pixel observations, and procedurally generated environments; see the per-environment learning curves on CleanRL’s WandB page. That breadth is what made PPO the default backbone for downstream applications: the InstructGPT/RLHF pipeline used PPO with a per-token KL penalty against a fixed reference policy Ouyang et al., 2022, an approach that remains common in production systems, even as newer methods such as GRPO and DPO have emerged.
Summary
PPO is the algorithm that made on-policy actor-critics reliable, easy to tune, and more sample-efficient than one-step-per-rollout methods such as A2C.
- The step-size problem of vanilla policy gradients Kakade & Langford, 2002 is that a small parameter step can mean a large distributional step, and once $\pi_\theta$ leaves the trustworthy neighbourhood of $\pi_{\theta_\text{old}}$ the surrogate stops being meaningful.
- TRPO Schulman et al., 2015 solves this with a hard KL constraint and a constrained second-order step. It works, but the implementation is heavy.
- PPO-Clip Schulman et al., 2017 replaces the KL constraint with the clipped surrogate $L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\!\left(r_t \hat{A}_t,\; \operatorname{clip}(r_t, 1-\epsilon, 1+\epsilon)\, \hat{A}_t\right)\right]$. The pessimistic $\min$ ensures the clip only fires in the direction that would push the policy further from $\pi_{\theta_\text{old}}$, giving a per-$(s_t, a_t)$ first-order trust region that costs nothing more than plain SGD.
- The PPO algorithm reuses the same on-policy buffer for $K$ epochs of mini-batch SGD, with GAE Schulman et al., 2016 as the advantage estimator. The clip is what makes that reuse safe.
- A handful of implementation details (advantage normalisation, value-loss clipping, orthogonal init, reward scaling, LR annealing) explain a substantial part of PPO’s measured advantage in benchmark papers Engstrom et al., 2020; Andrychowicz et al., 2021; Huang et al., 2022. Algorithm and engineering travel together.
PPO is the conceptual end-point of this course’s tour through on-policy methods: it is what you reach for when sample efficiency comes second to robustness and ease of tuning, when you have a fast simulator and many parallel environments, or when you want a known-quantity algorithm for downstream work like LLM fine-tuning. Next we will focus on improving the sample efficiency using a learned model of the environment.
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv Preprint arXiv:1707.06347.
- Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., & Lowe, R. (2022). Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems (NeurIPS), 35, 27730–27744.
- Hugging Face. (2024). Hugging Face Deep Reinforcement Learning Course. Online course. https://huggingface.co/learn/deep-rl-course/en/unit0/introduction
- Kakade, S. M., & Langford, J. (2002). Approximately Optimal Approximate Reinforcement Learning. Proceedings of the 19th International Conference on Machine Learning (ICML), 267–274.
- Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., & Moritz, P. (2015). Trust Region Policy Optimization. Proceedings of the 32nd International Conference on Machine Learning (ICML), 37, 1889–1897.
- Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1506.02438
- Huang, S., Dossa, R. F. J., Raffin, A., Kanervisto, A., & Wang, W. (2022). The 37 Implementation Details of Proximal Policy Optimization. ICLR Blog Track. https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
- Huang, S., Dossa, R. F. J., Ye, C., Braga, J., Chakraborty, D., Mehta, K., & Raffin, A. (2022). CleanRL: High-Quality Single-File Implementations of Deep Reinforcement Learning Algorithms. Journal of Machine Learning Research, 23(274), 1–18.
- Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., & Madry, A. (2020). Implementation Matters in Deep RL: A Case Study on PPO and TRPO. Proceedings of the 8th International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=r1etN1rtPB
- Andrychowicz, M., Raichuk, A., Stańczyk, P., Orsini, M., Girgin, S., Marinier, R., Hussenot, L., Geist, M., Pietquin, O., Michalski, M., Gelly, S., & Bachem, O. (2021). What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study. Proceedings of the 9th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2006.05990
- Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2018). Deep Reinforcement Learning That Matters. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). https://doi.org/10.1609/aaai.v32i1.11694