Proximal Policy Optimization - Introduction to Deep Reinforcement Learning

The previous chapter closed with a forward pointer: A2C is on-policy, has no built-in safety net on how far one gradient step can move the policy, and so it needs careful learning-rate tuning to keep training stable. This chapter takes that observation seriously and builds the algorithm that fixed it: Proximal Policy Optimization (PPO) (Schulman et al., 2017). PPO replaces the implicit “small step in parameter space” of vanilla policy gradients with an explicit bound on policy change in distribution space, and does so with a clipped surrogate objective that costs nothing more than first-order stochastic gradient descent (SGD). The result is the on-policy default in deep RL: the algorithm behind robotic manipulation, video-game agents, and the Reinforcement Learning from Human Feedback (RLHF) stage of large language model training (Ouyang et al., 2022; Hugging Face, 2024).


The Step-Size Problem in Policy Gradients

The actor-critic update (6) takes one SGD step on the policy parameters $\theta$ in the direction $\nabla_\theta \log\pi_\theta(a_t\mid s_t)\, \hat A_t$ . Whether that step is good depends not on its norm in parameter space, but on whether the resulting policy distribution $\pi_{\theta+\Delta\theta}$ is still close to the policy $\pi_\theta$ that generated the data. Two facts conspire to make this delicate:

The gradient is computed under $\pi_{\theta_\text{old}}$. The local linearisation of the expected return, which is the surrogate that policy gradient SGD is implicitly maximising, is only trustworthy in a neighbourhood where $\pi_\theta$ has not yet moved much from $\pi_{\theta_\text{old}}$. Beyond that neighbourhood, the linearisation is just a guess (Kakade & Langford, 2002).
A single gradient step does not bound distributional change. A small parameter step can correspond to a large change in $\pi_\theta(\cdot\mid s)$ — for example, a logit jumping from 1.0 to 4.0 collapses a near-uniform softmax to a near-deterministic one.
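The logit example can be checked numerically. The sketch below assumes a hypothetical 4-action policy head whose logits all start at 1.0 (a uniform policy) and changes a single logit to 4.0; the helper names `softmax` and `kl` are illustrative, not taken from any library.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D vector of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL(p || q) for two categorical distributions with full support."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Assumed starting point: four equal logits, so the policy is uniform.
logits_old = np.array([1.0, 1.0, 1.0, 1.0])
# Change one logit from 1.0 to 4.0, as in the text.
logits_new = np.array([4.0, 1.0, 1.0, 1.0])

pi_old = softmax(logits_old)
pi_new = softmax(logits_new)

print("old policy:", np.round(pi_old, 3))
print("new policy:", np.round(pi_new, 3))
print("KL(old || new):", round(kl(pi_old, pi_new), 3))
```

Under these assumptions the probability of the first action rises from 0.25 to roughly 0.87, even though only one parameter changed: the distance that matters is the one between distributions, not the one between parameter vectors.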

The consequence is the classic on-policy failure mode: one update pushes $\pi_\theta$ outside the region where the surrogate is trustworthy, the next batch of data is collected under that degraded policy, the gradient computed on that data points somewhere unhelpful, and recovery is slow or impossible. Vanilla on-policy methods cope with this only through small learning rates and careful schedules — A2C’s “tune $\alpha$ per environment” workflow is exactly the symptom.

Trust Regions and the Surrogate Objective

Conservative policy iteration (Kakade & Langford, 2002) motivates a different way of writing the policy-gradient objective. Instead of differentiating the expected return directly, define a surrogate that uses importance sampling to score a candidate policy $\pi_\theta$ on data collected by $\pi_{\theta_\text{old}}$:

L^{\text{CPI}}(\theta) = \mathbb{E}_{(s,a)\sim\pi_{\theta_\text{old}}}\!\left[ \frac{\pi_\theta(a\mid s)}{\pi_{\theta_\text{old}}(a\mid s)}\, \hat A_t \right] = \mathbb{E}_t\!\left[r_t(\theta)\, \hat A_t\right],

(1)

where the probability ratio

r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_\text{old}}(a_t\mid s_t)}

(2)

equals 1 when $\theta = \theta_\text{old}$, so, up to an additive constant that does not depend on $\theta$, $L^{\text{CPI}}$ matches the on-policy expected return in both value and gradient at $\theta_\text{old}$. Maximising $L^{\text{CPI}}$ over many steps without any bound, however, would happily push $r_t(\theta)$ to enormous values for actions with $\hat A_t > 0$. This is exactly the step-size problem above, now expressed in ratio form.
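A minimal sketch of (1) and (2), assuming the log-probabilities of the sampled actions are already available as arrays. The function names and the three-transition batch are made up for illustration.

```python
import numpy as np

def probability_ratio(logp_new, logp_old):
    """r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), computed in log space."""
    return np.exp(logp_new - logp_old)

def cpi_surrogate(logp_new, logp_old, advantages):
    """Monte Carlo estimate of L^CPI = E_t[r_t * A_t]."""
    return float(np.mean(probability_ratio(logp_new, logp_old) * advantages))

# Made-up batch of three transitions, each from a different state.
logp_old = np.log(np.array([0.20, 0.50, 0.10]))
advantages = np.array([1.0, -0.5, 2.0])

# At theta = theta_old every ratio is 1, so L^CPI is just the mean advantage.
at_old = cpi_surrogate(logp_old, logp_old, advantages)

# An unconstrained optimiser can keep raising the probability of the
# positive-advantage actions; nothing in L^CPI penalises the growing ratios.
logp_new = np.log(np.array([0.60, 0.10, 0.90]))
ratios = probability_ratio(logp_new, logp_old)
after = cpi_surrogate(logp_new, logp_old, advantages)

print("ratios:", np.round(ratios, 2))
print("L^CPI at theta_old:", round(at_old, 3), "after the move:", round(after, 3))
```

Working in log space (subtracting log-probabilities, then exponentiating) is a common choice because it avoids dividing two very small probabilities.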

To prevent the policy from changing too much with each update, it’s helpful to introduce a measure of how different two probability distributions are. The Kullback–Leibler (KL) divergence is commonly used for this purpose: it quantifies the “distance” between the old and new policies by comparing their output distributions on the same states.

Trust Region Policy Optimization (TRPO) (Schulman et al., 2015) incorporates this idea by maximizing $L^{\text{CPI}}$ while enforcing an explicit constraint on the average KL divergence between the old policy $\pi_{\theta_\text{old}}$ and the new policy $\pi_\theta$:

\max_{\theta}\; L^{\text{CPI}}(\theta) \quad \text{subject to} \quad \mathbb{E}_{s\sim\pi_{\theta_\text{old}}}\!\left[\mathrm{KL}\bigl(\pi_{\theta_\text{old}}(\cdot\mid s)\,\|\,\pi_\theta(\cdot\mid s)\bigr)\right] \le \delta.

(3)

The KL ball $\{\theta : \mathbb{E}[\mathrm{KL}] \le \delta\}$ is the trust region: outside it, the linearised surrogate may no longer reflect the true objective, so the constraint forbids the optimiser from going there.
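To make the constraint in (3) concrete, the sketch below evaluates the average KL for a toy tabular policy with two states and two actions and checks it against a threshold. The policies, the value of `delta`, and the function names are all illustrative assumptions; TRPO itself estimates this quantity from sampled states rather than from a full table.

```python
import numpy as np

def mean_kl(pi_old, pi_new):
    """Average over states of KL(pi_old(.|s) || pi_new(.|s)).

    Both arrays have shape (num_states, num_actions) and each row sums to 1.
    """
    per_state = np.sum(pi_old * (np.log(pi_old) - np.log(pi_new)), axis=1)
    return float(np.mean(per_state))

def inside_trust_region(pi_old, pi_new, delta):
    """True if the candidate policy satisfies the constraint in (3)."""
    return bool(mean_kl(pi_old, pi_new) <= delta)

pi_old = np.array([[0.50, 0.50],
                   [0.90, 0.10]])
small_step = np.array([[0.52, 0.48],
                       [0.88, 0.12]])
large_step = np.array([[0.95, 0.05],
                       [0.30, 0.70]])
delta = 0.01  # assumed threshold, chosen only for this example

for name, candidate in [("small step", small_step), ("large step", large_step)]:
    print(name, "mean KL =", round(mean_kl(pi_old, candidate), 4),
          "inside:", inside_trust_region(pi_old, candidate, delta))
```

The argument order matters: the KL in (3) takes its expectation under the old policy, which is why `pi_old` appears as the weighting distribution.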

PPO-Clip: The Central Idea

Enforcing (3) exactly requires heavy machinery: conjugate gradients, Fisher approximations, and a line search per update. Each update computes a natural policy gradient direction: not a naive step in Euclidean $\theta$-space but one weighted by the curvature of $\log\pi_\theta$ with respect to the parameters (usually implemented via matrix–vector products with the Fisher information matrix). The optimiser must then run a line search: shorten the attempted step repeatedly and re-evaluate the KL until the update truly lies inside the trust region. That leaves a practical hole: trust-region logic is invaluable, yet inner linear solves and KL checks woven around plain backprop clash with how neural-network code is ordinarily structured. PPO-Clip is motivated as a way to approximate the same pessimism toward large swings in $\pi_\theta$ using only an ordinary differentiable loss.

PPO replaces TRPO’s hard KL constraint with a soft, first-order bound baked directly into the loss. The clipped surrogate is

L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\!\Bigl(\, r_t(\theta)\, \hat A_t,\; \text{clip}\bigl(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\bigr)\,\hat A_t \Bigr) \right],

(4)

where $\epsilon$ is a small constant (typically 0.1–0.3; Schulman et al. (2017) recommend 0.2). Two pieces are at play:

The clip, $\text{clip}(r, 1-\epsilon, 1+\epsilon)$, prevents the ratio from contributing a gradient if it has already moved outside the band $[1-\epsilon,\, 1+\epsilon]$, that is, if the new policy has already changed too much on this $(s,a)$.
The pessimistic min takes the smaller of the unclipped and clipped surrogate values. Crucially, this means the clip only activates in the direction that improves the unclipped objective. Moves that worsen the unclipped objective stay unclipped, so the optimiser never gets to “save” a bad step by clipping it away.

Figure 2: The PPO clipped surrogate $L^{\text{CLIP}}$ as a function of the probability ratio $r_t(\theta)$, for the two signs of the advantage. The trust band $r_t \in [1-\epsilon, 1+\epsilon]$ is shaded blue; the orange dot marks the starting point $r_t = 1$ where $\theta = \theta_\text{old}$. The unclipped surrogate $r_t \hat A_t$ is the dashed grey line. The active branch of the $\min$ (the value the gradient sees) is the thick green line. The clip *caps the gain* only in the direction that would push the policy further from $\pi_{\theta_\text{old}}$: upward when $\hat A_t > 0$, downward when $\hat A_t < 0$. Adapted from Schulman *et al.* (2017).

It is worth walking through the six cases in Figure 2 explicitly (three positions of $r_t$ for each sign of the advantage), because the asymmetry is the entire trick:

| Case | $r_t$ vs band | Active branch | Effect |
|---|---|---|---|
| $\hat A_t > 0$, $r_t \in [1-\epsilon, 1+\epsilon]$ | inside | $r_t \hat A_t$ | normal policy-gradient signal |
| $\hat A_t > 0$, $r_t > 1+\epsilon$ | already too high | $(1+\epsilon)\hat A_t$ (constant) | gradient is zero: stop pushing up |
| $\hat A_t > 0$, $r_t < 1-\epsilon$ | “safe” direction | $r_t \hat A_t$ | clip not active: moving back toward $r_t=1$ |
| $\hat A_t < 0$, $r_t \in [1-\epsilon, 1+\epsilon]$ | inside | $r_t \hat A_t$ | normal policy-gradient signal |
| $\hat A_t < 0$, $r_t < 1-\epsilon$ | already too low | $(1-\epsilon)\hat A_t$ (constant) | gradient is zero: stop pushing down |
| $\hat A_t < 0$, $r_t > 1+\epsilon$ | “safe” direction | $r_t \hat A_t$ | clip not active: moving back toward $r_t=1$ |

The pattern: when $\hat A_t > 0$ the action is “good” and the optimiser wants $r_t$ above 1, so the clip stops it from going above $1+\epsilon$ ; when $\hat A_t < 0$ the action is “bad” and the optimiser wants $r_t$ below 1, so the clip stops it from going below $1-\epsilon$ . In the opposite direction the clip is silent — moving back toward $r_t = 1$ is always permitted, even when the ratio has wandered far. That asymmetry is what the $\min$ encodes.
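The case analysis can be reproduced numerically. The sketch below evaluates the per-sample objective inside the expectation of (4) and estimates its slope with respect to $r_t$ by finite differences; $\epsilon = 0.2$, unit-magnitude advantages, and the probe ratios 0.5, 1.0 and 1.5 are arbitrary illustrative choices, and the function names are not from any library.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

def slope_wrt_ratio(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference slope of the objective with respect to r."""
    up = clipped_surrogate(ratio + h, advantage, eps)
    down = clipped_surrogate(ratio - h, advantage, eps)
    return (up - down) / (2.0 * h)

# (advantage, ratio) pairs covering both signs of A and three positions of r.
cases = [
    (+1.0, 1.0),  # A > 0, inside the band
    (+1.0, 1.5),  # A > 0, r above 1 + eps: clipped, zero slope
    (+1.0, 0.5),  # A > 0, r below 1 - eps: unclipped, slope A
    (-1.0, 1.0),  # A < 0, inside the band
    (-1.0, 0.5),  # A < 0, r below 1 - eps: clipped, zero slope
    (-1.0, 1.5),  # A < 0, r above 1 + eps: unclipped, slope A
]
for adv, r in cases:
    value = float(clipped_surrogate(r, adv))
    slope = float(slope_wrt_ratio(r, adv))
    print(f"A={adv:+.1f}  r={r:.1f}  objective={value:+.2f}  slope={slope:+.2f}")
```

A zero slope means the sample contributes no gradient, so gradient ascent on this objective stops pushing that ratio any further in the clipped direction.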

The Full PPO Loss

Just like A2C (10), PPO bundles a policy term, a value-regression term, and an entropy bonus into one scalar to differentiate:

\mathcal{L}^{\text{PPO}}(\theta, w) = -\underbrace{L^{\text{CLIP}}(\theta)}_{\text{policy term}} + c_V\, \underbrace{\mathbb{E}_t\!\left[(V_w(s_t) - \hat R_t)^2\right]}_{\text{value loss}} - c_H\, \underbrace{\mathbb{E}_t\!\left[H\bigl(\pi_\theta(\cdot\mid s_t)\bigr)\right]}_{\text{entropy bonus}}.

(5)

The value loss regresses the critic onto bootstrapped returns $\hat R_t$, computed once at rollout time from the GAE estimator (Schulman et al., 2016) introduced in the previous chapter. The entropy term plays the same exploration-preserving role as in A2C. Common defaults are $c_V \in \{0.5, 1\}$ and $c_H$ between 0 and 0.01: the original paper used $c_V = 1$ for Atari (Schulman et al., 2017), while modern frameworks (SB3, CleanRL) default to 0.5 (Huang et al., 2022).
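As a rough sketch of (5), the function below combines the three terms for one mini-batch using plain NumPy arrays. It computes the scalar only; in practice the same expression would be written in an autodiff framework so that it can be differentiated. The argument names, the default coefficients, and the two-sample batch are assumptions made for illustration.

```python
import numpy as np

def ppo_loss(logp_new, logp_old, advantages, values, returns, entropy,
             eps=0.2, c_v=0.5, c_h=0.01):
    """Scalar PPO loss: -L^CLIP + c_V * value loss - c_H * entropy bonus."""
    ratio = np.exp(logp_new - logp_old)
    surrogate = np.minimum(
        ratio * advantages,
        np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages,
    )
    policy_term = float(np.mean(surrogate))            # L^CLIP
    value_loss = float(np.mean((values - returns) ** 2))
    entropy_bonus = float(np.mean(entropy))
    total = -policy_term + c_v * value_loss - c_h * entropy_bonus
    parts = {"policy_term": policy_term,
             "value_loss": value_loss,
             "entropy_bonus": entropy_bonus}
    return total, parts

# Made-up mini-batch of two transitions.
logp_old = np.log(np.array([0.50, 0.25]))
logp_new = np.log(np.array([0.75, 0.25]))   # ratios 1.5 and 1.0
advantages = np.array([1.0, -1.0])
values = np.array([1.0, 0.0])                # critic predictions V_w(s_t)
returns = np.array([2.0, 0.0])               # value targets R_t
entropy = np.array([0.6, 0.6])               # per-state policy entropy

total, parts = ppo_loss(logp_new, logp_old, advantages, values, returns, entropy)
print(parts, "total:", round(total, 4))
```

The signs follow (5): the policy term and the entropy bonus are quantities to maximise, so they enter the minimised loss with a minus sign, while the value loss enters with a plus sign.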

The structural difference from A2C is not visible in (5). It is in how this loss is used.

The PPO Algorithm

A2C takes one gradient step per round of rollouts: collect $N \times n$ transitions, compute one set of advantages, take one gradient step, throw the data away. PPO instead does $K$ epochs of mini-batch SGD on the same rollout buffer before discarding it. The clip is precisely what makes that data reuse safe — without it, repeated SGD steps on the surrogate $L^{\text{CPI}}$ would happily push $r_t$ to extreme values.

Each PPO iteration:

1. Roll out. Run the current policy $\pi_{\theta_\text{old}}$ in $N$ parallel environments for $T$ steps each. Store transitions $(s_t, a_t, \log\pi_{\theta_\text{old}}(a_t\mid s_t), r_{t+1}, s_{t+1})$ along with the value predictions $V_w(s_t)$.
2. Compute advantages. Form GAE advantages $\hat A_t$ and value targets $\hat R_t = \hat A_t + V_w(s_t)$ for the whole buffer (Schulman et al., 2016). The buffer is now frozen: nothing in steps 3–4 will recompute these.
3. Optimise. For $K$ epochs, shuffle the buffer into mini-batches and, for each mini-batch, take one gradient step on $\mathcal{L}^{\text{PPO}}(\theta, w)$. The ratio $r_t(\theta)$ is recomputed every step: its denominator $\pi_{\theta_\text{old}}(a_t\mid s_t)$, stored in the buffer as a log-probability, stays fixed; only the numerator depends on the current $\theta$.
4. Sync. Set $\theta_\text{old} \leftarrow \theta$. Discard the buffer and return to step 1.
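Steps 2 and 3 can be sketched without any neural network. The code below assumes a single environment, a `dones` array where `dones[t] = 1` means the episode ended after step `t`, and a bootstrap value `last_value` for the state following the final step. The helper names `gae` and `minibatch_indices` are hypothetical, and the gradient step itself is left as a placeholder.

```python
import numpy as np

def gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalised advantage estimates and value targets for one rollout."""
    T = len(rewards)
    advantages = np.zeros(T)
    next_adv = 0.0
    next_value = last_value
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]  # mask bootstrapping across episode ends
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        next_adv = delta + gamma * lam * nonterminal * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    returns = advantages + values  # value targets R_t = A_t + V(s_t)
    return advantages, returns

def minibatch_indices(buffer_size, batch_size, num_epochs, rng):
    """Yield shuffled index batches: one full pass over the buffer per epoch."""
    for _ in range(num_epochs):
        perm = rng.permutation(buffer_size)
        for start in range(0, buffer_size, batch_size):
            yield perm[start:start + batch_size]

# Toy rollout: 8 steps, reward 1 each, an untrained critic that predicts 0.
rng = np.random.default_rng(0)
rewards, values, dones = np.ones(8), np.zeros(8), np.zeros(8)
advantages, returns = gae(rewards, values, dones, last_value=0.0)

# Frozen buffer: advantages and returns are never recomputed in this loop.
num_updates = 0
for idx in minibatch_indices(len(advantages), batch_size=4, num_epochs=3, rng=rng):
    batch_adv = advantages[idx]  # a real implementation takes a gradient step here
    num_updates += 1
print("gradient steps on one buffer:", num_updates)
```

The point of the structure is that `gae` runs once per rollout while the inner loop touches the same stored advantages several times, which is exactly the data reuse that the clip is meant to keep safe.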

Figure 3: The PPO loop. Each iteration produces one frozen rollout buffer (with GAE advantages), then runs $K$ epochs of mini-batch SGD on the *same* buffer (the dashed blue inner loop). Because $\pi_{\theta_\text{old}}$ is fixed for the duration of those epochs, the ratio $r_t(\theta)$ has a stable denominator and the clip in (4) measures *exactly* how far the policy has drifted on each $(s,a)$. Compare with the actor-critic flow: the new structural element is the inner `K-epoch` loop.

Typical hyperparameters from the original paper and the CleanRL reproductions (Schulman et al., 2017; Huang et al., 2022): $N\!\in\![8, 128]$ environments, $T\!\in\![128, 2048]$ steps per rollout, $K\!\in\![3, 10]$ epochs, mini-batches of size 32–512, clip $\epsilon \approx 0.2$, GAE $\lambda \approx 0.95$, discount $\gamma \approx 0.99$.
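These numbers determine how much optimisation happens per rollout. The sketch below collects one illustrative setting, picked from inside the ranges above rather than taken from any particular library, in a hypothetical `PPOConfig` dataclass and derives the buffer size and the number of gradient steps per iteration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PPOConfig:
    """Illustrative PPO settings; field names are assumptions, not a library API."""
    num_envs: int = 8          # N
    num_steps: int = 128       # T
    num_epochs: int = 4        # K
    minibatch_size: int = 256
    clip_eps: float = 0.2
    gae_lambda: float = 0.95
    gamma: float = 0.99

    @property
    def buffer_size(self) -> int:
        """Transitions collected per iteration: N * T."""
        return self.num_envs * self.num_steps

    @property
    def updates_per_iteration(self) -> int:
        """Gradient steps taken on one frozen buffer."""
        minibatches = -(-self.buffer_size // self.minibatch_size)  # ceiling division
        return self.num_epochs * minibatches

cfg = PPOConfig()
print("buffer size:", cfg.buffer_size)
print("gradient steps per iteration:", cfg.updates_per_iteration)
```

With these values each rollout of 1024 transitions feeds 16 gradient steps, compared with the single step A2C would take on the same data.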

Why Clipping Works: A First-Order Trust Region

The clip is best understood as a first-order, separable approximation of TRPO’s KL constraint. TRPO’s trust region is a single KL ball in policy-distribution space, enforced by a constrained optimisation. PPO’s trust region is an axis-aligned box in ratio space — one constraint per $(s, a)$ — enforced by zeroing the gradient at the box boundary. Both bound how far the new policy can move from the old one before further gradient information is ignored; PPO trades the precision of an explicit KL ball for the simplicity of plain SGD.

Figure 4: A schematic of the three update styles in policy parameter space. Vanilla policy gradient (red) takes a step proportional to the gradient with no bound on distributional change and can leave the region where the linearised surrogate is trustworthy. TRPO (purple) caps the step at the boundary of an explicit KL ball $\mathbb E[\mathrm{KL}] \le \delta$ via a constrained second-order solve. PPO-Clip (blue) caps the step at the boundary of an axis-aligned ratio band $r_t \in [1-\epsilon, 1+\epsilon]$ via a first-order clip in the loss, with no constrained solver and no Fisher matrix.

Empirically the trade is good. Across the standard MuJoCo and Atari benchmarks, PPO matches or exceeds TRPO’s performance with a small fraction of the implementation complexity (Schulman et al., 2017). The first-order formulation also adapts more naturally to architectures and training regimes where a Fisher-matrix approximation is awkward: recurrent policies (where the Fisher is defined over sequence distributions), very large parallel rollouts, and mixed-precision training.

Practical Details That Matter

PPO’s pseudocode is short, but a faithful re-implementation needs a handful of conventions that are not in the paper but matter substantially in practice. Engstrom et al. (2020) and Andrychowicz et al. (2021) showed in independent large-scale studies that several of these “code-level optimisations” account for as much of PPO’s measured advantage over TRPO as the clipped objective itself does, and Huang et al. (2022) enumerate 37 of them in the CleanRL-aligned blog post that has become the de-facto reference. The most impactful, kept here at the level of pointers:

Advantage normalisation. Standardise $\hat A_t$ to mean 0, std 1 per mini-batch (not per full buffer) before forming the loss. Without it, the policy-loss scale drifts wildly across iterations and the chosen learning rate stops being meaningful. Normalising per buffer instead of per mini-batch introduces a mild data-leakage issue (the normalisation statistics see the whole buffer before any gradient step), and per-mini-batch normalisation is what widely used implementations such as CleanRL and SB3 do.
Value-loss clipping. A symmetric clip on the change in value predictions mirrors the policy clip and stabilises the critic when the value head has high capacity. Concretely, the clipped value prediction is $\hat V_t^{\text{clip}} = V_{w_\text{old}}(s_t) + \text{clip}(V_w(s_t) - V_{w_\text{old}}(s_t),\, -\epsilon,\, +\epsilon)$, and the value loss uses $\max\!\bigl((V_w(s_t)-\hat R_t)^2,\,(\hat V_t^{\text{clip}}-\hat R_t)^2\bigr)$. Originally from the OpenAI Baselines codebase^[1]; documented as detail #9 in Huang et al. (2022).
Orthogonal initialisation, small final-layer scale. Orthogonal weights with a small gain on the final policy head keep $\pi_\theta$ near uniform at initialisation, so early gradients are well-behaved.
Reward and observation normalisation. Running statistics of the discounted returns are used to scale rewards before training, and running statistics of the observations are sometimes used to normalise the network inputs. The effect of reward scaling is essentially a learning-rate normalisation across environments with different reward magnitudes.
Linear learning-rate annealing to zero over training, and a small $c_H$ that decays to zero late in training.
Generalised advantage estimation with $\lambda \approx 0.95$ — the same estimator we already saw in actor-critic, but now consistently used by every PPO implementation.
KL early-stopping (optional). Some implementations add a guard: if the mean KL between $\pi_\theta$ and $\pi_{\theta_\text{old}}$ exceeds a threshold (e.g., $1.5 \times \delta_\text{target}$ ) at any point during the $K$ epochs, the inner loop is terminated early. This makes PPO robust to cases where the clip alone is not sufficient (e.g., very high-advantage transitions with a large mini-batch).
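Three of these conventions are short enough to sketch directly: per-mini-batch advantage normalisation, the clipped value loss, and the KL early-stopping guard. The function names, the `1e-8` stabiliser, and `target_kl` are illustrative assumptions; the KL estimate is the simple sample mean of $\log\pi_{\theta_\text{old}} - \log\pi_\theta$ over actions drawn from the old policy, which is one of several estimators in use.

```python
import numpy as np

def normalise_advantages(adv, eps=1e-8):
    """Standardise one mini-batch of advantages to mean 0 and std 1."""
    return (adv - adv.mean()) / (adv.std() + eps)

def clipped_value_loss(v_new, v_old, returns, eps=0.2):
    """Pessimistic value loss: max of the unclipped and clipped squared errors."""
    v_clip = v_old + np.clip(v_new - v_old, -eps, eps)
    unclipped_err = (v_new - returns) ** 2
    clipped_err = (v_clip - returns) ** 2
    return float(np.mean(np.maximum(unclipped_err, clipped_err)))

def approx_kl(logp_new, logp_old):
    """Sample estimate of KL(pi_old || pi_new) from actions drawn under pi_old."""
    return float(np.mean(logp_old - logp_new))

def should_stop_early(logp_new, logp_old, target_kl):
    """Guard from the text: stop the inner loop once KL exceeds 1.5 * target."""
    return approx_kl(logp_new, logp_old) > 1.5 * target_kl

# Made-up mini-batch values.
adv = normalise_advantages(np.array([2.0, 4.0, 6.0, 8.0]))
v_loss = clipped_value_loss(v_new=np.array([1.0]), v_old=np.array([0.0]),
                            returns=np.array([2.0]))
logp_old = np.log(np.array([0.5, 0.5]))
logp_new = np.log(np.array([0.25, 0.25]))
stop = should_stop_early(logp_new, logp_old, target_kl=0.01)

print("normalised advantages:", np.round(adv, 3))
print("clipped value loss:", round(v_loss, 3), "stop early:", stop)
```

In the value-loss example the prediction moved from 0.0 to 1.0, further than $\epsilon = 0.2$ allows, so the loss is computed as if it had only moved to 0.2, which gives the larger of the two squared errors.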

These are not pedagogical fluff. Engstrom et al. (2020) report that an “honest” PPO with the clipped objective but without the code-level optimisations performs comparably to vanilla TRPO; the gap PPO is famous for largely closes when the implementation tricks are equalised. The takeaway is the same one that runs through this course: in deep RL the algorithm and the engineering are inseparable, and any benchmark comparison that ignores the latter is suspect (Henderson et al., 2018).

Empirical Evidence

PPO’s reputation as the on-policy default rests on its robustness across very different domains with one near-universal hyperparameter setting. The CleanRL benchmarks (Huang et al., 2022) make this concrete: with a single PPO config (clip 0.2, $\lambda = 0.95$, $K=4$ epochs, $T=128$ steps, learning rate $2.5\times10^{-4}$ annealed linearly), the same algorithm matches or beats published baselines on classic control, MuJoCo continuous control, Atari pixel observations, and procedurally generated environments; see the per-environment learning curves at CleanRL’s WandB. That breadth is what made PPO the default backbone for downstream applications: the InstructGPT/RLHF pipeline used PPO with a KL penalty against a fixed reference policy (Ouyang et al., 2022), an approach that remains common in production systems, even as newer methods such as GRPO and DPO have emerged.

Summary

PPO is the algorithm that made on-policy actor-critics reliable, easy to tune, and more sample-efficient than single-update methods such as A2C.

The step-size problem of vanilla policy gradients (Kakade & Langford, 2002) is that a small parameter step can mean a large distributional step, and once $\pi_\theta$ leaves the trustworthy neighbourhood of $\pi_{\theta_\text{old}}$ the surrogate stops being meaningful.
TRPO (Schulman et al., 2015) solves this with a hard KL constraint and a constrained second-order step. It works, but the implementation is heavy.
PPO-Clip (Schulman et al., 2017) replaces the KL constraint with the clipped surrogate $L^{\text{CLIP}}$. The pessimistic $\min$ ensures the clip only fires in the direction that would push the policy further from $\pi_{\theta_\text{old}}$, giving a per-$(s,a)$ first-order trust region that costs nothing more than plain SGD.
The PPO algorithm reuses the same on-policy buffer for $K$ epochs of mini-batch SGD, with GAE (Schulman et al., 2016) as the advantage estimator. The clip is what makes that reuse safe.
A handful of implementation details (advantage normalisation, value-loss clipping, orthogonal init, reward scaling, LR annealing) explain a substantial part of PPO’s measured advantage in benchmark papers (Engstrom et al., 2020; Andrychowicz et al., 2021; Huang et al., 2022). Algorithm and engineering travel together.

PPO is the conceptual end-point of this course’s tour through on-policy methods: it is what you reach for when sample efficiency comes second to robustness and ease of tuning, when you have a fast simulator and many parallel environments, or when you want a known-quantity algorithm for downstream work like LLM fine-tuning. Next we will focus on improving the sample efficiency using a learned model of the environment.

Footnotes

[1] https://github.com/openai/baselines

References

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv Preprint arXiv:1707.06347.
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., & Lowe, R. (2022). Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems (NeurIPS), 35, 27730–27744.
Hugging Face. (2024). Hugging Face Deep Reinforcement Learning Course. Online course. https://huggingface.co/learn/deep-rl-course/en/unit0/introduction
Kakade, S. M., & Langford, J. (2002). Approximately Optimal Approximate Reinforcement Learning. Proceedings of the 19th International Conference on Machine Learning (ICML), 267–274.
Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., & Moritz, P. (2015). Trust Region Policy Optimization. Proceedings of the 32nd International Conference on Machine Learning (ICML), 37, 1889–1897.
Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1506.02438
Huang, S., Dossa, R. F. J., Raffin, A., Kanervisto, A., & Wang, W. (2022). The 37 Implementation Details of Proximal Policy Optimization. ICLR Blog Track. https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
Huang, S., Dossa, R. F. J., Ye, C., Braga, J., Chakraborty, D., Mehta, K., & Raffin, A. (2022). CleanRL: High-Quality Single-File Implementations of Deep Reinforcement Learning Algorithms. Journal of Machine Learning Research, 23(274), 1–18.
Engstrom, L., Ilyas, A., Santurkar, S., Tsipras, D., Janoos, F., Rudolph, L., & Madry, A. (2020). Implementation Matters in Deep RL: A Case Study on PPO and TRPO. Proceedings of the 8th International Conference on Learning Representations (ICLR). https://openreview.net/forum?id=r1etN1rtPB
Andrychowicz, M., Raichuk, A., Stańczyk, P., Orsini, M., Girgin, S., Marinier, R., Hussenot, L., Geist, M., Pietquin, O., Michalski, M., Gelly, S., & Bachem, O. (2021). What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study. Proceedings of the 9th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2006.05990
Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., & Meger, D. (2018). Deep Reinforcement Learning That Matters. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1). https://doi.org/10.1609/aaai.v32i1.11694