RL for LLMs

Every algorithm in this course so far has been aimed at an agent acting in an environment: a paddle in Breakout, a stone on a Go board, a robot imagining rollouts inside a learned world model. This final chapter turns to a domain that looks, at first glance, nothing like control — generating text with a large language model (LLM) — and shows that it is, in fact, a reinforcement learning problem in disguise. The same machinery we built around policy gradients and Proximal Policy Optimization (PPO) (Chapter 6) is what turned a raw next-token predictor into ChatGPT Ouyang et al., 2022, and what more recently produced a new class of reasoning models that teach themselves to think step by step.

A pretrained LLM is trained on one objective: predict the next token on a vast corpus of internet text. This yields astonishing knowledge but a poorly behaved assistant — the model continues text rather than following instructions, and it has no notion of which of its many plausible continuations a human would actually prefer. Supervised fine-tuning (SFT) on curated demonstrations closes part of the gap, but demonstrations are expensive and can only show the model what good looks like, never rank competing outputs or push beyond the quality of the human-written examples. The missing ingredient is a reward signal — and optimising a policy against a reward is exactly what RL is for.

First we frame text generation as a Markov decision process (MDP), so that the LLM becomes a policy and the rest of the course applies directly. We then cover Reinforcement Learning from Human Feedback (RLHF), the three-stage pipeline behind InstructGPT Ouyang et al., 2022 that made instruction-following assistants possible. Next we introduce Group Relative Policy Optimization (GRPO) Shao et al., 2024, a critic-free variant of PPO, and show how it drove the emergence of reasoning in DeepSeek-R1 Guo et al., 2025. Finally we look at Direct Preference Optimization (DPO) Rafailov et al., 2023, which targets the same objective as RLHF without running an RL loop at all, and ask in what sense it is, or is not, reinforcement learning.

Text Generation as a Reinforcement Learning Problem

To apply RL to a language model we need to read text generation as a sequential decision process. Given a prompt $x$, the LLM produces a response one token at a time. The state at step $t$ is everything seen so far: the prompt plus the tokens already generated, $s_t = (x, w_1, \ldots, w_t)$. The action is the next token $a_t = w_{t+1}$, drawn from the model’s vocabulary. The policy is the language model itself, $\pi_\theta(a_t \mid s_t)$ with parameters $\theta$: the conditional distribution over the next token given the context. The transition is trivial and deterministic (the chosen token is appended to the context, $s_{t+1} = (s_t, a_t)$), so there are no environment dynamics to learn here, unlike the world models of Chapter 8.

Figure 2: Autoregressive decoding as an MDP. The policy $\pi_\theta$ (the language model) samples a token $a_t = w_{t+1}$ from the current state $s_t$; the transition simply appends it, growing the context. The episode ends when the response is complete, at which point a scalar reward $r(x, y)$ scores the whole sequence $y = (w_1, \ldots, w_{|y|})$.

The defining feature of this MDP is its reward structure. There is no per-token reward; quality is a property of the whole response. The episode runs until an end-of-sequence token (or a length limit), and only then does a scalar reward $r(x, y)$ judge the completed answer $y = (w_1, \ldots, w_{|y|})$ of length $|y|$, as shown in Figure 2. This makes the problem a sparse, terminal-reward RL task with very long horizons (hundreds or thousands of tokens) and an enormous action space (the entire vocabulary at every step). It is precisely the regime where the policy-gradient methods of Chapters 4–6 are at home: we can sample complete responses, score them, and push up the probability of the good ones.
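The MDP reading above can be sketched in a few lines of Python. This is a toy illustration, not a real language model: the vocabulary `VOCAB`, the uniform `toy_policy`, and the keyword-based `toy_reward` are hypothetical stand-ins introduced only to show the state, action, transition, and terminal reward.

```python
import random

EOS = "<eos>"
VOCAB = ["yes", "no", "maybe", EOS]  # hypothetical toy vocabulary


def toy_policy(state):
    """Stand-in for pi_theta(. | s_t): uniform over the toy vocabulary."""
    return {token: 1.0 / len(VOCAB) for token in VOCAB}


def toy_reward(prompt, response):
    """Stand-in for r(x, y): 1.0 if the response contains "yes", else 0.0."""
    return 1.0 if "yes" in response else 0.0


def rollout(prompt, max_len=8, seed=0):
    """Sample one episode; the reward is computed only once, at the end."""
    rng = random.Random(seed)
    state = tuple(prompt)              # initial state: the prompt alone
    response = []
    for _ in range(max_len):           # length limit on the episode
        probs = toy_policy(state)
        tokens = list(probs)
        action = rng.choices(tokens, weights=[probs[t] for t in tokens])[0]
        state = state + (action,)      # deterministic transition: append the token
        if action == EOS:              # end-of-sequence terminates the episode
            break
        response.append(action)
    return state, response, toy_reward(prompt, response)
```

Calling `rollout(["Is", "RL", "useful", "?"])` returns the final state, the sampled response, and a single scalar reward; no reward is produced for intermediate tokens.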

Where does the reward $r(x, y)$ come from? That is the central design question of this chapter. We will cover two settings: RLHF learns it from human preference comparisons; reasoning models often compute it from a verifier (does the final answer match the ground truth? does the code pass the tests?). The RL algorithm is largely the same in both cases — what changes is the source of the reward.

RLHF: Aligning Models with Human Preferences

The idea of optimising a policy against a learned reward model derived from human preference comparisons predates LLMs: it was demonstrated on simulated robotics and Atari by Christiano et al., 2017 and then adapted to text generation, first for stylistic control Ziegler et al., 2019 and then for summarisation Stiennon et al., 2020. InstructGPT Ouyang et al., 2022 assembled these pieces into the now-standard three-stage recipe that underlies ChatGPT and its successors, shown in Figure 3.

Figure 3: The three stages of RLHF as used in InstructGPT. (1) Supervised fine-tuning produces $\pi^{\mathrm{SFT}}$, which also serves as the frozen reference $\pi_{\mathrm{ref}}$. (2) A reward model $r_\phi$ is trained from human rankings of sampled outputs via a Bradley–Terry loss. (3) PPO optimises the policy to maximise $r_\phi$ minus a KL (Kullback–Leibler) penalty that keeps it close to the reference.

Stage 1: Supervised fine-tuning

A pretrained base model is fine-tuned on a dataset of human-written demonstrations: prompts paired with high-quality responses. This is ordinary supervised learning (next-token prediction on curated data), and it produces a policy $\pi^{\mathrm{SFT}}$ that already follows instructions reasonably well. Crucially, this model plays a second role: a frozen copy of it becomes the reference policy $\pi_{\mathrm{ref}}$ used to anchor the later RL stage.

Stage 2: Reward modelling from human preferences

Instead of asking humans to write ideal answers, which is slow and inconsistent, we ask them the much easier question of comparison: given a prompt $x$ and two candidate responses, which is better? For each prompt we sample several outputs from the SFT model and collect human rankings. A reward model $r_\phi(x, y)$, typically the SFT network with its language-modelling head replaced by a scalar output, is then trained to assign higher scores to preferred responses.

The standard choice is the Bradley–Terry model of pairwise preference, under which the probability that the “winning” response $y_w$ is preferred to the “losing” one $y_l$ is $\sigma\!\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$. Fitting $r_\phi$ by maximum likelihood gives the loss

$$
\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}_{\mathrm{pref}}} \Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big],
\tag{1}
$$

where $\sigma$ is the logistic sigmoid and $\mathcal{D}_{\mathrm{pref}}$ is the dataset of human preference triples $(x, y_w, y_l)$. The reward model distils thousands of human comparisons into a differentiable function that can score any response, including responses the policy has never been shown, which is what lets the next stage push beyond the quality ceiling of the demonstration data.
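As a minimal sketch of the loss (1), the function below takes reward-model scores that have already been computed for each preference pair and averages the negative log-likelihood. The names `log_sigmoid` and `reward_model_loss` are illustrative, and a real implementation would backpropagate through $r_\phi$ in an autodiff framework rather than work on plain floats.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigma(z)) for a scalar z."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(score_pairs):
    """Bradley-Terry loss: mean of -log sigma(r_w - r_l).

    score_pairs: list of (r_w, r_l), the reward-model scores of the
    preferred and the dispreferred response for the same prompt.
    """
    total = sum(log_sigmoid(r_w - r_l) for r_w, r_l in score_pairs)
    return -total / len(score_pairs)
```

When the two scores are equal the loss is $\log 2$ (the model is indifferent); it falls towards zero as the preferred response is scored increasingly higher than the dispreferred one.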

Stage 3: Policy optimisation with PPO

With a reward model in hand, the alignment problem becomes the RL problem of Figure 2: find a policy that produces high-reward responses. InstructGPT Ouyang et al., 2022 solves it with PPO (Chapter 6). At the level of whole responses each prompt is a one-step bandit, with a single reward for the completed answer; inside the response the problem still unfolds as a long token-level MDP with many actions. The target is a KL-regularised reward objective

$$
\max_{\pi_\theta}\; \mathbb{E}_{x\sim\mathcal{D}_{\mathrm{prompt}},\; y\sim\pi_\theta(\cdot\mid x)} \big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big),
\tag{2}
$$

where $\pi_\theta$ is the language-model policy being updated (parameters $\theta$, as in Text Generation as a Reinforcement Learning Problem); $\mathcal{D}_{\mathrm{prompt}}$ is the dataset of prompts on which RL is run (distinct from the preference set $\mathcal{D}_{\mathrm{pref}}$ of Stage 2); $x\sim\mathcal{D}_{\mathrm{prompt}}$ is a prompt; $y = (w_1, \ldots, w_{|y|})$ is a complete response sampled token by token from $\pi_\theta(\cdot\mid x)$; $r_\phi(x, y)$ is the scalar score from the Stage-2 reward model (parameters $\phi$ are frozen during RL); $\pi_{\mathrm{ref}}$ is the frozen reference policy from Stage 1 (a copy of $\pi^{\mathrm{SFT}}$); $\beta > 0$ is a hyperparameter trading human-preference reward against fidelity to the reference; and $\mathbb{D}_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})$ measures how far the optimised policy has drifted from the SFT model. For an autoregressive policy the KL decomposes over tokens as

$$
\mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big) = \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\!\left[\sum_{t=1}^{|y|} \log\frac{\pi_\theta(w_t\mid x, w_{<t})}{\pi_{\mathrm{ref}}(w_t\mid x, w_{<t})} \right].
\tag{3}
$$

The first term in (2) rewards completions the reward model likes; the second penalises drift away from the SFT distribution, the region where $r_\phi$ was trained and can be trusted. Without the KL term the policy quickly learns to reward-hack, producing degenerate, repetitive, or off-distribution text that $r_\phi$ scores highly but humans dislike.

For clarity we present the objective in its core form. InstructGPT’s full recipe adds an optional pretraining-mixing term (the “PPO-ptx” variant) — a language-modelling loss on the pretraining corpus folded into (2) to curb the alignment tax, the regression on standard NLP benchmarks that pure RLHF can cause Ouyang et al., 2022. We omit it here, as it is orthogonal to the RL mechanics.

Per-token rewards. PPO needs a scalar reward at every time step (Chapters 5–6). Because $r_\phi$ judges only the whole response, InstructGPT assigns it at the final token and folds the KL penalty in at every token Ouyang et al., 2022:

$$
\tilde{r}_t = \begin{cases} r_\phi(x, y) - \beta\, \mathrm{kl}_t, & t = |y|, \\[4pt] -\beta\, \mathrm{kl}_t, & t < |y|, \end{cases} \qquad \mathrm{kl}_t = \log\frac{\pi_\theta(w_t\mid s_t)}{\pi_{\mathrm{ref}}(w_t\mid s_t)},
\tag{4}
$$

where $s_t = (x, w_{<t})$ is the context in which token $w_t$ is chosen (the MDP state of Text Generation as a Reinforcement Learning Problem, with the time index shifted by one so that $w_t$ is the action taken in $s_t$; $w_{<t}$ denotes the tokens before position $t$), $w_t$ is the token sampled at step $t$, and $\mathrm{kl}_t$ is the per-token log-ratio whose expected sum over $t$ is the sequence KL in (3). Intermediate tokens receive only KL shaping; the learned preference reward lands once, when the response is complete.
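The reward shaping of (4) can be sketched as follows, assuming the per-token log-probabilities of the sampled response under the policy and under the reference have already been computed. The function name `shaped_rewards` and its argument names are illustrative.

```python
def shaped_rewards(logp_policy, logp_ref, terminal_reward, beta):
    """Per-token rewards following equation (4).

    logp_policy[t], logp_ref[t]: log-probability of the sampled token at
    position t under pi_theta and pi_ref respectively.
    terminal_reward: the scalar reward-model score r_phi(x, y).
    """
    if not logp_policy or len(logp_policy) != len(logp_ref):
        raise ValueError("need two non-empty log-prob lists of equal length")
    # kl_t is the per-token log-ratio between policy and reference.
    kl = [lp - lr for lp, lr in zip(logp_policy, logp_ref)]
    # Every token pays the KL penalty ...
    rewards = [-beta * k for k in kl]
    # ... and only the final token also receives the preference reward.
    rewards[-1] += terminal_reward
    return rewards
```

Summing the returned list gives $r_\phi(x, y) - \beta \sum_t \mathrm{kl}_t$, a single-sample estimate of the bracketed quantity in (2) for that response.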

Advantages and the full PPO loss. With per-token rewards $\tilde{r}_t$ in hand, training follows the actor–critic recipe of Chapters 5–6. A value network $V_\psi(s_t)$ (in InstructGPT, initialised from the reward model Ouyang et al., 2022) estimates how much future (KL-adjusted) return remains from state $s_t$. Generalized Advantage Estimation (GAE) Schulman et al., 2016 turns the reward stream into advantages $\hat{A}_t$ and return targets $\hat{R}_t$ (both defined in Chapter 6). Policy and critic are then trained by minimising a single loss: the negated clipped surrogate from Chapter 6, plus a value loss, minus an entropy bonus:

$$
\mathcal{L}^{\mathrm{RLHF}}(\theta, \psi) = -\underbrace{\mathbb{E}_{x,\, y\sim\pi_{\theta_{\mathrm{old}}}}\!\left[\frac{1}{|y|}\sum_{t=1}^{|y|} \min\!\big(\rho_t\, \hat{A}_t,\; \mathrm{clip}(\rho_t,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t\big) \right]}_{\text{policy term}} + c_V\, \underbrace{\mathbb{E}_t\!\left[(V_\psi(s_t) - \hat{R}_t)^2\right]}_{\text{value loss}} - c_H\, \underbrace{\mathbb{E}_t\!\left[H\bigl(\pi_\theta(\cdot\mid s_t)\bigr)\right]}_{\text{entropy bonus}}.
\tag{5}
$$

The expectation is over prompts $x\sim\mathcal{D}_{\mathrm{prompt}}$ and responses $y$ sampled from the old policy $\pi_{\theta_{\mathrm{old}}}$ before each PPO update (the usual importance-sampling setup from Chapter 6). For each token $t$ in response $y$: $\rho_t = \pi_\theta(w_t\mid s_t)\,/\,\pi_{\theta_{\mathrm{old}}}(w_t\mid s_t)$ is the importance ratio; $\hat{A}_t$ is the GAE advantage computed from $\{\tilde{r}_t\}$; $\epsilon$ is PPO’s clip width (typically 0.2); $V_\psi$ is the critic with parameters $\psi$; $\hat{R}_t$ is the bootstrapped return target; $H(\pi_\theta(\cdot\mid s_t))$ is the policy entropy at $s_t$; and $c_V, c_H$ weight the auxiliary terms (often $c_V \in \{0.5, 1\}$, $c_H \approx 0$ for LLMs). As in vanilla PPO, the rollout buffer is frozen and reused for $K$ epochs of mini-batch SGD; then $\theta_{\mathrm{old}}\leftarrow\theta$ and fresh responses are sampled.
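The two per-response computations behind (5) can be sketched on plain Python lists. This assumes the standard GAE recursion with a zero value after the final token and return targets $\hat{R}_t = \hat{A}_t + V_\psi(s_t)$; the discount `gamma` and the GAE parameter `lam` are illustrative defaults, not values taken from this chapter, and the function names are hypothetical.

```python
import math


def gae(rewards, values, gamma=1.0, lam=0.95):
    """Advantages and return targets for one response.

    rewards[t]: shaped reward r~_t; values[t]: critic estimate V(s_t).
    The value after the last token is taken to be 0 (episode ends).
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


def clipped_policy_term(logp_new, logp_old, advantages, eps=0.2):
    """Token-averaged clipped surrogate (the policy term of the loss)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        rho = math.exp(ln - lo)                    # importance ratio
        clipped = min(max(rho, 1.0 - eps), 1.0 + eps)
        total += min(rho * adv, clipped * adv)
    return total / len(advantages)
```

With a positive advantage the clip caps the gain from raising a token's probability at $(1+\epsilon)\hat{A}_t$; with a negative advantage the `min` keeps the unclipped, more pessimistic value.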

GRPO: Reinforcement Learning Without a Critic

PPO is an actor–critic method (Chapter 5): alongside the policy it trains a value network to estimate the baseline $V(s)$ used to form the advantage. For LLMs this critic is costly — it is typically another network the size of the policy, doubling memory and adding a separately-tuned learning problem — and it is awkward, because a reliable per-token value estimate is hard to learn when the reward only arrives at the very end of a long sequence.

Group Relative Policy Optimization (GRPO) Shao et al., 2024, introduced for the DeepSeekMath models, removes the critic entirely. The insight is that the only thing the baseline does is tell us whether an action was better than average; for a given prompt we can estimate “average” empirically by sampling a whole group of responses and comparing them to each other. For a prompt $x$, GRPO samples $G$ responses $\{y_1, \ldots, y_G\}$ from the current policy, scores each with the reward $r_i = r(x, y_i)$, and uses the group-normalised reward as the advantage shared by every token of response $i$:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}.
\tag{6}
$$

Responses better than the group mean get a positive advantage and are reinforced; worse-than-average ones are suppressed. This is the classic REINFORCE baseline (Chapter 4) estimated by Monte Carlo within the group, rather than by a learned value function. The policy is then updated with the familiar PPO-style clipped objective, plus a KL penalty to a reference model:

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{x\sim\mathcal{D}_{\mathrm{prompt}},\;\{y_i\}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid x)}\!\left[ \frac{1}{G}\sum_{i=1}^{G} \min\!\Big(\rho_i\, \hat{A}_i,\; \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i\Big) \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) \right],
\tag{7}
$$

where $\rho_i = \pi_\theta(y_i \mid x) / \pi_{\theta_{\mathrm{old}}}(y_i \mid x)$ is the importance ratio. We show (7) at the sequence level for clarity; in the DeepSeekMath implementation the clip and the KL penalty are applied per token, averaging over the tokens within each response. Compared with the full PPO loss (5), GRPO keeps the clip and the KL but drops the value and entropy terms and replaces the GAE advantage $\hat{A}_t$ with the group-relative one of (6).
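A minimal sketch of the group-relative advantage (6) and the sequence-level clipped term of (7) is shown below. Several choices here are assumptions rather than part of the equations: the population standard deviation is used, a small constant `eps` is added to the denominator so that a group with identical rewards yields zero advantages instead of a division by zero, and the KL penalty is omitted for brevity. The function names are illustrative.

```python
import math
import statistics


def group_advantages(rewards, eps=1e-8):
    """Group-normalised advantages: (r_i - mean) / std within one prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


def grpo_clipped_term(seq_logp_new, seq_logp_old, rewards, clip_eps=0.2):
    """Sequence-level clipped surrogate for one prompt, without the KL term.

    seq_logp_new[i], seq_logp_old[i]: log pi(y_i | x) under the current
    and the old policy for the i-th response in the group.
    """
    advantages = group_advantages(rewards)
    total = 0.0
    for ln, lo, adv in zip(seq_logp_new, seq_logp_old, advantages):
        rho = math.exp(ln - lo)                        # importance ratio
        clipped = min(max(rho, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(rho * adv, clipped * adv)
    return total / len(rewards)
```

For binary rewards `[1, 0, 0, 1]` the advantages come out close to `[1, -1, -1, 1]`: correct responses are pushed up and incorrect ones pushed down by the same amount, with no value network involved.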

This design is an especially good fit for tasks with verifiable rewards — mathematics, coding, logic — where a response can be checked automatically (right answer? tests pass?) and no learned reward model is needed at all. The reward is then a cheap, exact, hack-resistant signal, and group sampling provides a low-variance baseline for free. That combination is what makes GRPO the workhorse of modern reasoning models.

RL for Reasoning: DeepSeek-R1

The most striking recent demonstration of RL in LLMs is DeepSeek-R1 Guo et al., 2025, which showed that reasoning ability can be incentivised by reinforcement learning rather than imitated from human-written chains of thought. Figure 4 summarises the two models the paper introduces.

Figure 4: Top: DeepSeek-R1-Zero is trained with GRPO directly on the base model using only rule-based, verifiable rewards, and develops long chains of thought on its own. Bottom: DeepSeek-R1 wraps that reasoning RL in a four-stage pipeline (cold-start SFT, reasoning RL, rejection-sampling SFT, and a final broad RL stage) to fix readability and broaden the model’s behaviour.

R1-Zero: reasoning from pure RL

DeepSeek-R1-Zero starts from the DeepSeek-V3 base model and applies GRPO with no supervised fine-tuning at all — skipping Stage 1 of RLHF entirely. The reward is purely rule-based: an accuracy reward that checks whether the final answer is correct (for maths problems with known solutions, or code that passes tests) and a format reward that requires the model to place its reasoning inside designated <think> tags. There is no learned reward model and therefore little room for reward hacking.
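A rule-based reward of this kind might look like the sketch below. It is illustrative only: the exact output template, the way the final answer is extracted, the exact-match comparison, and the equal weighting of the two components are all assumptions made for the example, not details taken from the DeepSeek-R1 paper.

```python
import re

# Assumed format: reasoning inside <think>...</think>, answer after it.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>(.*)$", re.DOTALL)


def format_reward(response):
    """1.0 if the response opens with a non-empty <think> block, else 0.0."""
    return 1.0 if THINK_PATTERN.match(response.strip()) else 0.0


def accuracy_reward(response, ground_truth):
    """1.0 if the text after the <think> block equals the known answer."""
    match = THINK_PATTERN.match(response.strip())
    answer = match.group(2) if match else response
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0


def rule_based_reward(response, ground_truth):
    """Sum of accuracy and format rewards (equal weights are an assumption)."""
    return accuracy_reward(response, ground_truth) + format_reward(response)
```

For example, `rule_based_reward("<think>2+2 is 4</think> 4", "4")` earns both components, while a bare `"4"` earns only the accuracy component. Because both checks are deterministic rules, there are no learned parameters for the policy to exploit.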

Trained against this signal alone, the model spontaneously learns to generate longer and longer chains of thought, allocating more inference-time computation to harder problems, and exhibits emergent behaviours such as re-checking its own work and exploring alternative approaches — what the authors memorably call an “aha moment.” This is a notable result: complex reasoning strategies emerged from a scalar correctness reward and group-relative policy gradients, without ever being shown a human example of how to reason. The cost is that R1-Zero’s output is often hard to read — it mixes languages and produces unpolished, sometimes chaotic text — because the reward cares only about the final answer, not the legibility of the path to it.

R1: a multi-stage pipeline

To keep the reasoning gains while producing a usable assistant, DeepSeek-R1 surrounds the RL with additional stages. A small cold-start SFT on a few thousand high-quality long-chain-of-thought examples gives the model a readable starting style. Reasoning-oriented RL (GRPO again) then sharpens its problem-solving, now with an added language-consistency reward to curb the mixing seen in R1-Zero. Rejection sampling harvests the best responses from that RL checkpoint and combines them with general-purpose data for another round of SFT. A final RL stage across diverse prompts tunes for helpfulness and harmlessness. The result matches frontier proprietary reasoning models on maths and coding benchmarks — on the competition-maths benchmarks AIME 2024 and MATH-500, DeepSeek-R1 reaches 79.8% and 97.3% pass@1 (the fraction solved correctly on the first attempt), on par with OpenAI’s o1 — while remaining a coherent assistant, and, importantly, the recipe and weights were released openly.

Direct Preference Optimization (DPO)

RLHF is powerful but operationally heavy: it requires training a separate reward model and then running an on-policy RL loop that repeatedly samples from the model, scores those samples, and updates — sensitive to many hyperparameters. Direct Preference Optimization (DPO) Rafailov et al., 2023 asks whether the RL stage can be skipped while optimising the same objective. Remarkably, it can.

The starting point is a known fact about the KL-regularised objective (2): for a fixed reward $r$, it has a closed-form optimal policy,

$$
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big),
\tag{8}
$$

where $Z(x)$ is a normalising constant. DPO’s key move is to invert this relationship: solving (8) for the reward expresses it in terms of the optimal policy and the reference,

$$
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x).
\tag{9}
$$
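Equations (8) and (9) can be checked numerically on a toy problem in which a prompt has only three possible responses, so that $Z(x)$ can be summed explicitly (for a real LLM the sum over all responses is intractable). The reference probabilities, rewards, and $\beta$ used in the usage note are made-up numbers, and the function names are illustrative.

```python
import math


def optimal_policy(pi_ref, rewards, beta):
    """Closed-form optimum (8) over an enumerable set of responses."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)                      # normalising constant Z(x)
    return [w / z for w in weights], z


def implicit_reward(pi_star, pi_ref, z, beta):
    """Invert (8) to recover the reward, as in (9)."""
    return [
        beta * math.log(ps / pr) + beta * math.log(z)
        for ps, pr in zip(pi_star, pi_ref)
    ]
```

With `pi_ref = [0.5, 0.3, 0.2]`, `rewards = [1.0, 0.0, -1.0]`, and `beta = 0.5`, `implicit_reward` returns the original rewards up to rounding. The difference of two log-ratio terms, $\beta \log\frac{\pi^{*}(y_1\mid x)}{\pi_{\mathrm{ref}}(y_1\mid x)} - \beta \log\frac{\pi^{*}(y_2\mid x)}{\pi_{\mathrm{ref}}(y_2\mid x)}$, already equals $r(x, y_1) - r(x, y_2)$ without any knowledge of $Z(x)$.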

Substituting this implicit reward, with the trainable policy $\pi_\theta$ in place of $\pi^{*}$, into the Bradley–Terry preference likelihood (1) makes the intractable $Z(x)$ cancel (it is the same for $y_w$ and $y_l$), leaving a loss that depends only on the policy being trained and the frozen reference:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}_{\mathrm{pref}}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right].
\tag{10}
$$

This is an ordinary supervised classification loss on preference pairs: it simply increases the policy’s relative log-probability of preferred responses over dispreferred ones, scaled by $\beta$ and anchored to the reference. No reward model is trained, and no responses are sampled during training. Figure 5 contrasts the two routes.
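The loss (10) is short enough to sketch directly. The version below assumes the sequence log-probabilities $\log \pi_\theta(y \mid x)$ and $\log \pi_{\mathrm{ref}}(y \mid x)$ (sums of per-token log-probabilities) have already been computed for each pair; the function names and the default `beta=0.1` are illustrative choices, and a real implementation would differentiate through the policy log-probabilities.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigma(z)) for a scalar z."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pairs, beta=0.1):
    """DPO loss over preference pairs.

    pairs: list of (logp_w, ref_logp_w, logp_l, ref_logp_l), the sequence
    log-probabilities of the preferred (w) and dispreferred (l) responses
    under the policy and under the frozen reference.
    """
    total = 0.0
    for logp_w, ref_w, logp_l, ref_l in pairs:
        # Difference of implicit rewards; log Z(x) has already cancelled.
        margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
        total -= log_sigmoid(margin)
    return total / len(pairs)
```

At initialisation, when the policy equals the reference, every margin is zero and the loss is $\log 2$; it decreases as the policy raises the log-probability of preferred responses relative to dispreferred ones, measured against the reference.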

Figure 5: RLHF (top) trains an explicit reward model and then optimises the policy with an online RL loop. DPO (bottom) collapses both stages into a single offline loss on preference pairs, with the reward implicit in the policy–reference log-ratio. Both target the same KL-regularised objective.

So is DPO reinforcement learning? It optimises exactly the RLHF objective of (2), and the policy–reference log-ratio in (10) is the implicit reward — so in that sense it inherits the RL formulation. But mechanically it is supervised learning: a closed-form, offline loss over a fixed preference dataset, with no sampling, no reward model, and no exploration. That trade-off has consequences. DPO is far simpler and more stable to train, which has made it the default for preference tuning at moderate scale. But because it never samples from the current policy, it cannot discover and reinforce new high-reward behaviours the way an online RL loop can — it can only re-weight responses present in the preference data. This is precisely why reasoning models, which need to explore long solution paths and be rewarded by a verifier, rely on online methods like GRPO rather than DPO.

Table 1 consolidates the three approaches along the axes that distinguish them.

Table 1: The three preference-optimisation routes of this chapter at a glance. All target the same KL-regularised objective (2); they differ in machinery and in what they can learn.

	RLHF (PPO)	GRPO	DPO
Critic / value network	Yes	No — group baseline	No
Reward source	Learned reward model	Verifier or reward model	Implicit (policy–reference log-ratio)
Online / offline	Online (samples each update)	Online (samples a group)	Offline (fixed preference data)
Explores new behaviours?	Yes	Yes	No — re-weights existing data
KL anchor to reference	Yes (explicit penalty)	Yes (explicit penalty)	Yes (implicit in the loss)

Summary

This final chapter showed that text generation is a reinforcement learning problem in disguise. Autoregressive decoding is a Markov decision process — the state is the prompt plus tokens generated so far, each next token is an action, and quality is judged by a sparse terminal reward — so the policy-gradient methods of Chapters 4–6 apply directly, albeit with very long horizons and an enormous action space at every step.

The dominant alignment recipe is RLHF Ouyang et al., 2022: supervised fine-tuning on demonstrations, a Bradley–Terry reward model trained from human preference comparisons, and PPO against a KL-regularised objective that rewards high-scoring completions while penalising drift from a frozen reference policy. The KL anchor is essential; without it the policy exploits imperfections in the learned reward model rather than genuinely improving.

GRPO Shao et al., 2024 offers a critic-free alternative: sample a group of responses per prompt, normalise their rewards within the group to form advantages, and update with the same PPO clip and KL penalty. This is especially effective when rewards are verifiable, as in DeepSeek-R1 Guo et al., 2025, where rule-based correctness and format rewards alone were enough to incentivise long chains of thought, including in R1-Zero with no supervised fine-tuning at all. Finally, DPO Rafailov et al., 2023 reaches the same KL-regularised objective offline, as a supervised loss on preference pairs, but sacrifices the online exploration that reasoning RL depends on.