RL for LLMs
Align Large Language Models to human preferences and make them reason
Every algorithm in this course so far has been aimed at an agent acting in an environment: a paddle in Breakout, a stone on a Go board, a robot imagining rollouts inside a learned world model. This final chapter turns to a domain that looks, at first glance, nothing like control (generating text with a large language model, or LLM) and shows that it is, in fact, a reinforcement learning problem in disguise. The same machinery we built around policy gradients and Proximal Policy Optimization (PPO) (Chapter 6) is what turned a raw next-token predictor into ChatGPT (Ouyang et al., 2022), and what more recently produced a new class of reasoning models that teach themselves to think step by step.
A pretrained LLM is trained on one objective: predict the next token on a vast corpus of internet text. This yields astonishing knowledge but a poorly behaved assistant — the model continues text rather than following instructions, and it has no notion of which of its many plausible continuations a human would actually prefer. Supervised fine-tuning (SFT) on curated demonstrations closes part of the gap, but demonstrations are expensive and can only show the model what good looks like, never rank competing outputs or push beyond the quality of the human-written examples. The missing ingredient is a reward signal — and optimising a policy against a reward is exactly what RL is for.
First we frame text generation as a Markov decision process (MDP), so that the LLM becomes a policy and the rest of the course applies directly. We then cover Reinforcement Learning from Human Feedback (RLHF), the three-stage pipeline behind InstructGPT (Ouyang et al., 2022) that made instruction-following assistants possible. Next we introduce Group Relative Policy Optimization (GRPO) (Shao et al., 2024), a critic-free variant of PPO, and show how it drove the emergence of reasoning in DeepSeek-R1 (Guo et al., 2025). Finally we look at Direct Preference Optimization (DPO) (Rafailov et al., 2023), which reaches the same goal as RLHF without running an RL loop at all, and ask in what sense it is, or is not, reinforcement learning.
Text Generation as a Reinforcement Learning Problem
To apply RL to a language model we need to read text generation as a sequential decision process. Given a prompt $x$, the LLM produces a response $y = (y_1, \dots, y_T)$ one token at a time. The state at step $t$ is everything seen so far: the prompt plus the tokens already generated, $s_t = (x, y_{<t})$. The action is the next token $a_t = y_t$, drawn from the model’s vocabulary. The policy $\pi_\theta(a_t \mid s_t)$ is the language model itself, with parameters $\theta$: the conditional distribution over the next token given the context. The transition is trivial and deterministic, since the chosen token is simply appended to the context, $s_{t+1} = (s_t, a_t)$, so there are no environment dynamics to learn here, unlike the world models of Chapter 8.
Figure 2: Autoregressive decoding as an MDP. The policy $\pi_\theta$ (the language model) samples a token $a_t$ from the current state $s_t$; the transition simply appends it, growing the context. The episode ends when the response is complete, at which point a scalar reward $r(x, y)$ scores the whole sequence $y$.
The defining feature of this MDP is its reward structure. There is no per-token reward; quality is a property of the whole response. The episode runs until an end-of-sequence token (or a length limit), and only then does a scalar reward $r(x, y)$ judge the completed answer $y$ of length $T$, as shown in Figure 2. This makes the problem a sparse, terminal-reward RL task with very long horizons (hundreds or thousands of tokens) and an enormous action space (the entire vocabulary at every step). It is precisely the regime where the policy-gradient methods of Chapters 4–6 are at home: we can sample complete responses, score them, and push up the probability of the good ones.
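The MDP reading above can be made concrete with a toy rollout. The following is a minimal sketch, not an LLM: `VOCAB`, `toy_policy`, and `terminal_reward` are hypothetical stand-ins invented for illustration. It only shows the shape of the loop: sample an action from the policy, append it to the state, and score the response once at the end.

```python
import random

VOCAB = ["the", "cat", "sat", "<eos>"]  # hypothetical toy vocabulary


def toy_policy(state):
    """Stand-in for the policy: one probability per token in VOCAB.

    A real LLM conditions on the full context; this toy version only
    makes <eos> more likely as the response grows.
    """
    n_generated = len(state["response"])
    p_eos = min(0.9, 0.1 + 0.2 * n_generated)
    p_other = (1.0 - p_eos) / (len(VOCAB) - 1)
    return [p_other] * (len(VOCAB) - 1) + [p_eos]


def terminal_reward(prompt, response):
    """Hypothetical whole-response reward: 1 if the response mentions 'cat'."""
    return 1.0 if "cat" in response else 0.0


def rollout(prompt, max_len=8, seed=0):
    """Generate one response token by token and score it at the end."""
    rng = random.Random(seed)
    state = {"prompt": list(prompt), "response": []}  # initial state: prompt only
    for _ in range(max_len):
        probs = toy_policy(state)
        action = rng.choices(VOCAB, weights=probs, k=1)[0]  # sample next token
        state["response"].append(action)  # deterministic transition: append
        if action == "<eos>":
            break
    reward = terminal_reward(prompt, state["response"])  # sparse, terminal
    return state["response"], reward


response, reward = rollout(["tell", "me", "a", "story"])
print(response, reward)
```

Nothing in the loop produces a reward before the episode ends, which is the sparse, terminal structure described above.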
Where does the reward come from? That is the central design question of this chapter. We will cover two settings: RLHF learns it from human preference comparisons; reasoning models often compute it from a verifier (does the final answer match the ground truth? does the code pass the tests?). The RL algorithm is largely the same in both cases — what changes is the source of the reward.
RLHF: Aligning Models with Human Preferences
The idea of optimising a policy against a learned reward model derived from human preference comparisons predates LLMs: it was demonstrated on simulated robotics and Atari by Christiano et al. (2017) and adapted to text generation, first for stylistic control (Ziegler et al., 2019) and then for summarisation (Stiennon et al., 2020). InstructGPT (Ouyang et al., 2022) assembled these pieces into the now-standard three-stage recipe that underlies ChatGPT and its successors, shown in Figure 3.
Figure 3: The three stages of RLHF as used in InstructGPT. (1) Supervised fine-tuning produces $\pi^{\text{SFT}}$, which also serves as the frozen reference $\pi_{\text{ref}}$. (2) A reward model $r_\phi$ is trained from human rankings of sampled outputs via a Bradley–Terry loss. (3) PPO optimises the policy $\pi_\theta$ to maximise $r_\phi$ minus a KL (Kullback–Leibler) penalty that keeps it close to the reference.
Stage 1: Supervised fine-tuning
A pretrained base model is fine-tuned on a dataset of human-written demonstrations: prompts paired with high-quality responses. This is ordinary supervised learning (next-token prediction on curated data), and it produces a policy $\pi^{\text{SFT}}$ that already follows instructions reasonably well. Crucially, this model plays a second role: a frozen copy of it becomes the reference policy $\pi_{\text{ref}}$ used to anchor the later RL stage.
Stage 2: Reward modelling from human preferences
Instead of asking humans to write ideal answers, which is slow and inconsistent, we ask them the much easier question of comparison: given a prompt and two candidate responses, which is better? For each prompt we sample several outputs from the SFT model and collect human rankings. A reward model $r_\phi(x, y)$, typically the SFT network with its language-modelling head replaced by a scalar output, is then trained to assign higher scores to preferred responses.
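The standard choice is the Bradley–Terry model of pairwise preference, under which the probability that the “winning” response $y_w$ is preferred to the “losing” one $y_l$ is $P(y_w \succ y_l \mid x) = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$. Fitting $r_\phi$ by maximum likelihood gives the loss

$$\mathcal{L}_{\text{RM}}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\Big[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \Big] \tag{1}$$

where $\sigma$ is the logistic sigmoid and $\mathcal{D}$ is the dataset of human preference triples $(x, y_w, y_l)$. The reward model distils thousands of human comparisons into a differentiable function that can score any response, including responses the policy has never been shown, which is what lets the next stage push beyond the quality ceiling of the demonstration data.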
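The pairwise loss is easy to check numerically. The sketch below assumes the reward model has already produced scalar scores for each winning and losing response; the score values are made up for illustration, and in practice the loss would be computed on tensors with automatic differentiation rather than on Python floats.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_loss(score_pairs):
    """Mean negative log-likelihood over (winner_score, loser_score) pairs."""
    total = 0.0
    for r_w, r_l in score_pairs:
        total += -log_sigmoid(r_w - r_l)
    return total / len(score_pairs)


# Hypothetical reward-model scores for two preference pairs.
agrees_with_humans = [(2.0, 0.0), (1.5, -0.5)]     # winner scored higher
disagrees_with_humans = [(0.0, 2.0), (-0.5, 1.5)]  # winner scored lower

print(bradley_terry_loss(agrees_with_humans))
print(bradley_terry_loss(disagrees_with_humans))
print(bradley_terry_loss([(0.3, 0.3)]))  # indifferent model: log 2
```

Only the difference of the two scores enters the loss, so the reward model is identified only up to an additive constant per prompt.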
Stage 3: Policy optimisation with PPO
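With a reward model in hand, the alignment problem becomes the RL problem of Figure 2: find a policy that produces high-reward responses. InstructGPT (Ouyang et al., 2022) solves it with PPO (Chapter 6). At the level of whole responses each prompt is a one-step bandit (one prompt, one response, a single reward at the end), but inside the response it still unfolds as a long token-level MDP with many actions along the way. The target is a KL-regularised reward objective

$$\max_{\theta}\; J(\theta) = \mathbb{E}_{x \sim \mathcal{D}_{\text{RL}},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}_{\text{RL}}}\Big[ D_{\text{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big) \Big] \tag{2}$$

where $\pi_\theta$ is the language-model policy being updated (parameters $\theta$, as in Text Generation as a Reinforcement Learning Problem); $\mathcal{D}_{\text{RL}}$ is the dataset of prompts on which RL is run (distinct from the preference set $\mathcal{D}$ of Stage 2); $x$ is a prompt; $y$ is a complete response sampled token by token from $\pi_\theta(\cdot \mid x)$; $r_\phi(x, y)$ is the scalar score from the Stage-2 reward model (parameters $\phi$ are frozen during RL); $\pi_{\text{ref}}$ is the frozen reference policy from Stage 1 (a copy of $\pi^{\text{SFT}}$); $\beta > 0$ is a hyperparameter trading human-preference reward against fidelity to the reference; and $D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})$ measures how far the optimised policy has drifted from the SFT model. For an autoregressive policy the KL decomposes over tokens as

$$D_{\text{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\left[ \sum_{t=1}^{T} \log \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\text{ref}}(y_t \mid x, y_{<t})} \right] \tag{3}$$

The first term in (2) rewards completions the reward model likes; the second penalises moving off the SFT distribution, where $r_\phi$ is trustworthy. Without it the policy quickly learns to reward-hack, producing degenerate, repetitive, or off-distribution text that scores highly but humans dislike.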
For clarity we present the objective in its core form. InstructGPT’s full recipe adds an optional pretraining-mixing term (the “PPO-ptx” variant): a language-modelling loss on the pretraining corpus folded into (2) to curb the alignment tax, that is, the regression on standard NLP benchmarks that pure RLHF can cause (Ouyang et al., 2022). We omit it here, as it is orthogonal to the RL mechanics.
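Per-token rewards. PPO needs a scalar reward at every time step (Chapters 5–6). Because $r_\phi$ judges only the whole response, InstructGPT assigns it at the final token and folds the KL penalty in at every token (Ouyang et al., 2022):

$$r_t = \mathbb{1}[t = T]\; r_\phi(x, y) \;-\; \beta \log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{ref}}(a_t \mid s_t)} \tag{4}$$

where $s_t = (x, y_{<t})$ is the MDP state from Text Generation as a Reinforcement Learning Problem (with $y_{<t}$ the tokens before position $t$), $a_t = y_t$ is the token sampled at step $t$, and $\log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{ref}}(a_t \mid s_t)}$ is the per-token log-ratio whose sum over $t$, averaged over sampled responses, is the sequence KL in (3). Intermediate tokens receive only KL shaping; the learned preference reward lands once, when the response is complete.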
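The per-token reward just described takes only a few lines to compute. This is a minimal sketch assuming the per-token log-probabilities of the sampled response under the policy and the reference are already available; the numbers, the function name `per_token_rewards`, and the value of `beta` are illustrative assumptions.

```python
def per_token_rewards(logp_policy, logp_ref, rm_score, beta):
    """KL shaping at every token, reward-model score added on the last one."""
    if len(logp_policy) != len(logp_ref) or not logp_policy:
        raise ValueError("need two non-empty log-prob lists of equal length")
    last = len(logp_policy) - 1
    rewards = []
    for t, (lp, lr) in enumerate(zip(logp_policy, logp_ref)):
        log_ratio = lp - lr          # per-token log(pi_theta / pi_ref)
        r_t = -beta * log_ratio      # KL penalty applied at every step
        if t == last:
            r_t += rm_score          # preference reward lands once, at the end
        rewards.append(r_t)
    return rewards


# Hypothetical log-probs for a three-token response.
logp_policy = [-1.0, -0.5, -2.0]
logp_ref = [-1.2, -0.5, -1.0]
rewards = per_token_rewards(logp_policy, logp_ref, rm_score=3.0, beta=0.1)
print(rewards)
print(sum(rewards))  # rm_score minus beta times the summed log-ratio
```

Summing the per-token rewards recovers the sequence-level quantity: the reward-model score minus `beta` times the sampled log-ratio of the whole response.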
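Advantages and the full PPO loss. With per-token rewards in hand, training follows the actor–critic recipe of Chapters 5–6. A value network $V_\psi(s_t)$ (in InstructGPT, initialised from the reward model; Ouyang et al., 2022) estimates how much future (KL-adjusted) return remains from state $s_t$. Generalized Advantage Estimation (GAE) (Schulman et al., 2016) turns the reward stream $r_t$ into advantages $\hat{A}_t$ and return targets $\hat{R}_t$ (both defined in Chapter 6); the policy is then updated by maximising the clipped surrogate from Chapter 6, together with value and entropy terms:

$$J^{\text{PPO}}(\theta, \psi) = \mathbb{E}_{x \sim \mathcal{D}_{\text{RL}},\; y \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[ \sum_{t=1}^{T} \Big( \min\big( \rho_t(\theta)\, \hat{A}_t,\; \operatorname{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t \big) \;-\; c_1 \big( V_\psi(s_t) - \hat{R}_t \big)^2 \;+\; c_2\, \mathcal{H}\big[\pi_\theta(\cdot \mid s_t)\big] \Big) \right] \tag{5}$$

The expectation is over prompts $x \sim \mathcal{D}_{\text{RL}}$ and responses $y$ sampled from the old policy $\pi_{\theta_{\text{old}}}$ before each PPO update (the usual importance-sampling setup from Chapter 6). For each token $t$ in response $y$: $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is the importance ratio; $\hat{A}_t$ is the GAE advantage computed from $r_t$; $\epsilon$ is PPO’s clip width (typically 0.2); $V_\psi$ is the critic with parameters $\psi$; $\hat{R}_t$ is the bootstrapped return target; $\mathcal{H}[\pi_\theta(\cdot \mid s_t)]$ is the policy entropy at $s_t$; and $c_1$, $c_2$ weight the auxiliary terms. As in vanilla PPO, the rollout buffer is frozen and reused for $K$ epochs of mini-batch SGD before $\theta_{\text{old}} \leftarrow \theta$ and fresh responses are sampled.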
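The two computational pieces of this recipe, GAE and the clipped policy term, can be sketched on plain lists. The sketch assumes a single finished response whose value after the last token is zero; the defaults `gamma=1.0` and `lam=0.95` are illustrative choices rather than values taken from InstructGPT, and the value and entropy terms are left out for brevity.

```python
import math


def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one finished response.

    values[t] is the critic's estimate at step t; the value after the
    final token is taken to be 0 because the episode has ended.
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values)]  # critic targets
    return advantages, returns


def clipped_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Mean PPO clipped policy term over tokens (to be maximised)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                      # importance ratio
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))
        total += min(ratio * adv, clipped * adv)       # pessimistic bound
    return total / len(advantages)


# Sparse terminal reward with a constant critic guess of 0.5.
advantages, returns = gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5])
print(advantages, returns)
print(clipped_surrogate([-1.0, -0.4, -0.9], [-1.1, -0.5, -1.0], advantages))
```

Because the `min` takes the more pessimistic of the two terms, a large ratio is capped only when the advantage is positive; with a negative advantage the unclipped, more negative term is kept.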
GRPO: Reinforcement Learning Without a Critic
PPO is an actor–critic method (Chapter 5): alongside the policy it trains a value network to estimate the baseline used to form the advantage. For LLMs this critic is costly — it is typically another network the size of the policy, doubling memory and adding a separately-tuned learning problem — and it is awkward, because a reliable per-token value estimate is hard to learn when the reward only arrives at the very end of a long sequence.
Group Relative Policy Optimization (GRPO) (Shao et al., 2024), introduced for the DeepSeekMath models, removes the critic entirely. The insight is that the only thing the baseline does is tell us whether an action was better than average; for a given prompt we can estimate “average” empirically by sampling a whole group of responses and comparing them to each other. For a prompt $x$, GRPO samples $G$ responses $y_1, \dots, y_G$ from the current policy, scores each with the reward $r_i = r(x, y_i)$, and uses the group-normalised reward as the advantage $\hat{A}_i$ shared by every token of response $y_i$:[1]

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)} \tag{6}$$
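Responses better than the group mean get a positive advantage and are reinforced; worse-than-average ones are suppressed. This is the classic REINFORCE baseline (Chapter 4) estimated by Monte Carlo within the group, rather than by a learned value function. The policy is then updated with the familiar PPO-style clipped objective, plus a KL penalty to a reference model:

$$J^{\text{GRPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}_{\text{RL}},\; \{y_i\}_{i=1}^{G} \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\big( \rho_i(\theta)\, \hat{A}_i,\; \operatorname{clip}(\rho_i(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i \big) \;-\; \beta\, D_{\text{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big) \right] \tag{7}$$

where $\rho_i(\theta) = \pi_\theta(y_i \mid x) / \pi_{\theta_{\text{old}}}(y_i \mid x)$ is the importance ratio and $\mathcal{D}_{\text{RL}}$ is the set of prompts on which RL is run. We show (7) at the sequence level for clarity; in the DeepSeekMath implementation the clip and the KL penalty are applied per token, averaging over the tokens within each response. Compared with the full PPO loss (5), GRPO keeps the clip and the KL but drops the value and entropy terms and replaces the GAE advantage with the group-relative one of (6).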
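A sequence-level sketch of the group-relative advantage and the resulting objective follows. Several details are assumptions made for illustration: the population standard deviation with a small `eps` stabiliser, the default `beta`, and the per-sample KL estimate `exp(d) - d - 1` with `d = log pi_ref - log pi_theta`, which is the non-negative estimator described in the DeepSeekMath paper but is applied here to whole-sequence log-probabilities rather than per token.

```python
import math
import statistics


def group_advantages(rewards, eps=1e-8):
    """Normalise rewards within one prompt's group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std: an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


def grpo_objective(logp_new, logp_old, logp_ref, rewards,
                   clip_eps=0.2, beta=0.04):
    """Sequence-level clipped objective with a KL penalty (to be maximised)."""
    advantages = group_advantages(rewards)
    total = 0.0
    for ln, lo, lr, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(ln - lo)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        surrogate = min(ratio * adv, clipped * adv)
        d = lr - ln
        kl_estimate = math.exp(d) - d - 1.0  # non-negative, zero when equal
        total += surrogate - beta * kl_estimate
    return total / len(rewards)


# Four sampled answers to one prompt; a verifier marks two as correct.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_advantages(rewards))
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # no contrast, no signal
```

A group in which every response earns the same reward yields all-zero advantages, so prompts that are always solved or never solved contribute no policy gradient under this scheme.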
This design is an especially good fit for tasks with verifiable rewards — mathematics, coding, logic — where a response can be checked automatically (right answer? tests pass?) and no learned reward model is needed at all. The reward is then a cheap, exact, hack-resistant signal, and group sampling provides a low-variance baseline for free. That combination is what makes GRPO the workhorse of modern reasoning models.
RL for Reasoning: DeepSeek-R1
The most striking recent demonstration of RL in LLMs is DeepSeek-R1 (Guo et al., 2025), which showed that reasoning ability can be incentivised by reinforcement learning rather than imitated from human-written chains of thought. Figure 4 summarises the two models the paper introduces.
Figure 4: Top: DeepSeek-R1-Zero is trained with GRPO directly on the base model using only rule-based, verifiable rewards, and develops long chains of thought on its own. Bottom: DeepSeek-R1 wraps that reasoning RL in a four-stage pipeline (cold-start SFT, reasoning RL, rejection-sampling SFT, and a final broad RL stage) to fix readability and broaden the model’s behaviour.
R1-Zero: reasoning from pure RL
DeepSeek-R1-Zero starts from the DeepSeek-V3 base model and applies GRPO with no supervised fine-tuning at all — skipping Stage 1 of RLHF entirely. The reward is purely rule-based: an accuracy reward that checks whether the final answer is correct (for maths problems with known solutions, or code that passes tests) and a format reward that requires the model to place its reasoning inside designated <think> tags. There is no learned reward model and therefore little room for reward hacking.
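A rule-based reward of this kind can be sketched with a regular expression and a string comparison. Everything specific below is an assumption for illustration: the exact-match answer check, the requirement that the answer follows the closing tag, and the equal weighting of the two components are not taken from the DeepSeek-R1 paper.

```python
import re

# Reasoning inside <think>...</think>, followed by the final answer.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(response):
    """1.0 if the response has non-empty reasoning tags and then an answer."""
    return 1.0 if THINK_PATTERN.match(response.strip()) else 0.0


def accuracy_reward(response, ground_truth):
    """1.0 if the text after the reasoning block equals the known answer."""
    match = THINK_PATTERN.match(response.strip())
    answer = match.group(2) if match else response
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0


def rule_based_reward(response, ground_truth):
    """Hypothetical combination: unweighted sum of the two components."""
    return accuracy_reward(response, ground_truth) + format_reward(response)


print(rule_based_reward("<think>2 + 2 is 4</think> 4", "4"))   # correct, formatted
print(rule_based_reward("4", "4"))                             # correct, no tags
print(rule_based_reward("<think>maybe 5?</think> 5", "4"))     # formatted, wrong
```

Because both components are deterministic checks, the scores can be fed directly into the group-relative advantage without training a reward model.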
Trained against this signal alone, the model spontaneously learns to generate longer and longer chains of thought, allocating more inference-time computation to harder problems, and exhibits emergent behaviours such as re-checking its own work and exploring alternative approaches — what the authors memorably call an “aha moment.” This is a notable result: complex reasoning strategies emerged from a scalar correctness reward and group-relative policy gradients, without ever being shown a human example of how to reason. The cost is that R1-Zero’s output is often hard to read — it mixes languages and produces unpolished, sometimes chaotic text — because the reward cares only about the final answer, not the legibility of the path to it.
R1: a multi-stage pipeline
To keep the reasoning gains while producing a usable assistant, DeepSeek-R1 surrounds the RL with additional stages. A small cold-start SFT on a few thousand high-quality long-chain-of-thought examples gives the model a readable starting style. Reasoning-oriented RL (GRPO again) then sharpens its problem-solving, now with an added language-consistency reward to curb the mixing seen in R1-Zero. Rejection sampling harvests the best responses from that RL checkpoint and combines them with general-purpose data for another round of SFT. A final RL stage across diverse prompts tunes for helpfulness and harmlessness. The result matches frontier proprietary reasoning models on maths and coding benchmarks — on the competition-maths benchmarks AIME 2024 and MATH-500, DeepSeek-R1 reaches 79.8% and 97.3% pass@1 (the fraction solved correctly on the first attempt), on par with OpenAI’s o1 — while remaining a coherent assistant, and, importantly, the recipe and weights were released openly.
Direct Preference Optimization (DPO)
RLHF is powerful but operationally heavy: it requires training a separate reward model and then running an on-policy RL loop that repeatedly samples from the model, scores those samples, and updates, and the whole procedure is sensitive to many hyperparameters. Direct Preference Optimization (DPO) (Rafailov et al., 2023) asks whether the RL stage can be skipped while optimising the same objective. Remarkably, it can.
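The starting point is a known fact about the KL-regularised objective (2): for a fixed reward $r(x, y)$, it has a closed-form optimal policy,

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\Big( \frac{1}{\beta}\, r(x, y) \Big) \tag{8}$$

where $Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x) \exp\big( r(x, y) / \beta \big)$ is a normalising constant. DPO’s key move is to invert this relationship: solving (8) for the reward expresses it in terms of the optimal policy and the reference,

$$r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x) \tag{9}$$

Substituting this implicit reward into the Bradley–Terry preference likelihood (1), with the trainable policy $\pi_\theta$ in place of $\pi^*$, makes the intractable $Z(x)$ cancel (it is the same for $y_w$ and $y_l$), leaving a loss that depends only on the policy being trained and the frozen reference:

$$\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right] \tag{10}$$

This is an ordinary supervised classification loss on preference pairs: it simply increases the policy’s relative log-probability of preferred responses over dispreferred ones, scaled by $\beta$ and anchored to the reference. No reward model is trained, and no responses are sampled during training. Figure 5 contrasts the two routes.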
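The DPO loss needs only four sequence log-probabilities per preference pair. The sketch below assumes those log-probabilities (each a sum of per-token log-probs) have been computed beforehand; the dictionary keys, the numbers, and the default `beta` are hypothetical choices for illustration.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(batch, beta=0.1):
    """Mean DPO loss over preference pairs.

    Each example holds sequence log-probs of the winning (w) and losing (l)
    responses under the trained policy and the frozen reference.
    """
    total = 0.0
    for ex in batch:
        implicit_w = beta * (ex["policy_w"] - ex["ref_w"])  # implicit reward
        implicit_l = beta * (ex["policy_l"] - ex["ref_l"])
        total += -log_sigmoid(implicit_w - implicit_l)
    return total / len(batch)


# Hypothetical log-probs: at initialisation the policy equals the reference.
at_init = [{"policy_w": -12.0, "ref_w": -12.0, "policy_l": -10.0, "ref_l": -10.0}]
# After some training the policy favours the preferred response more.
trained = [{"policy_w": -9.0, "ref_w": -12.0, "policy_l": -14.0, "ref_l": -10.0}]

print(dpo_loss(at_init))   # log 2: both implicit rewards are zero
print(dpo_loss(trained))   # lower: the preferred response gained relative mass
```

At initialisation the loss is log 2 regardless of which response the reference itself finds more likely, because only the change relative to the reference counts as reward.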
Figure 5: RLHF (top) trains an explicit reward model and then optimises the policy with an online RL loop. DPO (bottom) collapses both stages into a single offline loss on preference pairs, with the reward implicit in the policy–reference log-ratio. Both target the same KL-regularised objective.
So is DPO reinforcement learning? It optimises exactly the RLHF objective of (2), and the policy–reference log-ratio in (10) is the implicit reward — so in that sense it inherits the RL formulation. But mechanically it is supervised learning: a closed-form, offline loss over a fixed preference dataset, with no sampling, no reward model, and no exploration. That trade-off has consequences. DPO is far simpler and more stable to train, which has made it the default for preference tuning at moderate scale. But because it never samples from the current policy, it cannot discover and reinforce new high-reward behaviours the way an online RL loop can — it can only re-weight responses present in the preference data. This is precisely why reasoning models, which need to explore long solution paths and be rewarded by a verifier, rely on online methods like GRPO rather than DPO.
Table 1 consolidates the three approaches along the axes that distinguish them.
Table 1: The three preference-optimisation routes of this chapter at a glance. All target the same KL-regularised objective (2); they differ in machinery and in what they can learn.
| | RLHF (PPO) | GRPO | DPO |
|---|---|---|---|
| Critic / value network | Yes | No (group baseline) | No |
| Reward source | Learned reward model | Verifier or reward model | Implicit (policy–reference log-ratio) |
| Online / offline | Online (samples each update) | Online (samples a group) | Offline (fixed preference data) |
| Explores new behaviours? | Yes | Yes | No (re-weights existing data) |
| KL anchor to reference | Yes (explicit penalty) | Yes (explicit penalty) | Yes (implicit in the loss) |
Summary
This final chapter showed that text generation is a reinforcement learning problem in disguise. Autoregressive decoding is a Markov decision process — the state is the prompt plus tokens generated so far, each next token is an action, and quality is judged by a sparse terminal reward — so the policy-gradient methods of Chapters 4–6 apply directly, albeit with very long horizons and an enormous action space at every step.
The dominant alignment recipe is RLHF (Ouyang et al., 2022): supervised fine-tuning on demonstrations, a Bradley–Terry reward model trained from human preference comparisons, and PPO against a KL-regularised objective that rewards high-scoring completions while penalising drift from a frozen reference policy. The KL anchor is essential; without it the policy exploits imperfections in the learned reward model rather than genuinely improving.
GRPO (Shao et al., 2024) offers a critic-free alternative: sample a group of responses per prompt, normalise their rewards within the group to form advantages, and update with the same PPO clip and KL penalty. This is especially effective when rewards are verifiable, as in DeepSeek-R1 (Guo et al., 2025), where rule-based correctness and format rewards alone were enough to incentivise long chains of thought, including in R1-Zero with no supervised fine-tuning at all. Finally, DPO (Rafailov et al., 2023) reaches the same KL-regularised objective offline, as a supervised loss on preference pairs, but sacrifices the online exploration that reasoning RL depends on.
Further Reading
Nathan Lambert, Reinforcement Learning from Human Feedback (Lambert, 2026): a comprehensive, freely available book on RLHF and LLM post-training, covering instruction tuning, reward modelling, PPO, rejection sampling, DPO, GRPO, and emerging topics such as evaluation and tool use, with companion code and lecture videos.
Hugging Face, Open R1 for Students (Chapter 12) (Hugging Face, 2025): a hands-on course module on RL for language models, covering RLHF basics, the DeepSeek-R1 paper, GRPO training with the TRL library, and a practical alignment exercise tied to the Open R1 community project.
Unsloth, Reinforcement Learning (RL) Guide (Unsloth AI, 2025): a practitioner-oriented walkthrough from RLHF and PPO through GRPO and verifiable rewards, with memory-efficient training recipes, reward-function design tips, Colab notebooks, and a step-by-step tutorial for training a reasoning model with GRPO on limited hardware.
- Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P. F., Leike, J., & Lowe, R. (2022). Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems (NeurIPS), 35, 27730–27744.
- Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y. K., Wu, Y., & Guo, D. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv Preprint arXiv:2402.03300.
- Guo, D., & others. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature, 645(8081), 633–638. https://doi.org/10.1038/s41586-025-09422-z
- Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems (NeurIPS), 36.
- Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep Reinforcement Learning from Human Preferences. Advances in Neural Information Processing Systems (NeurIPS), 30.
- Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., & Irving, G. (2019). Fine-Tuning Language Models from Human Preferences. arXiv Preprint arXiv:1909.08593.
- Stiennon, N., Ouyang, L., Wu, J., Ziegler, D. M., Lowe, R., Voss, C., Radford, A., Amodei, D., & Christiano, P. F. (2020). Learning to Summarize from Human Feedback. Advances in Neural Information Processing Systems (NeurIPS), 33, 3008–3021.
- Schulman, J., Moritz, P., Levine, S., Jordan, M. I., & Abbeel, P. (2016). High-Dimensional Continuous Control Using Generalized Advantage Estimation. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1506.02438
- Lambert, N. (2026). Reinforcement Learning from Human Feedback. Online. https://rlhfbook.com
- Hugging Face. (2025). Open R1 for Students: Reinforcement Learning and LLMs. Online course, Chapter 12. https://huggingface.co/docs/course/main/en/chapter12/1
- Unsloth AI. (2025). Reinforcement Learning (RL) Guide. Online documentation. https://unsloth.ai/docs/get-started/reinforcement-learning-rl-guide
- Liu, Z., Chen, C., Li, W., Qi, P., Pang, T., Du, C., Lee, W. S., & Lin, M. (2025). Understanding R1-Zero-Like Training: A Critical Perspective. arXiv Preprint arXiv:2503.20783.