Q-Learning

In the first chapter we formalized the reinforcement learning problem as a Markov Decision Process and introduced the notions of policy, value function, and Bellman equation. The central question left open was: how does an agent actually learn a good policy from experience? This chapter answers that question for the simple tabular setting. We develop value-based methods, algorithms that learn a value function and derive a policy from it, culminating in the Q-Learning algorithm (Watkins & Dayan, 1992), a key stepping stone toward the deep RL methods we will study in subsequent chapters.

We begin with the two types of value functions (state-value and action-value), revisit the Bellman equation in more depth, and compare two model-free strategies for estimating values: Monte Carlo and Temporal Difference learning. We then present Q-Learning itself — an off-policy, TD-based algorithm that converges to the optimal action-value function — and walk through a worked example.


Value-Based Methods

Recall from the previous chapter that there are two broad families of approaches to finding a good policy:

Policy-based methods learn the policy $\pi(a \mid s)$ directly, by parameterizing it and optimizing with gradient ascent.
Value-based methods learn a value function that estimates how good each state or state-action pair is, and then derive a policy from it (e.g., by acting greedily).

In this chapter we focus on value-based methods. The core idea is simple: if we can accurately estimate the value of every action in every state, the optimal policy is just the one that always picks the best action. The challenge, of course, is learning those values from experience.

The State-Value Function

The state-value function $V^\pi(s)$ measures the expected cumulative discounted reward an agent receives when starting from state $s$ and following policy $\pi$ thereafter (Sutton & Barto, 2018):

$$
V^\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s \right] \tag{1}
$$

Here $G_t$ denotes the return, the total discounted reward collected from time step $t$ onward: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$, where $R_{t+k+1}$ is the reward received at step $t+k+1$ and $\gamma \in [0, 1]$ is the discount factor that down-weights future rewards. In continuing tasks, choosing $\gamma < 1$ keeps this infinite sum finite when rewards are bounded.

Intuitively, $V^\pi(s)$ answers the question: “How good is it to be in state $s$ if I follow policy $\pi$ from here on?”
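As a small illustration of the return $G_t$, the sketch below computes the discounted sum for a finite list of rewards. The function name `discounted_return` and the reward values are assumptions chosen for this example only.

```python
def discounted_return(rewards, gamma):
    """Return G_t = R_{t+1} + gamma * R_{t+2} + ... for a finite reward list."""
    g = 0.0
    # Work backwards so each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # hypothetical R_{t+1}, R_{t+2}, R_{t+3}
print(round(discounted_return(rewards, gamma=0.9), 4))  # 1 + 0.9*0 + 0.81*2 = 2.62
```

Averaging this quantity over many trajectories that start in $s$ and follow $\pi$ gives an estimate of $V^\pi(s)$.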

The Action-Value Function

The action-value function $Q^\pi(s, a)$ is closely related but conditions on both the current state and the action taken (Sutton & Barto, 2018):

$$
Q^\pi(s, a) = \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right] \tag{2}
$$

This answers: “How good is it to take action $a$ in state $s$, and then follow policy $\pi$?”

Figure 2: State-value $V(s)$ assigns a single number to each state, while action-value $Q(s, a)$ assigns a number to each state-action pair.

Why Q Is More Useful for Control

With $V(s)$ alone, the agent cannot decide which action to take without a model of the environment: it would need to evaluate $V(s')$ for every possible successor state $s'$ reached by each action, which requires knowledge of the transition function $p(s' \mid s, a)$.

With $Q(s, a)$, the agent simply selects:

$$
\pi(s) = \arg\max_a Q(s, a) \tag{3}
$$

No model is needed. This makes the action-value function the natural foundation for model-free control, a widely used paradigm in deep RL. If we can learn the optimal action-value function $Q^*$, we immediately obtain an optimal policy by acting greedily with respect to it.
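A minimal sketch of greedy action selection from a tabular $Q$ follows. The table values and the helper name `greedy_action` are made up for illustration; ties are broken in favor of the lowest action index, which is only one of several reasonable choices.

```python
def greedy_action(q_values):
    """Return the index of the largest Q-value (ties go to the lowest index)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Hypothetical Q-table: Q[s][a] for 2 states and 3 actions.
Q = [
    [0.1, 0.5, 0.2],
    [0.7, 0.3, 0.0],
]
policy = [greedy_action(Q[s]) for s in range(len(Q))]
print(policy)  # [1, 0]
```

Note that nothing in this lookup touches the transition function: the table alone determines the action.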

The Bellman Equation

The Bellman equation is the mathematical backbone of nearly every RL algorithm. It expresses a recursive relationship: the value of a state equals the expected immediate reward plus the discounted expected value of the successor state (Sutton & Barto, 2018).

Bellman Equation for $V^\pi$

$$
V^\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma \, V^\pi(S_{t+1}) \mid S_t = s \right] \tag{4}
$$

Expanding the expectation:

$$
V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \left[ r(s, a) + \gamma \, V^\pi(s') \right] \tag{5}
$$

Figure 3: Bellman backup diagram. The value of state $s$ depends on the immediate rewards $r$ and the values of successor states $s'$, weighted by the policy $\pi$ and transition probabilities $p$.
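One way to read Equation (5) is as an update rule: start from an arbitrary guess for $V^\pi$ and apply the right-hand side repeatedly until the values stop changing (iterative policy evaluation). The sketch below does this for a hypothetical two-state MDP. The transition table `P`, the reward table `R`, and the uniform random policy are assumptions invented for illustration; they do not come from this chapter.

```python
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]  # 0 = stay, 1 = move to the other state

# Hypothetical MDP: P[(s, a)] lists (probability, next_state) pairs.
P = {
    (0, 0): [(1.0, 0)], (0, 1): [(1.0, 1)],
    (1, 0): [(1.0, 1)], (1, 1): [(1.0, 0)],
}
# Expected reward r(s, a): only staying in state 1 pays off.
R = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.0}


def evaluate_policy(pi, tol=1e-10):
    """Apply the Bellman expectation backup for V^pi until values stop changing."""
    V = {s: 0.0 for s in STATES}
    while True:
        new_V = {
            s: sum(
                pi[s][a] * sum(p * (R[(s, a)] + GAMMA * V[s2]) for p, s2 in P[(s, a)])
                for a in ACTIONS
            )
            for s in STATES
        }
        if max(abs(new_V[s] - V[s]) for s in STATES) < tol:
            return new_V
        V = new_V


uniform = {s: [0.5, 0.5] for s in STATES}  # pi(a | s) = 0.5 for both actions
V = evaluate_policy(uniform)
print({s: round(v, 3) for s, v in V.items()})  # {0: 2.25, 1: 2.75}
```

The result can be checked by hand: substituting $V(0) = 2.25$ and $V(1) = 2.75$ into Equation (5) reproduces both values. Note that this procedure needs the model `P`, which is exactly what the model-free methods later in the chapter avoid.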

Bellman Equation for $Q^\pi$

Similarly, the action-value function satisfies:

$$
Q^\pi(s, a) = \sum_{s'} p(s' \mid s, a) \left[ r(s, a) + \gamma \sum_{a'} \pi(a' \mid s') \, Q^\pi(s', a') \right] \tag{6}
$$

Bellman Optimality Equations

The optimal value functions, denoted $V^*$ and $Q^*$, satisfy:

$$
V^*(s) = \max_a \sum_{s'} p(s' \mid s, a) \left[ r(s, a) + \gamma \, V^*(s') \right] \tag{7}
$$

$$
Q^*(s, a) = \sum_{s'} p(s' \mid s, a) \left[ r(s, a) + \gamma \max_{a'} Q^*(s', a') \right] \tag{8}
$$

The key difference from the policy-dependent versions is that the $\max$ operator replaces the policy-weighted average. If we could solve these equations directly, we would obtain the optimal policy $\pi^*(s) = \arg\max_a Q^*(s, a)$. In practice, solving them exactly requires the transition model and is intractable for large problems, but we can estimate $Q^*$ from experience using algorithms like Q-Learning.
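When the model is known and the problem is tiny, the optimality equation for $Q^*$ can be solved by repeated backups (a form of value iteration). The sketch below applies Equation (8) to a hypothetical two-state MDP; the tables `P` and `R` are assumptions invented for this example. It is shown only as a point of contrast with the model-free methods that follow.

```python
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]  # 0 = stay, 1 = move to the other state

# Hypothetical MDP: P[(s, a)] lists (probability, next_state) pairs.
P = {
    (0, 0): [(1.0, 0)], (0, 1): [(1.0, 1)],
    (1, 0): [(1.0, 1)], (1, 1): [(1.0, 0)],
}
# Expected reward r(s, a): only staying in state 1 pays off.
R = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 0.0}


def q_value_iteration(tol=1e-10):
    """Repeat the Bellman optimality backup for Q until it stops changing."""
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    while True:
        new_Q = {
            (s, a): sum(
                p * (R[(s, a)] + GAMMA * max(Q[(s2, a2)] for a2 in ACTIONS))
                for p, s2 in P[(s, a)]
            )
            for s in STATES
            for a in ACTIONS
        }
        if max(abs(new_Q[k] - Q[k]) for k in Q) < tol:
            return new_Q
        Q = new_Q


Q_star = q_value_iteration()
pi_star = {s: max(ACTIONS, key=lambda a: Q_star[(s, a)]) for s in STATES}
print(pi_star)  # {0: 1, 1: 0}: move to state 1, then stay there
```

In this toy problem staying in state 1 forever is worth $1 / (1 - 0.9) = 10$, so the backup converges to $Q^*(1, \text{stay}) = 10$ and $Q^*(0, \text{move}) = 9$.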

Monte Carlo vs Temporal Difference Learning

Both Monte Carlo (MC) and Temporal Difference (TD) methods estimate value functions from experience, without requiring a model. They differ fundamentally in when and how they update their estimates.

Monte Carlo Methods

Monte Carlo methods wait until the end of an episode to compute the actual return $G_t$, the sum of discounted rewards from time step $t$ to the terminal state (Sutton & Barto, 2018, Ch. 5):

$$
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T \tag{9}
$$

The value estimate is then updated toward this observed return:

$$
V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right] \tag{10}
$$

where $\alpha$ is the learning rate. Since $G_t$ is a sample of the true expected return, MC estimates are unbiased. However, because each trajectory can vary widely, the estimates have high variance. MC also requires complete episodes, limiting it to episodic tasks.
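The Monte Carlo update can be sketched as follows. This is an every-visit, constant-$\alpha$ variant, which is one of several Monte Carlo flavors; the episode format (a list of `(state, reward)` pairs) and the state names are assumptions made for this example.

```python
def mc_update(V, episode, alpha, gamma):
    """Every-visit constant-alpha Monte Carlo update from one finished episode.

    `episode` is a list of (state, reward) pairs, where reward is R_{t+1},
    the reward received after leaving that state.
    """
    g = 0.0
    # Walk the episode backwards so that G_t = R_{t+1} + gamma * G_{t+1}.
    for state, reward in reversed(episode):
        g = reward + gamma * g
        V[state] += alpha * (g - V[state])
    return V


V = {"A": 0.0, "B": 0.0}
episode = [("A", 0.0), ("B", 1.0)]  # hypothetical trajectory: A -> B -> terminal
mc_update(V, episode, alpha=0.5, gamma=0.9)
print(V)  # {'A': 0.45, 'B': 0.5}
```

The update can only run once the episode has ended, because every $G_t$ depends on all later rewards.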

Temporal Difference Learning

Temporal Difference learning, introduced by Sutton (1988), updates the value estimate at every time step using a bootstrapped target:

$$
V(S_t) \leftarrow V(S_t) + \alpha \left[ \underbrace{R_{t+1} + \gamma \, V(S_{t+1})}_{\text{TD target}} - V(S_t) \right] \tag{11}
$$

The quantity $\delta_t = R_{t+1} + \gamma \, V(S_{t+1}) - V(S_t)$ is called the TD error. Instead of waiting for the full return, TD uses the immediate reward plus the current estimate of the next state’s value. This bootstrapping introduces bias (because $V(S_{t+1})$ is itself an estimate), but it reduces variance and allows learning to happen online, step by step, even in continuing (non-episodic) tasks.
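A single TD(0) step might look like the sketch below. The function name `td0_update`, the `terminal` flag (which treats the value of a terminal state as zero), and the numbers are assumptions for illustration.

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """One TD(0) step: move V[s] toward r + gamma * V[s_next]; return the TD error."""
    # A terminal successor has value 0, so the target reduces to the reward.
    target = r if terminal else r + gamma * V[s_next]
    td_error = target - V[s]
    V[s] += alpha * td_error
    return td_error


V = {"A": 0.0, "B": 0.5}
delta = td0_update(V, "A", r=0.0, s_next="B", alpha=0.5, gamma=0.9)
print(delta, V["A"])  # 0.45 0.225
```

Unlike the Monte Carlo update, this needs only one transition $(S_t, R_{t+1}, S_{t+1})$, so it can run after every step.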

Figure 4: Monte Carlo computes the actual return over the full trajectory. Temporal Difference bootstraps from a single transition.

Comparison

| Property | Monte Carlo | Temporal Difference |
|---|---|---|
| Updates | After full episode | After every step |
| Target | Actual return $G_t$ | Estimated $R_{t+1} + \gamma V(S_{t+1})$ |
| Bias | Unbiased | Biased (bootstrap) |
| Variance | High | Lower |
| Data efficiency | Lower | Higher |
| Requires complete episodes | Yes | No |
| Bootstrapping | No | Yes |

As Sutton and Barto put it, TD learning “combines the sampling of Monte Carlo with the bootstrapping of dynamic programming” (Sutton & Barto, 2018). Q-Learning, which we turn to next, is a TD method.

Q-Learning

Q-Learning is an off-policy, value-based algorithm that uses temporal difference updates to learn the optimal action-value function $Q^*$ directly (Watkins & Dayan, 1992). It was introduced by Watkins in his 1989 PhD thesis and is one of the most fundamental algorithms in reinforcement learning.

The Q-Table

In the tabular setting, Q-Learning maintains a table $Q(s, a)$ with one entry for every state-action pair. Each entry stores the agent’s current estimate of how good it is to take action $a$ in state $s$. The table is initialized to zeros (or small random values) and updated through interaction with the environment.

Figure 5: A Q-table with states as rows and actions as columns. The highlighted cells indicate the greedy action (highest Q-value) for each state. The policy simply picks the action with the maximum Q-value.

Epsilon-Greedy Exploration

To balance exploration and exploitation, Q-Learning is commonly paired with an $\epsilon$-greedy strategy for action selection:

With probability $1 - \epsilon$: exploit by selecting $\arg\max_a Q(s, a)$
With probability $\epsilon$: explore by selecting an action uniformly at random

The exploration rate $\epsilon$ typically starts at 1.0 (fully random) and is gradually decayed over training, so that the agent explores broadly at first and increasingly exploits its learned Q-values as they become more accurate. Common decay strategies include linear decay and exponential decay.
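The selection rule and the two decay schedules mentioned above could be sketched as follows. The function names and the default values (`eps_end=0.05`, `rate=0.99`) are assumptions for illustration, not recommended settings.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a uniformly random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def linear_decay(step, total_steps, eps_start=1.0, eps_end=0.05):
    """Interpolate from eps_start to eps_end over total_steps, then hold at eps_end."""
    frac = min(step / total_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)


def exponential_decay(step, rate=0.99, eps_start=1.0, eps_end=0.05):
    """Multiply epsilon by `rate` each step, never dropping below eps_end."""
    return max(eps_end, eps_start * rate ** step)


rng = random.Random(0)  # seeded only so the example is repeatable
q_row = [0.1, 0.9, 0.3]  # hypothetical Q-values for one state
for step in (0, 50, 100):
    eps = linear_decay(step, total_steps=100)
    print(step, round(eps, 3), epsilon_greedy(q_row, eps, rng))
```

With `epsilon=0.0` the function always returns the greedy action, and with `epsilon=1.0` it ignores the Q-values entirely.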

The Q-Learning Update Rule

After taking action $A_t$ in state $S_t$, observing reward $R_{t+1}$ and next state $S_{t+1}$, the Q-value is updated:

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ \underbrace{R_{t+1} + \gamma \max_a Q(S_{t+1}, a)}_{\text{TD target}} \;-\; \underbrace{Q(S_t, A_t)}_{\text{current estimate}} \right] \tag{12}
$$

Breaking this down:

$\alpha$ is the learning rate, controlling how much the estimate shifts toward the new information.
$R_{t+1} + \gamma \max_a Q(S_{t+1}, a)$ is the TD target — the immediate reward plus the discounted value of the best action in the next state.
$R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t)$ is the TD error — the difference between the target and the current estimate.

The critical feature is the $\max$ operator: the update always uses the value of the best action at the next state, regardless of which action the $\epsilon$-greedy policy actually selects there. This is what makes Q-Learning off-policy.
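To make the update concrete, the sketch below applies it to one hypothetical transition. The table contents, the function name `q_learning_update`, and the `terminal` flag are assumptions chosen for illustration.

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma, terminal=False):
    """Apply one Q-Learning update to Q in place and return the TD error."""
    # Terminal states have value 0, so only the reward remains in the target.
    best_next = 0.0 if terminal else max(Q[s_next])
    td_target = r + gamma * best_next
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Hypothetical table with 2 states and 2 actions.
Q = [[0.0, 0.0], [1.0, 2.0]]
delta = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1, alpha=0.1, gamma=0.9)
print(round(delta, 3), round(Q[0][1], 3))  # 2.8 0.28
```

Here the TD target is $1 + 0.9 \cdot \max(1, 2) = 2.8$, so the estimate moves one tenth of the way from 0 toward 2.8. The target uses the value 2 of the best next action even if the agent goes on to take the other one.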

Figure 6: The Q-Learning algorithm loop. The agent observes a state, selects an action via $\epsilon$-greedy, executes it, observes the reward and next state, and updates its Q-table.

The Complete Algorithm

The full Q-Learning algorithm is:

Initialize $Q(s, a) = 0$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}$
For each episode:
1. Initialize state $S$
2. For each step of the episode, until $S$ is terminal:
  1. Choose action $A$ from $S$ using $\epsilon$-greedy on $Q$
  2. Take action $A$, observe reward $R$ and next state $S'$
  3. Update: $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_a Q(S', a) - Q(S, A) \right]$, with $Q(S', \cdot) = 0$ if $S'$ is terminal
  4. $S \leftarrow S'$

Under standard conditions (all state-action pairs are visited infinitely often and the learning rate satisfies the Robbins-Monro conditions $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$), Q-Learning converges to $Q^*$ with probability 1 (Watkins & Dayan, 1992).
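The steps above can be sketched end to end on a toy problem. The environment below is a hypothetical five-state chain invented for this example: the agent starts at the left end, and reaching the right end gives reward 1 and ends the episode. The hyperparameters, the random tie-breaking, the fixed $\epsilon$, and the `max_steps` cap are all assumptions, and the constant learning rate does not satisfy the convergence conditions stated above; it is simply a common practical choice.

```python
import random

N_STATES = 5      # states 0..4; state 4 is terminal (the goal)
ACTIONS = [0, 1]  # 0 = left, 1 = right


def step(state, action):
    """Hypothetical chain environment: reward 1 on reaching the goal, else 0."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def epsilon_greedy(Q, state, epsilon, rng):
    """Explore with probability epsilon, otherwise act greedily on Q[state]."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[state])
    # Break ties at random so an all-zero table does not always pick "left".
    return rng.choice([a for a in ACTIONS if Q[state][a] == best])


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1,
               max_steps=200, seed=0):
    """Tabular Q-Learning on the chain; returns the learned Q-table."""
    rng = random.Random(seed)
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = epsilon_greedy(Q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Terminal states have value 0, so the target is just the reward.
            best_next = 0.0 if done else max(Q[next_state])
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
            state = next_state
            if done:
                break
    return Q


Q = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)  # expected: [1, 1, 1, 1], i.e. always move right
```

In this chain the optimal values are $Q^*(s, \text{right}) = 0.9^{3 - s}$ for the non-terminal states, so the learned entries for "right" should approach 0.729, 0.81, 0.9, and 1.0 from below.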

Off-Policy vs On-Policy: Q-Learning vs SARSA

Q-Learning is off-policy: the policy used for selecting actions during training (the behavior policy, $\epsilon$-greedy) differs from the policy being optimized (the target policy, greedy). The update uses $\max_a Q(S', a)$, the value of the best possible action, not the action actually taken.

SARSA (State-Action-Reward-State-Action) is the on-policy counterpart (Rummery & Niranjan, 1994). Its update rule is:

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right] \tag{13}
$$

The difference is subtle but significant: SARSA uses $Q(S_{t+1}, A_{t+1})$, the Q-value of the action that was actually taken, rather than the maximum. This means SARSA’s updates reflect the exploration noise from the $\epsilon$-greedy policy. In practice, SARSA tends to learn a more conservative policy that accounts for the agent’s own exploratory behavior, while Q-Learning learns the values of the greedy policy, as if the agent will always exploit from the next step on.

A classic illustration is the cliff-walking environment from Sutton and Barto (2018, Example 6.6): SARSA learns a safe path far from the cliff (because it accounts for occasional random steps toward the edge), while Q-Learning learns the optimal shortest path right along the cliff edge.
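The two targets differ only in which next-state value they bootstrap from. The sketch below compares them on a single made-up transition near a cliff; the Q-values, reward, and function names are assumptions for illustration and are not taken from the cliff-walking example itself.

```python
def q_learning_target(Q, r, s_next, gamma):
    """Off-policy target: bootstrap from the best action in s_next."""
    return r + gamma * max(Q[s_next])


def sarsa_target(Q, r, s_next, a_next, gamma):
    """On-policy target: bootstrap from the action actually taken in s_next."""
    return r + gamma * Q[s_next][a_next]


# Hypothetical values for a state next to the cliff:
# action 0 walks along the edge, action 1 steps off it.
Q = {"edge": [2.0, -10.0]}
r, gamma = -1.0, 1.0

print(q_learning_target(Q, r, "edge", gamma))               # 1.0
print(sarsa_target(Q, r, "edge", a_next=1, gamma=gamma))    # -11.0
```

When exploration happens to pick the bad action, SARSA's target drops sharply while Q-Learning's does not, which is why SARSA's estimates for states near the edge end up lower under an $\epsilon$-greedy behavior policy.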

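The off-policy nature of Q-Learning is also what makes it a natural foundation for experience replay and Deep Q-Networks (DQN; Mnih et al., 2015): because the update does not depend on which policy generated a transition, stored transitions from older policies can be reused. We will cover both in the next chapter.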

Summary

This chapter introduced the foundations of value-based reinforcement learning:

Value functions ($V$ and $Q$) measure expected cumulative reward. The action-value function $Q$ is particularly useful because it enables model-free control.
The Bellman equation provides a recursive decomposition of values that underlies nearly all RL algorithms.
Monte Carlo methods estimate values from complete episodes (unbiased, high variance), while Temporal Difference methods bootstrap from single steps (biased, lower variance, more data-efficient).
Q-Learning combines off-policy control with TD updates and, given sufficient exploration and a suitably decaying learning rate, converges to the optimal $Q^*$. Its on-policy counterpart is SARSA.

In the next chapter, we move from tabular Q-Learning to Deep Q-Networks (DQN), which use neural networks to approximate $Q$ in high-dimensional state spaces and opened the door to learning directly from raw pixels in Atari games (Mnih et al., 2015).

References

Watkins, C. J. C. H., & Dayan, P. (1992). Q-Learning. Machine Learning, 8(3–4), 279–292. doi:10.1007/BF00992698
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
Sutton, R. S. (1988). Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3(1), 9–44. doi:10.1007/BF00115009
Rummery, G. A., & Niranjan, M. (1994). On-Line Q-Learning Using Connectionist Systems (Technical Report CUED/F-INFENG/TR 166). Cambridge University Engineering Department.
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-Level Control through Deep Reinforcement Learning. Nature, 518(7540), 529–533. doi:10.1038/nature14236
