Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Q-Learning

Value-Based Methods, Bellman Equation, and Temporal-Difference Learning

In the first chapter we formalized the reinforcement learning problem as a Markov Decision Process and introduced the notions of policy, value function, and Bellman equation. The central question left open was: how does an agent actually learn a good policy from experience? This chapter answers that question for the simple tabular setting. We develop value-based methods — algorithms that learn a value function and derive a policy from it — culminating in the Q-Learning algorithm Watkins & Dayan, 1992, the most important stepping stone toward the deep RL methods we will study in subsequent chapters.

We begin with the two types of value functions (state-value and action-value), revisit the Bellman equation in more depth, and compare two model-free strategies for estimating values: Monte Carlo and Temporal Difference learning. We then present Q-Learning itself — an off-policy, TD-based algorithm that converges to the optimal action-value function — and walk through a worked example.

Slides for this chapter (open full screen).

Value-Based Methods

Recall from the previous chapter that there are two broad families of approaches to finding a good policy:

In this chapter we focus on value-based methods. The core idea is simple: if we can accurately estimate the value of every action in every state, the optimal policy is just the one that always picks the best action. The challenge, of course, is learning those values from experience.

The State-Value Function

The state-value function Vπ(s)V^\pi(s) measures the expected cumulative discounted reward an agent receives when starting from state ss and following policy π\pi thereafter Sutton & Barto, 2018:

Vπ(s)=Eπ[GtSt=s]=Eπ[k=0γkRt+k+1St=s]V^\pi(s) = \mathbb{E}_\pi \left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi \left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s \right]

Here GtG_t denotes the return — the total discounted reward collected from time step tt onward, Gt=Rt+1+γRt+2+γ2Rt+3+G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots, where Rt+k+1R_{t+k+1} is the reward received at step t+k+1t+k+1 and γ[0,1]\gamma \in [0, 1] is the discount factor that down-weights future rewards.

Intuitively, Vπ(s)V^\pi(s) answers the question: “How good is it to be in state ss if I follow policy π\pi from here on?”

The Action-Value Function

The action-value function Qπ(s,a)Q^\pi(s, a) is closely related but conditions on both the current state and the action taken Sutton & Barto, 2018:

Qπ(s,a)=Eπ[GtSt=s,At=a]Q^\pi(s, a) = \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right]

This answers: “How good is it to take action aa in state ss, and then follow policy π\pi?”

State-value V(s) assigns a single number to each state, while action-value Q(s, a) assigns a number to each state-action pair.

Figure 2:State-value V(s)V(s) assigns a single number to each state, while action-value Q(s,a)Q(s, a) assigns a number to each state-action pair.

Why Q Is More Useful for Control

With V(s)V(s) alone, the agent cannot decide which action to take without a model of the environment — it would need to evaluate V(s)V(s') for every possible successor state ss' reached by each action, which requires knowledge of the transition function p(ss,a)p(s' \mid s, a).

With Q(s,a)Q(s, a), the agent simply selects:

π(s)=argmaxaQ(s,a)\pi(s) = \arg\max_a Q(s, a)

No model needed. This makes the action-value function the natural foundation for model-free control, which is the dominant paradigm in deep RL. If we can learn the optimal action-value function QQ^*, we immediately obtain the optimal policy.

The Bellman Equation

The Bellman equation is the mathematical backbone of nearly every RL algorithm. It expresses a recursive relationship: the value of a state equals the immediate reward plus the discounted value of the successor state Sutton & Barto, 2018.

Bellman Equation for VπV^\pi

Vπ(s)=Eπ[Rt+1+γVπ(St+1)St=s]V^\pi(s) = \mathbb{E}_\pi \left[ R_{t+1} + \gamma \, V^\pi(S_{t+1}) \mid S_t = s \right]

Expanding the expectation:

Vπ(s)=aπ(as)sp(ss,a)[r(s,a)+γVπ(s)]V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a) \left[ r(s, a) + \gamma \, V^\pi(s') \right]
Bellman backup diagram. The value of state s depends on the immediate rewards r and the values of successor states s', weighted by the policy \pi and transition probabilities p.

Figure 3:Bellman backup diagram. The value of state ss depends on the immediate rewards rr and the values of successor states ss', weighted by the policy π\pi and transition probabilities pp.

Bellman Equation for QπQ^\pi

Similarly, the action-value function satisfies:

Qπ(s,a)=sp(ss,a)[r(s,a)+γaπ(as)Qπ(s,a)]Q^\pi(s, a) = \sum_{s'} p(s' \mid s, a) \left[ r(s, a) + \gamma \sum_{a'} \pi(a' \mid s') \, Q^\pi(s', a') \right]

Bellman Optimality Equations

The optimal value functions, denoted VV^* and QQ^*, satisfy:

V(s)=maxasp(ss,a)[r(s,a)+γV(s)]V^*(s) = \max_a \sum_{s'} p(s' \mid s, a) \left[ r(s, a) + \gamma \, V^*(s') \right]
Q(s,a)=sp(ss,a)[r(s,a)+γmaxaQ(s,a)]Q^*(s, a) = \sum_{s'} p(s' \mid s, a) \left[ r(s, a) + \gamma \max_{a'} Q^*(s', a') \right]

The key difference from the policy-dependent versions is that the max\max operator replaces the policy-weighted average. If we could solve these equations directly, we would obtain the optimal policy π(s)=argmaxaQ(s,a)\pi^*(s) = \arg\max_a Q^*(s, a). In practice, we cannot solve them analytically for large problems — but we can estimate QQ^* from experience using algorithms like Q-Learning.

Monte Carlo vs Temporal Difference Learning

Both Monte Carlo (MC) and Temporal Difference (TD) methods estimate value functions from experience, without requiring a model. They differ fundamentally in when and how they update their estimates.

Monte Carlo Methods

Monte Carlo methods wait until the end of an episode to compute the actual return GtG_t — the sum of discounted rewards from time step tt to the terminal state Sutton & Barto, 2018, Ch. 5:

Gt=Rt+1+γRt+2+γ2Rt+3++γTt1RTG_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots + \gamma^{T-t-1} R_T

The value estimate is then updated toward this observed return:

V(St)V(St)+α[GtV(St)]V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]

where α\alpha is the learning rate. Since GtG_t is a sample of the true expected return, MC estimates are unbiased. However, because each trajectory can vary widely, the estimates have high variance. MC also requires complete episodes, limiting it to episodic tasks.

Temporal Difference Learning

Temporal Difference learning, introduced by Sutton Sutton, 1988, updates the value estimate at every time step using a bootstrapped target:

V(St)V(St)+α[Rt+1+γV(St+1)TD targetV(St)]V(S_t) \leftarrow V(S_t) + \alpha \left[ \underbrace{R_{t+1} + \gamma \, V(S_{t+1})}_{\text{TD target}} - V(S_t) \right]

The quantity δt=Rt+1+γV(St+1)V(St)\delta_t = R_{t+1} + \gamma \, V(S_{t+1}) - V(S_t) is called the TD error. Instead of waiting for the full return, TD uses the immediate reward plus the current estimate of the next state’s value. This bootstrapping introduces bias (because V(St+1)V(S_{t+1}) is itself an estimate), but it reduces variance and allows learning to happen online, step by step, even in continuing (non-episodic) tasks.

Monte Carlo computes the actual return over the full trajectory. Temporal Difference bootstraps from a single transition.

Figure 4:Monte Carlo computes the actual return over the full trajectory. Temporal Difference bootstraps from a single transition.

Comparison

Monte CarloTemporal Difference
UpdatesAfter full episodeAfter every step
TargetActual return GtG_tEstimated Rt+1+γV(St+1)R_{t+1} + \gamma V(S_{t+1})
BiasUnbiasedBiased (bootstrap)
VarianceHighLower
Data efficiencyLowerHigher
Episode requiredYesNo
BootstrappingNoYes

As Sutton and Barto put it, TD learning “combines the sampling of Monte Carlo with the bootstrapping of dynamic programming” Sutton & Barto, 2018. Q-Learning, which we turn to next, is a TD method.

Q-Learning

Q-Learning is an off-policy, value-based algorithm that uses temporal difference updates to learn the optimal action-value function QQ^* directly Watkins & Dayan, 1992. It was introduced by Watkins in his 1989 PhD thesis and is one of the most fundamental algorithms in reinforcement learning.

The Q-Table

In the tabular setting, Q-Learning maintains a table Q(s,a)Q(s, a) with one entry for every state-action pair. Each entry stores the agent’s current estimate of how good it is to take action aa in state ss. The table is initialized to zeros (or small random values) and updated through interaction with the environment.

A Q-table with states as rows and actions as columns. The highlighted cells indicate the greedy action (highest Q-value) for each state. The policy simply picks the action with the maximum Q-value.

Figure 5:A Q-table with states as rows and actions as columns. The highlighted cells indicate the greedy action (highest Q-value) for each state. The policy simply picks the action with the maximum Q-value.

Epsilon-Greedy Exploration

To balance exploration and exploitation, Q-Learning uses an ϵ\epsilon-greedy strategy for action selection:

The exploration rate ϵ\epsilon typically starts at 1.0 (fully random) and is gradually decayed over training, so that the agent explores broadly at first and increasingly exploits its learned Q-values as they become more accurate. Common decay strategies include linear decay and exponential decay.

The Q-Learning Update Rule

After taking action AtA_t in state StS_t, observing reward Rt+1R_{t+1} and next state St+1S_{t+1}, the Q-value is updated:

Q(St,At)Q(St,At)+α[Rt+1+γmaxaQ(St+1,a)TD target    Q(St,At)current estimate]Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ \underbrace{R_{t+1} + \gamma \max_a Q(S_{t+1}, a)}_{\text{TD target}} \;-\; \underbrace{Q(S_t, A_t)}_{\text{current estimate}} \right]

Breaking this down:

The critical feature is the max\max operator: the update always uses the value of the best action at the next state, regardless of which action the ϵ\epsilon-greedy policy actually selected. This is what makes Q-Learning off-policy.

The Q-Learning algorithm loop. The agent observes a state, selects an action via \epsilon-greedy, executes it, observes the reward and next state, and updates its Q-table.

Figure 6:The Q-Learning algorithm loop. The agent observes a state, selects an action via ϵ\epsilon-greedy, executes it, observes the reward and next state, and updates its Q-table.

The Complete Algorithm

The full Q-Learning algorithm is:

  1. Initialize Q(s,a)=0Q(s, a) = 0 for all sSs \in \mathcal{S}, aAa \in \mathcal{A}

  2. For each episode:

    1. Initialize state SS

    2. For each step of the episode:

      1. Choose action AA from SS using ϵ\epsilon-greedy on QQ

      2. Take action AA, observe reward RR and next state SS'

      3. Update: Q(S,A)Q(S,A)+α[R+γmaxaQ(S,a)Q(S,A)]Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \max_a Q(S', a) - Q(S, A) \right]

      4. SSS \leftarrow S'

Under standard conditions — all state-action pairs are visited infinitely often and the learning rate satisfies the Robbins-Monro conditions (tαt=\sum_t \alpha_t = \infty, tαt2<\sum_t \alpha_t^2 < \infty) — Q-Learning converges to QQ^* Watkins & Dayan, 1992.

Off-Policy vs On-Policy: Q-Learning vs SARSA

Q-Learning is off-policy: the policy used for selecting actions during training (the behavior policy, ϵ\epsilon-greedy) differs from the policy being optimized (the target policy, greedy). The update uses maxaQ(S,a)\max_a Q(S', a) — the value of the best possible action, not the action actually taken.

SARSA (State-Action-Reward-State-Action) is the on-policy counterpart Rummery & Niranjan, 1994. Its update rule is:

Q(St,At)Q(St,At)+α[Rt+1+γQ(St+1,At+1)Q(St,At)]Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]

The difference is subtle but significant: SARSA uses Q(St+1,At+1)Q(S_{t+1}, A_{t+1}) — the Q-value of the action that was actually taken — rather than the maximum. This means SARSA’s updates reflect the exploration noise from the ϵ\epsilon-greedy policy. In practice, SARSA tends to learn a more conservative policy that accounts for the agent’s own exploratory behavior, while Q-Learning learns the optimal policy assuming perfect exploitation.

A classic illustration is the cliff-walking environment from Sutton and Barto Sutton & Barto, 2018, Example 6.6: SARSA learns a safe path far from the cliff (because it accounts for occasional random steps toward the edge), while Q-Learning learns the optimal shortest path right along the cliff edge.

The off-policy nature of Q-Learning is also what makes it a natural foundation for experience replay and Deep Q-Networks (DQN) Mnih et al., 2015, which we will cover in the next chapter.

Summary

This chapter introduced the foundations of value-based reinforcement learning:

In the next chapter, we move from tabular Q-Learning to Deep Q-Networks (DQN) — using neural networks to approximate QQ in high-dimensional state spaces, which opened the door to learning directly from raw pixels in Atari games Mnih et al., 2015.

References
  1. Watkins, C. J. C. H., & Dayan, P. (1992). Q-Learning. Machine Learning, 8(3–4), 279–292. 10.1007/BF00992698
  2. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
  3. Sutton, R. S. (1988). Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3(1), 9–44. 10.1007/BF00115009
  4. Rummery, G. A., & Niranjan, M. (1994). On-Line Q-Learning Using Connectionist Systems (Techreport CUED/F-INFENG/TR 166). Cambridge University Engineering Department.
  5. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-Level Control through Deep Reinforcement Learning. Nature, 518(7540), 529–533. 10.1038/nature14236