Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Deep Q-Networks

From Tabular Q-Learning to Function Approximation with Neural Networks

In the previous chapter we developed Q-Learning in the tabular setting: a separate entry Q(s,a)Q(s, a) for every state-action pair, updated with a temporal-difference rule, guaranteed to converge to QQ^* under mild conditions Watkins & Dayan, 1992. This is elegant and theoretically satisfying, but it has a fatal practical limitation. The table has one row per state, so the memory and data requirements scale with the size of the state space. For the problems we actually care about — robotic control, game playing from pixels, dialogue — the state space is astronomically large or even continuous, and tabular methods break down entirely.

This chapter takes the step that launched deep reinforcement learning: we replace the Q-table with a neural-network function approximator Qθ(s,a)Q_\theta(s, a), trained on the same TD targets we already know. We will see why the naive combination diverges in practice, and how two simple algorithmic ideas — experience replay and target networks — stabilize training enough to scale Q-Learning to raw pixels, culminating in the Deep Q-Network (DQN) algorithm that reached human-level performance on 49 Atari 2600 games Mnih et al., 2015.

Slides for this chapter (open full screen).

From Tables to Function Approximators

The Curse of Dimensionality

A tabular QQ needs SA|\mathcal{S}| \cdot |\mathcal{A}| entries. For a simple grid-world this is fine; for an Atari 2600 frame it is hopeless. A single Atari screen is 210×160210 \times 160 pixels with 3 color channels at 8-bit depth, so the raw state space contains on the order of

256210×160×310242823256^{210 \times 160 \times 3} \approx 10^{242\,823}

distinct frames. Even after the preprocessing used by Mnih et al. (2015) — grayscale, downsampling to 84×8484 \times 84, and stacking the last 4 frames — the state space is still far too large to enumerate. A table is not an option.

Tables Have No Generalization

A second, subtler problem is that tabular QQ assigns a completely independent value to every state. Two frames that differ only in the position of a single pixel are treated as unrelated cells in the table, and learning about one tells us nothing about the other. For high-dimensional sensory input, where nearby observations almost always share similar values, this absence of generalization means the agent must visit essentially every state — which, by the previous argument, is impossible.

Parametric Function Approximation

The fix, imported from supervised learning, is to represent QQ as a parametric function

Qθ(s,a)Q(s,a),θRd,Q_\theta(s, a) \approx Q^*(s, a), \qquad \theta \in \mathbb{R}^d,

with dd chosen independently of S|\mathcal{S}|. A neural network is a natural choice: it has enough capacity to represent complex value functions, and, unlike the table, its parameters are shared across all states. Two consequences follow. First, QθQ_\theta is a continuous function of its input so nearby states are mapped to nearby features, and therefore to similar Q-values. Second, a single gradient step on one transition updates the same θ\theta that defines QθQ_\theta at every other state. This is the precise sense in which function approximation “generalizes”: what we learn at one state propagates to its neighbours through the shared weights, rather than remaining trapped in a single table cell. The price is an inductive bias — the network assumes that similar inputs should have similar values — which is only useful when that assumption matches the problem.

From a lookup table to a parametric approximator. Left: tabular Q(s, a) with one independent entry per cell — infeasible for large |\mathcal{S}| and with no generalization between rows. Right: a neural network Q_\theta(s, a) with shared parameters \theta. A single forward pass returns all |\mathcal{A}| action-values at once.

Figure 2:From a lookup table to a parametric approximator. Left: tabular Q(s,a)Q(s, a) with one independent entry per cell — infeasible for large S|\mathcal{S}| and with no generalization between rows. Right: a neural network Qθ(s,a)Q_\theta(s, a) with shared parameters θ\theta. A single forward pass returns all A|\mathcal{A}| action-values at once.

Semi-Gradient TD as Supervised Learning

Given a parametric QθQ_\theta, how do we train it? The Q-Learning update from the previous chapter already points the way. Recall the TD target at step tt:

yt  =  Rt+1+γmaxaQ(St+1,a).y_t \;=\; R_{t+1} + \gamma \, \max_{a'} Q(S_{t+1}, a').

If we treat yty_t as a regression label for the input (St,At)(S_t, A_t), we can fit QθQ_\theta by minimizing the squared error

L(θ)  =  E[(ytQθ(St,At))2].\mathcal{L}(\theta) \;=\; \mathbb{E} \left[ \left( y_t - Q_\theta(S_t, A_t) \right)^2 \right].

The gradient of this loss with respect to θ\theta is

θL(θ)  =  2E[(ytQθ(St,At))θQθ(St,At)].\nabla_\theta \mathcal{L}(\theta) \;=\; -2 \, \mathbb{E} \left[ \left( y_t - Q_\theta(S_t, A_t) \right) \nabla_\theta Q_\theta(S_t, A_t) \right].

This is a semi-gradient update Sutton & Barto, 2018: we differentiate Qθ(St,At)Q_\theta(S_t, A_t) but treat the target yty_t as if it did not depend on θ\theta, even though in reality it does (through the bootstrap maxaQθ(St+1,a)\max_{a'} Q_\theta(S_{t+1}, a')). The “semi-gradient” name warns us that this is not the true gradient of any objective — and, as we are about to see, that shortcut is also the source of serious stability problems.

The DQN Architecture

The Deep Q-Network of Mnih et al. (2015) is the concrete neural architecture that made function approximation from pixels work. A convolutional network LeCun et al., 1998 maps the stacked frame input to action-values:

The key design choice is the shape of the output. Instead of modelling Qθ(s,a)Q_\theta(s, a) as a scalar that takes both the state and the action as inputs, the network takes only the state and outputs a vector of A|\mathcal{A}| Q-values in a single forward pass. At action-selection time we can therefore compute argmaxaQθ(s,a)\arg\max_a Q_\theta(s, a) with one network evaluation rather than A|\mathcal{A}| of them — a crucial optimization when we will call the network many times per environment step.

The DQN convolutional architecture of . The input is a stack of the last 4 preprocessed frames (84 \times 84 \times 4). Three convolutional layers — 32 filters 8 \times 8 with stride 4, 64 filters 4 \times 4 with stride 2, 64 filters 3 \times 3 with stride 1, each followed by ReLU — reduce the spatial dimension while growing the channel count. The resulting 7 \times 7 \times 64 feature map is flattened and passed through a 512-unit ReLU layer, followed by a linear head with one output per discrete action. A single forward pass returns the full vector of |\mathcal{A}| action-values.

Figure 3:The DQN convolutional architecture of Mnih et al. (2015). The input is a stack of the last 4 preprocessed frames (84×84×484 \times 84 \times 4). Three convolutional layers — 32 filters 8×88 \times 8 with stride 4, 64 filters 4×44 \times 4 with stride 2, 64 filters 3×33 \times 3 with stride 1, each followed by ReLU — reduce the spatial dimension while growing the channel count. The resulting 7×7×647 \times 7 \times 64 feature map is flattened and passed through a 512-unit ReLU layer, followed by a linear head with one output per discrete action. A single forward pass returns the full vector of A|\mathcal{A}| action-values.

The architecture was first prototyped in a NeurIPS workshop paper Mnih et al., 2013 and scaled up to the full 49-game benchmark in the subsequent Nature paper Mnih et al., 2015. The same architecture and hyperparameters were used across all games, with no game-specific tuning.

Naive Deep Q-Learning and Why It Fails

Plugging the neural network directly into the Q-Learning update gives the following “naive” training procedure: at each step, take the current transition (St,At,Rt+1,St+1)(S_t, A_t, R_{t+1}, S_{t+1}), compute the TD target yt=Rt+1+γmaxaQθ(St+1,a)y_t = R_{t+1} + \gamma \max_{a'} Q_\theta(S_{t+1}, a'), and take a gradient step on

L(θ)  =  (ytQθ(St,At))2.\mathcal{L}(\theta) \;=\; \left( y_t - Q_\theta(S_t, A_t) \right)^2.

In practice, this diverges. Mnih et al. (2015) identify two concrete failure modes.

Sample Correlation

Consecutive transitions produced by a Markovian agent interacting with its environment are highly correlated: St+1S_{t+1} depends on StS_t, and the agent’s own policy changes slowly over time. Stochastic gradient descent assumes mini-batches that are approximately independent and identically distributed (i.i.d.), but a stream of correlated transitions violates this assumption badly. The result is that gradient estimates are systematically biased toward whatever the agent is doing right now, the network overfits to the latest region of state space, and earlier competence is erased — a phenomenon that in supervised learning is called catastrophic forgetting.

Moving Targets

The second failure mode is more insidious. The TD target

yt  =  Rt+1+γmaxaQθ(St+1,a)y_t \;=\; R_{t+1} + \gamma \max_{a'} Q_\theta(S_{t+1}, a')

depends on the very parameters θ\theta we are updating. An SGD step that changes θ\theta also changes yty_t itself, which changes the loss landscape, which changes the next step, and so on. The network is, in effect, chasing its own tail: small fluctuations in θ\theta feed back into the regression targets and are amplified on the next update. In combination with function approximation and off-policy bootstrapping, this positive feedback frequently causes training to diverge Sutton & Barto, 2018.

Experience Replay

The first stabilizing idea is experience replay, introduced by Lin (1992) and scaled up by Mnih et al. (2015). Rather than training on transitions in the order they are experienced, the agent stores every transition in a large first-in-first-out buffer D\mathcal{D} of capacity NN (typically N=106N = 10^6) and draws uniform random mini-batches from D\mathcal{D} for each gradient step.

Experience replay. As the agent interacts with the environment, every transition (s, a, r, s', \text{done}) is pushed into a FIFO buffer \mathcal{D}. Training samples are drawn uniformly at random from \mathcal{D}, not from the live stream.

Figure 4:Experience replay. As the agent interacts with the environment, every transition (s,a,r,s,done)(s, a, r, s', \text{done}) is pushed into a FIFO buffer D\mathcal{D}. Training samples are drawn uniformly at random from D\mathcal{D}, not from the live stream.

This gives three distinct benefits:

  1. Decorrelation. Samples in a mini-batch are drawn from different time steps — and often different episodes — so they look much closer to i.i.d. than the raw trajectory. Gradient estimates are less biased, and SGD’s assumptions are (approximately) restored.

  2. Sample efficiency. Each transition can be used many times in gradient updates rather than exactly once, which is especially valuable when environment interaction is slow or expensive.

  3. Smoother training distribution. Averaging over the recent history of behavior reduces the variance of the training distribution and prevents oscillations driven by the agent’s most recent policy.

Experience replay only makes sense if the algorithm is off-policy: because past transitions were collected under older policies that differ from the current ϵ\epsilon-greedy policy, we need a learning rule whose update is valid under a different behavior distribution than the target policy. Q-Learning is off-policy — its update is based on maxaQ(S,a)\max_{a'} Q(S', a') regardless of which action the agent actually took — which is exactly what makes this trick available. (The on-policy counterpart, SARSA, cannot use experience replay without correction.)

A replay buffer is not without cost: maintaining and sampling a large buffer adds memory and computational overhead, and its footprint makes it hard to colocate data with environment simulation and learning on a single GPU in one fused, highly parallel loop Gallici et al., 2024.

Target Networks

Experience replay addresses the correlation problem but not the moving-target problem. The second idea, also from Mnih et al. (2015), is to maintain a second copy of the network parameters, denoted θ\theta^-, used only to compute the TD target:

y  =  r+γmaxaQθ(s,a).y \;=\; r + \gamma \, \max_{a'} Q_{\theta^-}(s', a').

The online parameters θ\theta are updated every SGD step as usual. The target parameters θ\theta^- are held fixed and synchronized to θ\theta every CC gradient steps via the hard update

θθ.\theta^- \leftarrow \theta.

Between syncs, θ\theta^- is frozen, so the regression label yy is stationary from the optimizer’s point of view — just like a label in supervised learning. The self-referential feedback loop that caused divergence in the naive scheme is broken: θ\theta can move freely for CC steps before the target catches up.

Two copies of the network. The online network Q_\theta is updated every step by gradient descent on the TD loss. The target network Q_{\theta^-} (dashed border) is frozen between syncs and used only to evaluate the TD target. Every C steps its parameters are overwritten with the current \theta.

Figure 5:Two copies of the network. The online network QθQ_\theta is updated every step by gradient descent on the TD loss. The target network QθQ_{\theta^-} (dashed border) is frozen between syncs and used only to evaluate the TD target. Every CC steps its parameters are overwritten with the current θ\theta.

The DQN Algorithm

Putting experience replay and target networks together with Q-Learning gives the full DQN algorithm of Mnih et al. (2015):

  1. Initialize QθQ_\theta with random θ\theta; set θθ\theta^- \leftarrow \theta; initialize an empty replay buffer D\mathcal{D} of capacity NN.

  2. For each episode:

    1. Observe initial state SS.

    2. For each step tt of the episode:

      1. Select AA from SS using ϵ\epsilon-greedy on Qθ(S,)Q_\theta(S, \cdot).

      2. Execute AA; observe reward RR, next state SS', and termination flag done\text{done}.

      3. Store the transition (S,A,R,S,done)(S, A, R, S', \text{done}) in D\mathcal{D}.

      4. Sample a random mini-batch B\mathcal{B} of transitions (s,a,r,s,done)(s, a, r, s', \text{done}) from D\mathcal{D}.

      5. For each transition in B\mathcal{B}, compute the TD target

        y={rif done,r+γmaxaQθ(s,a)otherwise.y = \begin{cases} r & \text{if } \text{done},\\ r + \gamma \max_{a'} Q_{\theta^-}(s', a') & \text{otherwise.}\end{cases}
      6. Take an SGD step on L(θ)=1BB(yQθ(s,a))2\mathcal{L}(\theta) = \frac{1}{|\mathcal{B}|} \sum_{\mathcal{B}} \left( y - Q_\theta(s, a) \right)^2.

      7. Every CC steps, synchronize the target network: θθ\theta^- \leftarrow \theta.

      8. SSS \leftarrow S'.

The DQN training loop. An \epsilon-greedy step on the online network Q_\theta produces a new transition that is pushed into the replay buffer \mathcal{D}. A random mini-batch from \mathcal{D} is used to compute the TD target via the frozen target network Q_{\theta^-}, followed by an SGD step on the online parameters. Every C steps, \theta^- is overwritten with \theta.

Figure 6:The DQN training loop. An ϵ\epsilon-greedy step on the online network QθQ_\theta produces a new transition that is pushed into the replay buffer D\mathcal{D}. A random mini-batch from D\mathcal{D} is used to compute the TD target via the frozen target network QθQ_{\theta^-}, followed by an SGD step on the online parameters. Every CC steps, θ\theta^- is overwritten with θ\theta.

The hyperparameters reported by Mnih et al. (2015) for the Atari benchmark are a good starting point in practice: replay buffer capacity N=106N = 10^6, mini-batch size 32, target-sync period C=10,000C = 10{,}000 updates, discount γ=0.99\gamma = 0.99, and an ϵ\epsilon that is annealed linearly from 1.0 to 0.1 over the first million frames and held fixed thereafter. The optimizer is RMSProp with a learning rate on the order of 2.5×1042.5 \times 10^{-4}. Rewards are clipped to {1,0,+1}\{-1, 0, +1\} to keep gradient magnitudes comparable across games.

The Atari Breakthrough

The significance of DQN is not in any single one of these ideas — neural-network function approximation, experience replay, TD-style losses, and convolutional nets all existed before. The significance is that putting them together produced the first agent that learned to play 49 Atari 2600 games at or above human level directly from raw pixels, using a single architecture, a single set of hyperparameters, and only the game score as a reward signal Mnih et al., 2015. Before DQN, reinforcement learning with neural networks was a niche, unstable affair. After DQN, deep RL became a full research subfield.

Extensions

DQN has been refined in many directions since 2015, each addressing a specific limitation. Double DQN Hasselt et al., 2016 decouples action selection from action evaluation in the max operator to reduce the well-known overestimation bias of standard Q-Learning. Prioritized Experience Replay Schaul et al., 2016 samples transitions from D\mathcal{D} with probability proportional to their TD error instead of uniformly, focusing learning on the most informative experiences. Dueling DQN Wang et al., 2016 factorizes QθQ_\theta into a state-value stream and an advantage stream, improving learning in states where the choice of action matters little. Rainbow Hessel et al., 2018 combines these and three further improvements into a single agent that substantially outperforms vanilla DQN.

Summary

This chapter took the step from tabular Q-Learning to deep reinforcement learning:

Together, these ideas turn Q-Learning into the DQN algorithm Mnih et al., 2015 that learns Atari from pixels. In the next chapter we leave value-based methods behind and look at policy-based methods, starting with the REINFORCE algorithm — a different way of parameterising and optimizing the agent, one that extends naturally to continuous action spaces where argmaxaQθ(s,a)\arg\max_a Q_\theta(s, a) is no longer tractable.

References
  1. Watkins, C. J. C. H., & Dayan, P. (1992). Q-Learning. Machine Learning, 8(3–4), 279–292. 10.1007/BF00992698
  2. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-Level Control through Deep Reinforcement Learning. Nature, 518(7540), 529–533. 10.1038/nature14236
  3. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
  4. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11), 2278–2324. 10.1109/5.726791
  5. Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with Deep Reinforcement Learning. arXiv Preprint arXiv:1312.5602. https://arxiv.org/abs/1312.5602
  6. Lin, L.-J. (1992). Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching. Machine Learning, 8(3–4), 293–321. 10.1007/BF00992699
  7. Gallici, M., Fellows, M., Ellis, B., Pou, B., Masmitja, I., Foerster, J. N., & Martin, M. (2024). Simplifying Deep Temporal Difference Learning. arXiv Preprint arXiv:2407.04811. https://arxiv.org/abs/2407.04811
  8. Polyak, B. T., & Juditsky, A. B. (1992). Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 30(4), 838–855. 10.1137/0330046
  9. Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous Control with Deep Reinforcement Learning. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1509.02971
  10. van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1), 2094–2100. 10.1609/aaai.v30i1.10295
  11. Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized Experience Replay. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1511.05952
  12. Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., & de Freitas, N. (2016). Dueling Network Architectures for Deep Reinforcement Learning. Proceedings of the 33rd International Conference on Machine Learning (ICML), 48, 1995–2003. https://proceedings.mlr.press/v48/wangf16.html
  13. Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., & Silver, D. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 3215–3222. 10.1609/aaai.v32i1.11796