Deep Q-Networks
From Tabular Q-Learning to Function Approximation with Neural Networks
In the previous chapter we developed Q-Learning in the tabular setting: a separate entry for every state-action pair, updated with a temporal-difference rule, guaranteed to converge to under mild conditions Watkins & Dayan, 1992. This is elegant and theoretically satisfying, but it has a fatal practical limitation. The table has one row per state, so the memory and data requirements scale with the size of the state space. For the problems we actually care about — robotic control, game playing from pixels, dialogue — the state space is astronomically large or even continuous, and tabular methods break down entirely.
This chapter takes the step that launched deep reinforcement learning: we replace the Q-table with a neural-network function approximator , trained on the same TD targets we already know. We will see why the naive combination diverges in practice, and how two simple algorithmic ideas — experience replay and target networks — stabilize training enough to scale Q-Learning to raw pixels, culminating in the Deep Q-Network (DQN) algorithm that reached human-level performance on 49 Atari 2600 games Mnih et al., 2015.
Slides for this chapter (open full screen).
From Tables to Function Approximators¶
The Curse of Dimensionality¶
A tabular needs entries. For a simple grid-world this is fine; for an Atari 2600 frame it is hopeless. A single Atari screen is pixels with 3 color channels at 8-bit depth, so the raw state space contains on the order of
distinct frames. Even after the preprocessing used by Mnih et al. (2015) — grayscale, downsampling to , and stacking the last 4 frames — the state space is still far too large to enumerate. A table is not an option.
Tables Have No Generalization¶
A second, subtler problem is that tabular assigns a completely independent value to every state. Two frames that differ only in the position of a single pixel are treated as unrelated cells in the table, and learning about one tells us nothing about the other. For high-dimensional sensory input, where nearby observations almost always share similar values, this absence of generalization means the agent must visit essentially every state — which, by the previous argument, is impossible.
Parametric Function Approximation¶
The fix, imported from supervised learning, is to represent as a parametric function
with chosen independently of . A neural network is a natural choice: it has enough capacity to represent complex value functions, and, unlike the table, its parameters are shared across all states. Two consequences follow. First, is a continuous function of its input so nearby states are mapped to nearby features, and therefore to similar Q-values. Second, a single gradient step on one transition updates the same that defines at every other state. This is the precise sense in which function approximation “generalizes”: what we learn at one state propagates to its neighbours through the shared weights, rather than remaining trapped in a single table cell. The price is an inductive bias — the network assumes that similar inputs should have similar values — which is only useful when that assumption matches the problem.
Figure 2:From a lookup table to a parametric approximator. Left: tabular with one independent entry per cell — infeasible for large and with no generalization between rows. Right: a neural network with shared parameters . A single forward pass returns all action-values at once.
Semi-Gradient TD as Supervised Learning¶
Given a parametric , how do we train it? The Q-Learning update from the previous chapter already points the way. Recall the TD target at step :
If we treat as a regression label for the input , we can fit by minimizing the squared error
The gradient of this loss with respect to is
This is a semi-gradient update Sutton & Barto, 2018: we differentiate but treat the target as if it did not depend on , even though in reality it does (through the bootstrap ). The “semi-gradient” name warns us that this is not the true gradient of any objective — and, as we are about to see, that shortcut is also the source of serious stability problems.
The DQN Architecture¶
The Deep Q-Network of Mnih et al. (2015) is the concrete neural architecture that made function approximation from pixels work. A convolutional network LeCun et al., 1998 maps the stacked frame input to action-values:
Input: the last 4 preprocessed frames, stacked into an tensor. Frame stacking gives the network a short-term history, so that it can perceive motion and velocity.
Three convolutional layers with ReLU activations: 32 filters of size with stride 4, then 64 filters of with stride 2, then 64 filters of with stride 1.
Two fully-connected layers: 512 ReLU units, then a linear output layer with units — one Q-value per discrete action.
The key design choice is the shape of the output. Instead of modelling as a scalar that takes both the state and the action as inputs, the network takes only the state and outputs a vector of Q-values in a single forward pass. At action-selection time we can therefore compute with one network evaluation rather than of them — a crucial optimization when we will call the network many times per environment step.
Figure 3:The DQN convolutional architecture of Mnih et al. (2015). The input is a stack of the last 4 preprocessed frames (). Three convolutional layers — 32 filters with stride 4, 64 filters with stride 2, 64 filters with stride 1, each followed by ReLU — reduce the spatial dimension while growing the channel count. The resulting feature map is flattened and passed through a 512-unit ReLU layer, followed by a linear head with one output per discrete action. A single forward pass returns the full vector of action-values.
The architecture was first prototyped in a NeurIPS workshop paper Mnih et al., 2013 and scaled up to the full 49-game benchmark in the subsequent Nature paper Mnih et al., 2015. The same architecture and hyperparameters were used across all games, with no game-specific tuning.
Naive Deep Q-Learning and Why It Fails¶
Plugging the neural network directly into the Q-Learning update gives the following “naive” training procedure: at each step, take the current transition , compute the TD target , and take a gradient step on
In practice, this diverges. Mnih et al. (2015) identify two concrete failure modes.
Sample Correlation¶
Consecutive transitions produced by a Markovian agent interacting with its environment are highly correlated: depends on , and the agent’s own policy changes slowly over time. Stochastic gradient descent assumes mini-batches that are approximately independent and identically distributed (i.i.d.), but a stream of correlated transitions violates this assumption badly. The result is that gradient estimates are systematically biased toward whatever the agent is doing right now, the network overfits to the latest region of state space, and earlier competence is erased — a phenomenon that in supervised learning is called catastrophic forgetting.
Moving Targets¶
The second failure mode is more insidious. The TD target
depends on the very parameters we are updating. An SGD step that changes also changes itself, which changes the loss landscape, which changes the next step, and so on. The network is, in effect, chasing its own tail: small fluctuations in feed back into the regression targets and are amplified on the next update. In combination with function approximation and off-policy bootstrapping, this positive feedback frequently causes training to diverge Sutton & Barto, 2018.
Experience Replay¶
The first stabilizing idea is experience replay, introduced by Lin (1992) and scaled up by Mnih et al. (2015). Rather than training on transitions in the order they are experienced, the agent stores every transition in a large first-in-first-out buffer of capacity (typically ) and draws uniform random mini-batches from for each gradient step.
Figure 4:Experience replay. As the agent interacts with the environment, every transition is pushed into a FIFO buffer . Training samples are drawn uniformly at random from , not from the live stream.
This gives three distinct benefits:
Decorrelation. Samples in a mini-batch are drawn from different time steps — and often different episodes — so they look much closer to i.i.d. than the raw trajectory. Gradient estimates are less biased, and SGD’s assumptions are (approximately) restored.
Sample efficiency. Each transition can be used many times in gradient updates rather than exactly once, which is especially valuable when environment interaction is slow or expensive.
Smoother training distribution. Averaging over the recent history of behavior reduces the variance of the training distribution and prevents oscillations driven by the agent’s most recent policy.
Experience replay only makes sense if the algorithm is off-policy: because past transitions were collected under older policies that differ from the current -greedy policy, we need a learning rule whose update is valid under a different behavior distribution than the target policy. Q-Learning is off-policy — its update is based on regardless of which action the agent actually took — which is exactly what makes this trick available. (The on-policy counterpart, SARSA, cannot use experience replay without correction.)
A replay buffer is not without cost: maintaining and sampling a large buffer adds memory and computational overhead, and its footprint makes it hard to colocate data with environment simulation and learning on a single GPU in one fused, highly parallel loop Gallici et al., 2024.
Target Networks¶
Experience replay addresses the correlation problem but not the moving-target problem. The second idea, also from Mnih et al. (2015), is to maintain a second copy of the network parameters, denoted , used only to compute the TD target:
The online parameters are updated every SGD step as usual. The target parameters are held fixed and synchronized to every gradient steps via the hard update
Between syncs, is frozen, so the regression label is stationary from the optimizer’s point of view — just like a label in supervised learning. The self-referential feedback loop that caused divergence in the naive scheme is broken: can move freely for steps before the target catches up.
Figure 5:Two copies of the network. The online network is updated every step by gradient descent on the TD loss. The target network (dashed border) is frozen between syncs and used only to evaluate the TD target. Every steps its parameters are overwritten with the current .
The DQN Algorithm¶
Putting experience replay and target networks together with Q-Learning gives the full DQN algorithm of Mnih et al. (2015):
Initialize with random ; set ; initialize an empty replay buffer of capacity .
For each episode:
Observe initial state .
For each step of the episode:
Select from using -greedy on .
Execute ; observe reward , next state , and termination flag .
Store the transition in .
Sample a random mini-batch of transitions from .
For each transition in , compute the TD target
Take an SGD step on .
Every steps, synchronize the target network: .
.
Figure 6:The DQN training loop. An -greedy step on the online network produces a new transition that is pushed into the replay buffer . A random mini-batch from is used to compute the TD target via the frozen target network , followed by an SGD step on the online parameters. Every steps, is overwritten with .
The hyperparameters reported by Mnih et al. (2015) for the Atari benchmark are a good starting point in practice: replay buffer capacity , mini-batch size 32, target-sync period updates, discount , and an that is annealed linearly from 1.0 to 0.1 over the first million frames and held fixed thereafter. The optimizer is RMSProp with a learning rate on the order of . Rewards are clipped to to keep gradient magnitudes comparable across games.
The Atari Breakthrough¶
The significance of DQN is not in any single one of these ideas — neural-network function approximation, experience replay, TD-style losses, and convolutional nets all existed before. The significance is that putting them together produced the first agent that learned to play 49 Atari 2600 games at or above human level directly from raw pixels, using a single architecture, a single set of hyperparameters, and only the game score as a reward signal Mnih et al., 2015. Before DQN, reinforcement learning with neural networks was a niche, unstable affair. After DQN, deep RL became a full research subfield.
Extensions¶
DQN has been refined in many directions since 2015, each addressing a specific limitation. Double DQN Hasselt et al., 2016 decouples action selection from action evaluation in the max operator to reduce the well-known overestimation bias of standard Q-Learning. Prioritized Experience Replay Schaul et al., 2016 samples transitions from with probability proportional to their TD error instead of uniformly, focusing learning on the most informative experiences. Dueling DQN Wang et al., 2016 factorizes into a state-value stream and an advantage stream, improving learning in states where the choice of action matters little. Rainbow Hessel et al., 2018 combines these and three further improvements into a single agent that substantially outperforms vanilla DQN.
Summary¶
This chapter took the step from tabular Q-Learning to deep reinforcement learning:
Function approximation with neural networks replaces the Q-table with . It makes high-dimensional state spaces tractable and gives us generalization between similar states for free.
Naive deep Q-Learning is unstable. Consecutive transitions are correlated, and the TD target depends on the parameters being optimized — a feedback loop that causes divergence.
Experience replay stores past transitions in a buffer and trains on uniform random mini-batches. It decorrelates samples, reuses data, and smooths the training distribution. It requires an off-policy algorithm such as Q-Learning.
Target networks evaluate the TD target with a frozen copy of the parameters, synchronized to every steps. They make the regression targets locally stationary, removing the self-referential feedback loop.
Together, these ideas turn Q-Learning into the DQN algorithm Mnih et al., 2015 that learns Atari from pixels. In the next chapter we leave value-based methods behind and look at policy-based methods, starting with the REINFORCE algorithm — a different way of parameterising and optimizing the agent, one that extends naturally to continuous action spaces where is no longer tractable.
- Watkins, C. J. C. H., & Dayan, P. (1992). Q-Learning. Machine Learning, 8(3–4), 279–292. 10.1007/BF00992698
- Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-Level Control through Deep Reinforcement Learning. Nature, 518(7540), 529–533. 10.1038/nature14236
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11), 2278–2324. 10.1109/5.726791
- Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with Deep Reinforcement Learning. arXiv Preprint arXiv:1312.5602. https://arxiv.org/abs/1312.5602
- Lin, L.-J. (1992). Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching. Machine Learning, 8(3–4), 293–321. 10.1007/BF00992699
- Gallici, M., Fellows, M., Ellis, B., Pou, B., Masmitja, I., Foerster, J. N., & Martin, M. (2024). Simplifying Deep Temporal Difference Learning. arXiv Preprint arXiv:2407.04811. https://arxiv.org/abs/2407.04811
- Polyak, B. T., & Juditsky, A. B. (1992). Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 30(4), 838–855. 10.1137/0330046
- Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous Control with Deep Reinforcement Learning. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1509.02971
- van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1), 2094–2100. 10.1609/aaai.v30i1.10295
- Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized Experience Replay. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1511.05952
- Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., & de Freitas, N. (2016). Dueling Network Architectures for Deep Reinforcement Learning. Proceedings of the 33rd International Conference on Machine Learning (ICML), 48, 1995–2003. https://proceedings.mlr.press/v48/wangf16.html
- Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., & Silver, D. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 3215–3222. 10.1609/aaai.v32i1.11796