Deep Q-Networks - Introduction to Deep Reinforcement Learning

In the previous chapter we developed Q-Learning in the tabular setting: a separate entry $Q(s, a)$ for every state-action pair, updated with a temporal-difference rule, guaranteed to converge to $Q^*$ under mild conditions Watkins & Dayan, 1992. This is elegant and theoretically satisfying, but it has a fatal practical limitation. The table has one row per state, so the memory and data requirements scale with the size of the state space. For the problems we actually care about — robotic control, game playing from pixels, dialogue — the state space is astronomically large or even continuous, and tabular methods break down entirely.

This chapter takes the step that launched deep reinforcement learning: we replace the Q-table with a neural-network function approximator $Q_\theta(s, a)$ , trained on the same TD targets we already know. We will see why the naive combination diverges in practice, and how two simple algorithmic ideas — experience replay and target networks — stabilize training enough to scale Q-Learning to raw pixels, culminating in the Deep Q-Network (DQN) algorithm that reached human-level performance on 49 Atari 2600 games Mnih et al., 2015.

Slides for this chapter (open full screen).

From Tables to Function Approximators¶

The Curse of Dimensionality¶

A tabular $Q$ needs $|\mathcal{S}| \cdot |\mathcal{A}|$ entries. For a simple grid-world this is fine; for an Atari 2600 frame it is hopeless. A single Atari screen is $210 \times 160$ pixels with 3 color channels at 8-bit depth, so the raw state space contains on the order of

256^{210 \times 160 \times 3} \approx 10^{242\,823}

(1)

distinct frames. Even after the preprocessing used by Mnih et al. (2015) — grayscale, downsampling to $84 \times 84$ , and stacking the last 4 frames — the state space is still far too large to enumerate. A table is not an option.

Tables Have No Generalization¶

A second, subtler problem is that tabular $Q$ assigns a completely independent value to every state. Two frames that differ only in the position of a single pixel are treated as unrelated cells in the table, and learning about one tells us nothing about the other. For high-dimensional sensory input, where nearby observations almost always share similar values, this absence of generalization means the agent must visit essentially every state — which, by the previous argument, is impossible.

Parametric Function Approximation¶

The fix, imported from supervised learning, is to represent $Q$ as a parametric function

Q_\theta(s, a) \approx Q^*(s, a), \qquad \theta \in \mathbb{R}^d,

(2)

with $d$ chosen independently of $|\mathcal{S}|$ . A neural network is a natural choice: it has enough capacity to represent complex value functions, and, unlike the table, its parameters are shared across all states. Two consequences follow. First, $Q_\theta$ is a continuous function of its input so nearby states are mapped to nearby features, and therefore to similar Q-values. Second, a single gradient step on one transition updates the same $\theta$ that defines $Q_\theta$ at every other state. This is the precise sense in which function approximation “generalizes”: what we learn at one state propagates to its neighbours through the shared weights, rather than remaining trapped in a single table cell. The price is an inductive bias — the network assumes that similar inputs should have similar values — which is only useful when that assumption matches the problem.

From a lookup table to a parametric approximator. Left: tabular Q(s, a) with one independent entry per cell — infeasible for large |\mathcal{S}| and with no generalization between rows. Right: a neural network Q_\theta(s, a) with shared parameters \theta. A single forward pass returns all |\mathcal{A}| action-values at once. — Figure 2:From a lookup table to a parametric approximator. Left: tabular $Q(s, a)$ with one independent entry per cell — infeasible for large $|\mathcal{S}|$ and with no generalization between rows. Right: a neural network $Q_\theta(s, a)$ with shared parameters $\theta$ . A single forward pass returns all $|\mathcal{A}|$ action-values at once.

Semi-Gradient TD as Supervised Learning¶

Given a parametric $Q_\theta$ , how do we train it? The Q-Learning update from the previous chapter already points the way. Recall the TD target at step $t$ :

y_t \;=\; R_{t+1} + \gamma \, \max_{a'} Q(S_{t+1}, a').

(3)

If we treat $y_t$ as a regression label for the input $(S_t, A_t)$ , we can fit $Q_\theta$ by minimizing the squared error

\mathcal{L}(\theta) \;=\; \mathbb{E} \left[ \left( y_t - Q_\theta(S_t, A_t) \right)^2 \right].

(4)

The gradient of this loss with respect to $\theta$ is

\nabla_\theta \mathcal{L}(\theta) \;=\; -2 \, \mathbb{E} \left[ \left( y_t - Q_\theta(S_t, A_t) \right) \nabla_\theta Q_\theta(S_t, A_t) \right].

(5)

This is a semi-gradient update Sutton & Barto, 2018: we differentiate $Q_\theta(S_t, A_t)$ but treat the target $y_t$ as if it did not depend on $\theta$ , even though in reality it does (through the bootstrap $\max_{a'} Q_\theta(S_{t+1}, a')$ ). The “semi-gradient” name warns us that this is not the true gradient of any objective — and, as we are about to see, that shortcut is also the source of serious stability problems.

The DQN Architecture¶

The Deep Q-Network of Mnih et al. (2015) is the concrete neural architecture that made function approximation from pixels work. A convolutional network LeCun et al., 1998 maps the stacked frame input to action-values:

Input: the last 4 preprocessed frames, stacked into an $84 \times 84 \times 4$ tensor. Frame stacking gives the network a short-term history, so that it can perceive motion and velocity.
Three convolutional layers with ReLU activations: 32 filters of size $8 \times 8$ with stride 4, then 64 filters of $4 \times 4$ with stride 2, then 64 filters of $3 \times 3$ with stride 1.
Two fully-connected layers: 512 ReLU units, then a linear output layer with $|\mathcal{A}|$ units — one Q-value per discrete action.

The key design choice is the shape of the output. Instead of modelling $Q_\theta(s, a)$ as a scalar that takes both the state and the action as inputs, the network takes only the state and outputs a vector of $|\mathcal{A}|$ Q-values in a single forward pass. At action-selection time we can therefore compute $\arg\max_a Q_\theta(s, a)$ with one network evaluation rather than $|\mathcal{A}|$ of them — a crucial optimization when we will call the network many times per environment step.

The DQN convolutional architecture of . The input is a stack of the last 4 preprocessed frames (84 \times 84 \times 4). Three convolutional layers — 32 filters 8 \times 8 with stride 4, 64 filters 4 \times 4 with stride 2, 64 filters 3 \times 3 with stride 1, each followed by ReLU — reduce the spatial dimension while growing the channel count. The resulting 7 \times 7 \times 64 feature map is flattened and passed through a 512-unit ReLU layer, followed by a linear head with one output per discrete action. A single forward pass returns the full vector of |\mathcal{A}| action-values. — Figure 3:The DQN convolutional architecture of Mnih *et al.* (2015). The input is a stack of the last 4 preprocessed frames ( $84 \times 84 \times 4$ ). Three convolutional layers — 32 filters $8 \times 8$ with stride 4, 64 filters $4 \times 4$ with stride 2, 64 filters $3 \times 3$ with stride 1, each followed by ReLU — reduce the spatial dimension while growing the channel count. The resulting $7 \times 7 \times 64$ feature map is flattened and passed through a 512-unit ReLU layer, followed by a linear head with one output per discrete action. A single forward pass returns the full vector of $|\mathcal{A}|$ action-values.

The architecture was first prototyped in a NeurIPS workshop paper Mnih et al., 2013 and scaled up to the full 49-game benchmark in the subsequent Nature paper Mnih et al., 2015. The same architecture and hyperparameters were used across all games, with no game-specific tuning.

Naive Deep Q-Learning and Why It Fails¶

Plugging the neural network directly into the Q-Learning update gives the following “naive” training procedure: at each step, take the current transition $(S_t, A_t, R_{t+1}, S_{t+1})$ , compute the TD target $y_t = R_{t+1} + \gamma \max_{a'} Q_\theta(S_{t+1}, a')$ , and take a gradient step on

\mathcal{L}(\theta) \;=\; \left( y_t - Q_\theta(S_t, A_t) \right)^2.

(6)

In practice, this diverges. Mnih et al. (2015) identify two concrete failure modes.

Sample Correlation¶

Consecutive transitions produced by a Markovian agent interacting with its environment are highly correlated: $S_{t+1}$ depends on $S_t$ , and the agent’s own policy changes slowly over time. Stochastic gradient descent assumes mini-batches that are approximately independent and identically distributed (i.i.d.), but a stream of correlated transitions violates this assumption badly. The result is that gradient estimates are systematically biased toward whatever the agent is doing right now, the network overfits to the latest region of state space, and earlier competence is erased — a phenomenon that in supervised learning is called catastrophic forgetting.

Moving Targets¶

The second failure mode is more insidious. The TD target

y_t \;=\; R_{t+1} + \gamma \max_{a'} Q_\theta(S_{t+1}, a')

(7)

depends on the very parameters $\theta$ we are updating. An SGD step that changes $\theta$ also changes $y_t$ itself, which changes the loss landscape, which changes the next step, and so on. The network is, in effect, chasing its own tail: small fluctuations in $\theta$ feed back into the regression targets and are amplified on the next update. In combination with function approximation and off-policy bootstrapping, this positive feedback frequently causes training to diverge Sutton & Barto, 2018.

Experience Replay¶

The first stabilizing idea is experience replay, introduced by Lin (1992) and scaled up by Mnih et al. (2015). Rather than training on transitions in the order they are experienced, the agent stores every transition in a large first-in-first-out buffer $\mathcal{D}$ of capacity $N$ (typically $N = 10^6$ ) and draws uniform random mini-batches from $\mathcal{D}$ for each gradient step.

This gives three distinct benefits:

Decorrelation. Samples in a mini-batch are drawn from different time steps — and often different episodes — so they look much closer to i.i.d. than the raw trajectory. Gradient estimates are less biased, and SGD’s assumptions are (approximately) restored.
Sample efficiency. Each transition can be used many times in gradient updates rather than exactly once, which is especially valuable when environment interaction is slow or expensive.
Smoother training distribution. Averaging over the recent history of behavior reduces the variance of the training distribution and prevents oscillations driven by the agent’s most recent policy.

Experience replay only makes sense if the algorithm is off-policy: because past transitions were collected under older policies that differ from the current $\epsilon$ -greedy policy, we need a learning rule whose update is valid under a different behavior distribution than the target policy. Q-Learning is off-policy — its update is based on $\max_{a'} Q(S', a')$ regardless of which action the agent actually took — which is exactly what makes this trick available. (The on-policy counterpart, SARSA, cannot use experience replay without correction.)

A replay buffer is not without cost: maintaining and sampling a large buffer adds memory and computational overhead, and its footprint makes it hard to colocate data with environment simulation and learning on a single GPU in one fused, highly parallel loop Gallici et al., 2024.

Target Networks¶

Experience replay addresses the correlation problem but not the moving-target problem. The second idea, also from Mnih et al. (2015), is to maintain a second copy of the network parameters, denoted $\theta^-$ , used only to compute the TD target:

y \;=\; r + \gamma \, \max_{a'} Q_{\theta^-}(s', a').

(8)

The online parameters $\theta$ are updated every SGD step as usual. The target parameters $\theta^-$ are held fixed and synchronized to $\theta$ every $C$ gradient steps via the hard update

\theta^- \leftarrow \theta.

(9)

Between syncs, $\theta^-$ is frozen, so the regression label $y$ is stationary from the optimizer’s point of view — just like a label in supervised learning. The self-referential feedback loop that caused divergence in the naive scheme is broken: $\theta$ can move freely for $C$ steps before the target catches up.

Two copies of the network. The online network Q_\theta is updated every step by gradient descent on the TD loss. The target network Q_{\theta^-} (dashed border) is frozen between syncs and used only to evaluate the TD target. Every C steps its parameters are overwritten with the current \theta. — Figure 5:Two copies of the network. The online network $Q_\theta$ is updated every step by gradient descent on the TD loss. The target network $Q_{\theta^-}$ (dashed border) is frozen between syncs and used only to evaluate the TD target. Every $C$ steps its parameters are overwritten with the current $\theta$ .

The DQN Algorithm¶

Putting experience replay and target networks together with Q-Learning gives the full DQN algorithm of Mnih et al. (2015):

Initialize $Q_\theta$ with random $\theta$ ; set $\theta^- \leftarrow \theta$ ; initialize an empty replay buffer $\mathcal{D}$ of capacity $N$ .
For each episode:
1. Observe initial state $S$ .
2. For each step $t$ of the episode:
  1. Select $A$ from $S$ using $\epsilon$ -greedy on $Q_\theta(S, \cdot)$ .
  2. Execute $A$ ; observe reward $R$ , next state $S'$ , and termination flag $\text{done}$ .
  3. Store the transition $(S, A, R, S', \text{done})$ in $\mathcal{D}$ .
  4. Sample a random mini-batch $\mathcal{B}$ of transitions $(s, a, r, s', \text{done})$ from $\mathcal{D}$ .
  5. For each transition in $\mathcal{B}$ , compute the TD target
    $y = \begin{cases} r & \text{if } \text{done},\\ r + \gamma \max_{a'} Q_{\theta^-}(s', a') & \text{otherwise.}\end{cases}$
    (10)
  6. Take an SGD step on $\mathcal{L}(\theta) = \frac{1}{|\mathcal{B}|} \sum_{\mathcal{B}} \left( y - Q_\theta(s, a) \right)^2$ .
  7. Every $C$ steps, synchronize the target network: $\theta^- \leftarrow \theta$ .
  8. $S \leftarrow S'$ .

The DQN training loop. An \epsilon-greedy step on the online network Q_\theta produces a new transition that is pushed into the replay buffer \mathcal{D}. A random mini-batch from \mathcal{D} is used to compute the TD target via the frozen target network Q_{\theta^-}, followed by an SGD step on the online parameters. Every C steps, \theta^- is overwritten with \theta. — Figure 6:The DQN training loop. An $\epsilon$ -greedy step on the online network $Q_\theta$ produces a new transition that is pushed into the replay buffer $\mathcal{D}$ . A random mini-batch from $\mathcal{D}$ is used to compute the TD target via the frozen target network $Q_{\theta^-}$ , followed by an SGD step on the online parameters. Every $C$ steps, $\theta^-$ is overwritten with $\theta$ .

The hyperparameters reported by Mnih et al. (2015) for the Atari benchmark are a good starting point in practice: replay buffer capacity $N = 10^6$ , mini-batch size 32, target-sync period $C = 10{,}000$ updates, discount $\gamma = 0.99$ , and an $\epsilon$ that is annealed linearly from 1.0 to 0.1 over the first million frames and held fixed thereafter. The optimizer is RMSProp with a learning rate on the order of $2.5 \times 10^{-4}$ . Rewards are clipped to $\{-1, 0, +1\}$ to keep gradient magnitudes comparable across games.

The Atari Breakthrough¶

The significance of DQN is not in any single one of these ideas — neural-network function approximation, experience replay, TD-style losses, and convolutional nets all existed before. The significance is that putting them together produced the first agent that learned to play 49 Atari 2600 games at or above human level directly from raw pixels, using a single architecture, a single set of hyperparameters, and only the game score as a reward signal Mnih et al., 2015. Before DQN, reinforcement learning with neural networks was a niche, unstable affair. After DQN, deep RL became a full research subfield.

Extensions¶

DQN has been refined in many directions since 2015, each addressing a specific limitation. Double DQN Hasselt et al., 2016 decouples action selection from action evaluation in the max operator to reduce the well-known overestimation bias of standard Q-Learning. Prioritized Experience Replay Schaul et al., 2016 samples transitions from $\mathcal{D}$ with probability proportional to their TD error instead of uniformly, focusing learning on the most informative experiences. Dueling DQN Wang et al., 2016 factorizes $Q_\theta$ into a state-value stream and an advantage stream, improving learning in states where the choice of action matters little. Rainbow Hessel et al., 2018 combines these and three further improvements into a single agent that substantially outperforms vanilla DQN.

Summary¶

This chapter took the step from tabular Q-Learning to deep reinforcement learning:

Function approximation with neural networks replaces the Q-table with $Q_\theta(s, a)$ . It makes high-dimensional state spaces tractable and gives us generalization between similar states for free.
Naive deep Q-Learning is unstable. Consecutive transitions are correlated, and the TD target depends on the parameters being optimized — a feedback loop that causes divergence.
Experience replay stores past transitions in a buffer and trains on uniform random mini-batches. It decorrelates samples, reuses data, and smooths the training distribution. It requires an off-policy algorithm such as Q-Learning.
Target networks evaluate the TD target with a frozen copy $\theta^-$ of the parameters, synchronized to $\theta$ every $C$ steps. They make the regression targets locally stationary, removing the self-referential feedback loop.

Together, these ideas turn Q-Learning into the DQN algorithm Mnih et al., 2015 that learns Atari from pixels. In the next chapter we leave value-based methods behind and look at policy-based methods, starting with the REINFORCE algorithm — a different way of parameterising and optimizing the agent, one that extends naturally to continuous action spaces where $\arg\max_a Q_\theta(s, a)$ is no longer tractable.

References¶

Watkins, C. J. C. H., & Dayan, P. (1992). Q-Learning. Machine Learning, 8(3–4), 279–292. 10.1007/BF00992698
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-Level Control through Deep Reinforcement Learning. Nature, 518(7540), 529–533. 10.1038/nature14236
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11), 2278–2324. 10.1109/5.726791
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with Deep Reinforcement Learning. arXiv Preprint arXiv:1312.5602. https://arxiv.org/abs/1312.5602
Lin, L.-J. (1992). Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching. Machine Learning, 8(3–4), 293–321. 10.1007/BF00992699
Gallici, M., Fellows, M., Ellis, B., Pou, B., Masmitja, I., Foerster, J. N., & Martin, M. (2024). Simplifying Deep Temporal Difference Learning. arXiv Preprint arXiv:2407.04811. https://arxiv.org/abs/2407.04811
Polyak, B. T., & Juditsky, A. B. (1992). Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 30(4), 838–855. 10.1137/0330046
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., & Wierstra, D. (2016). Continuous Control with Deep Reinforcement Learning. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1509.02971
van Hasselt, H., Guez, A., & Silver, D. (2016). Deep Reinforcement Learning with Double Q-learning. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1), 2094–2100. 10.1609/aaai.v30i1.10295
Schaul, T., Quan, J., Antonoglou, I., & Silver, D. (2016). Prioritized Experience Replay. Proceedings of the 4th International Conference on Learning Representations (ICLR). https://arxiv.org/abs/1511.05952
Wang, Z., Schaul, T., Hessel, M., van Hasselt, H., Lanctot, M., & de Freitas, N. (2016). Dueling Network Architectures for Deep Reinforcement Learning. Proceedings of the 33rd International Conference on Machine Learning (ICML), 48, 1995–2003. https://proceedings.mlr.press/v48/wangf16.html
Hessel, M., Modayil, J., van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., & Silver, D. (2018). Rainbow: Combining Improvements in Deep Reinforcement Learning. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 3215–3222. 10.1609/aaai.v32i1.11796