Imagine teaching a robot to navigate a maze or training an AI to master a video game without ever giving it explicit instructions—only rewarding it when it does something right. This is the essence of Reinforcement Learning (RL), a branch of machine learning in which agents learn by interacting with an environment to achieve a goal. One of the most foundational and widely used algorithms in this field is Q-Learning.
Q-learning stands out because it’s simple to understand, easy to implement, and doesn’t require an environment model—making it a perfect starting point for anyone new to reinforcement learning. It forms the basis of more advanced techniques like Deep Q-Networks (DQN) and is behind many real-world applications, from autonomous driving to personalized recommendations.

In this post, we’ll explain how Q-Learning works, cover its core concepts, walk through a step-by-step example, and highlight when (and when not) to use it. Whether you’re a curious beginner or looking to refresh your RL knowledge, this guide will help you build a strong foundation in Q-Learning.
What is Q-Learning?
Q-learning is a type of model-free reinforcement learning algorithm. That means it enables an agent to learn how to act optimally in a given environment without knowing how it works in advance.
At its core, Q-Learning helps an agent learn a policy—a strategy for choosing actions—by estimating the expected long-term reward of a specific action in a given state. This estimate is captured in a data structure called the Q-table, where:
- Each row corresponds to a possible state the agent can be in.
- Each column represents an action the agent can take.
- The entries in the table are called Q-values and represent the agent’s current best estimate of the total future reward it can expect if it starts from that state and takes that action.
The name Q stands for “quality,” meaning the quality of a particular state-action combination.
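For a small problem, the Q-table can be represented as a simple 2-D array. Here is a minimal sketch using NumPy, with placeholder sizes that happen to match the Gridworld example later in this post:

import numpy as np

n_states = 16   # e.g., a 4x4 grid flattened into 16 states
n_actions = 4   # e.g., up, down, left, right

# Rows index states, columns index actions; each entry is a Q-value.
q_table = np.zeros((n_states, n_actions))

print(q_table[5, 2])  # Q-value for taking action 2 in state 5 (0.0 before any learning)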
How Q-Learning Works
The agent interacts with the environment over a series of steps:
- It observes its current state.
- It chooses an action (often by exploring or exploiting what it already knows).
- The environment responds with a reward and a new state.
- The agent updates the Q-value in its table using the observed reward and its estimate of future rewards.

Over time, as the agent explores more and updates its Q-table, it learns which actions yield the most reward in each state. Once trained, it can use the Q-table to make decisions that maximize its long-term success.
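In code, that interaction loop looks roughly like the sketch below. It assumes the Gymnasium library and its FrozenLake environment purely for illustration, and picks random actions as a stand-in for the ε-greedy selection and Q-value update described in the rest of this post:

import gymnasium as gym

env = gym.make("FrozenLake-v1")
state, _ = env.reset()                    # 1. observe the current state

done = False
while not done:
    action = env.action_space.sample()    # 2. choose an action (random placeholder)
    next_state, reward, terminated, truncated, _ = env.step(action)  # 3. reward + new state
    # 4. a Q-learning agent would update Q(state, action) here
    state = next_state
    done = terminated or truncated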
Key Concepts
Before diving deeper into Q-learning, it’s essential to understand a few fundamental concepts in reinforcement learning. These building blocks define how the agent interacts with and learns from its environment.
Agent
The agent is the learner or decision-maker: the entity that takes actions in the environment to achieve a goal (e.g., a robot, a game-playing AI, or a virtual assistant).
Environment
The environment is everything the agent interacts with. It responds to the agent’s actions and provides feedback through new states and rewards.
States (s)
A state is a specific situation or configuration in which the agent finds itself. For example, in a grid world, a state might be the agent’s current location on the grid.
Actions (a)
An action is a choice the agent makes at each state. In a game, this might be “move left” or “jump.” The set of possible actions may depend on the state.
Rewards (r)
A reward is a numerical signal that tells the agent how good or bad its last action was. The agent aims to maximize the total reward it receives over time.
Q-Values (Q(s, a))
A Q-value estimates the expected total future reward the agent will receive by taking action a in state s and then following the optimal policy. These values are stored in the Q-table.
Exploration vs. Exploitation
- Exploration: Trying new actions to discover their potential rewards.
- Exploitation: Choosing the action that has the highest known reward so far.
- Q-learning balances these through strategies like ε-greedy, which picks a random action with probability ε and the best-known action with probability 1–ε.
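A minimal sketch of ε-greedy selection over a NumPy Q-table (the function and variable names here are just illustrative):

import numpy as np
import random

def epsilon_greedy(q_table, state, epsilon):
    """Random action with probability epsilon, otherwise the best-known action."""
    if random.random() < epsilon:
        return random.randrange(q_table.shape[1])   # explore
    return int(np.argmax(q_table[state]))           # exploit

q = np.zeros((16, 4))                               # toy Q-table
action = epsilon_greedy(q, state=0, epsilon=0.1)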
Learning Rate (α)
The learning rate controls how much new information overrides old knowledge. A higher α means the agent learns faster but might forget helpful past experiences too quickly.
Discount Factor (γ)
The discount factor determines how much future rewards are valued compared to immediate ones. A γ close to 1 indicates that the agent cares more about long-term rewards, while a lower γ focuses more on immediate payoff.
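For intuition, a reward received three steps from now is worth γ³ times its face value today: about 0.73 with γ = 0.9, but only 0.125 with γ = 0.5. A small sketch of how the discounted return behaves for different γ values:

# Discounted return of a reward sequence r_0, r_1, r_2, ...: sum of gamma^t * r_t
rewards = [1, 1, 1, 1]   # the same reward at each of four consecutive steps

def discounted_return(rewards, gamma):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

print(discounted_return(rewards, gamma=0.99))  # ≈ 3.94: future rewards barely discounted
print(discounted_return(rewards, gamma=0.5))   # 1.875: future rewards shrink quickly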
Policy
A policy is the agent’s strategy for mapping states to actions. Q-learning helps the agent discover an optimal policy by maximizing Q-values over time.
The Q-Learning Algorithm
Q-learning is an off-policy, model-free reinforcement learning algorithm that enables an agent to learn the best actions to take in an environment using trial and error. The algorithm updates a Q-table, which stores estimates of the total expected reward for taking a specific action in a particular state.
The Core Idea
Q-Learning uses the Bellman equation to update its Q-values iteratively. After each action, the agent updates the value of a state-action pair using this formula:

Q(s, a) ← Q(s, a) + α [ R + γ max_{a′} Q(s′, a′) − Q(s, a) ]
Where:
- s: current state
- a: action taken
- R: reward received after taking action a
- s′: next state
- max_{a′} Q(s′, a′): best possible Q-value from the next state
- α: learning rate
- γ: discount factor
This update rule helps the agent learn not just from immediate rewards but also from expected future rewards.
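As a quick illustration with made-up numbers: if Q(s, a) is currently 0, the observed reward is −1, the best Q-value in the next state is 5, α = 0.1, and γ = 0.9, the update gives 0 + 0.1 × (−1 + 0.9 × 5 − 0) = 0.35:

# One Q-learning update with made-up numbers
alpha, gamma = 0.1, 0.9
q_sa = 0.0          # current estimate of Q(s, a)
reward = -1.0       # R observed after taking action a
max_q_next = 5.0    # max_a' Q(s', a') in the next state

q_sa = q_sa + alpha * (reward + gamma * max_q_next - q_sa)
print(q_sa)         # ≈ 0.35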
Step-by-Step Breakdown
Here’s how the Q-Learning algorithm works in practice:
- Initialize the Q-table: For all state-action pairs (s, a), initialize Q-values arbitrarily (usually zero).
- For each episode (one complete run through the environment): start in the initial state s.
- Repeat (for each step within the episode):
  - Choose an action a using an exploration strategy (e.g., ε-greedy):
    - With probability ε, choose a random action (explore).
    - With probability 1–ε, choose the action with the highest Q-value for the current state (exploit).
  - Take the action a, observe the reward r and the next state s′.
  - Update the Q-value using the Bellman update equation.
  - Set the new state s = s′.
- Repeat until the episode ends (e.g., reaching a goal state or a maximum number of steps).
- Repeat for many episodes, gradually allowing the Q-table to converge to optimal values.
Pseudocode
Initialize Q(s, a) arbitrarily
For each episode:
    Initialize state s
    Repeat until s is terminal:
        Choose action a using ε-greedy policy
        Take action a, observe reward r and next state s'
        Q(s, a) ← Q(s, a) + α [r + γ * max_a' Q(s', a') - Q(s, a)]
        s ← s'
Convergence
If every state-action pair continues to be tried over time and the learning rate decays appropriately, Q-Learning is guaranteed to converge to the optimal Q-values, and therefore to an optimal policy.
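In practice, this is usually approximated with simple decay schedules rather than exact theoretical conditions; an illustrative sketch (the constants are arbitrary):

# Illustrative decay schedules for the learning and exploration rates
alpha, epsilon = 0.5, 1.0
for episode in range(1, 1001):
    alpha = max(0.01, 0.5 / episode)        # learning rate shrinks over time
    epsilon = max(0.01, epsilon * 0.995)    # exploration decays gradually
    # ... run one Q-learning episode here using the current alpha and epsilon ...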
Example: Q-Learning in Gridworld
To make Q-Learning more concrete, let’s walk through a simple example using a Gridworld—a small environment where an agent learns to navigate to a goal.
The Environment
Imagine a 4×4 grid. The agent starts at the top-left corner (0,0) and wants to reach the goal at the bottom-right corner (3,3). The rules are:
- The agent can move up, down, left, or right.
- Each move gives a reward of -1 to encourage the agent to reach the goal faster.
- Reaching the goal provides a reward of +10 and ends the episode.
- The environment has no obstacles for simplicity.

Step 1: Initialise
Create a Q-table with all states and actions initialized to 0. For a 4×4 grid and 4 actions, this means a table of shape (16, 4).
Step 2: Implement Q-Learning
We apply the Q-learning algorithm to help the agent learn the best way to reach the goal.
Python Code Example
import numpy as np
import random

# Gridworld setup
grid_size = 4
n_states = grid_size * grid_size
n_actions = 4  # up, down, left, right
q_table = np.zeros((n_states, n_actions))

# Parameters
alpha = 0.1      # learning rate
gamma = 0.99     # discount factor
epsilon = 0.1    # exploration rate
episodes = 500

# Action mapping
actions = {
    0: (-1, 0),  # up
    1: (1, 0),   # down
    2: (0, -1),  # left
    3: (0, 1)    # right
}

def state_to_index(x, y):
    return x * grid_size + y

def is_valid(x, y):
    return 0 <= x < grid_size and 0 <= y < grid_size

for ep in range(episodes):
    x, y = 0, 0  # Start at top-left
    while (x, y) != (3, 3):  # Goal at bottom-right
        state = state_to_index(x, y)
        # ε-greedy action selection
        if random.uniform(0, 1) < epsilon:
            action = random.randint(0, 3)
        else:
            action = np.argmax(q_table[state])
        dx, dy = actions[action]
        nx, ny = x + dx, y + dy
        if not is_valid(nx, ny):
            nx, ny = x, y  # stay in place if invalid move
        next_state = state_to_index(nx, ny)
        reward = 10 if (nx, ny) == (3, 3) else -1
        # Q-value update
        q_table[state, action] += alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        x, y = nx, ny  # Move to next state
Step 3: Result
After training, the Q-table contains learned values that guide the agent toward the goal with minimal steps. You can now use it to extract an optimal policy—the best action from any position.
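Extracting that policy is a one-line argmax per state. A minimal sketch, reusing q_table and grid_size from the training script above:

# Greedy policy: the best action index for every state
policy = np.argmax(q_table, axis=1)

# Show it as a 4x4 grid of action symbols (order matches the action mapping: up, down, left, right)
arrows = np.array(['↑', '↓', '←', '→'])
print(arrows[policy].reshape(grid_size, grid_size))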
Why This Works
Even with simple rules and no knowledge of the environment’s dynamics, Q-Learning helps the agent learn from experience. Over many episodes, it discovers that shorter paths yield higher rewards and adjusts its behaviour accordingly.
Advantages and Limitations
Q-learning is a foundational reinforcement learning algorithm with many strengths but not without challenges. Understanding its pros and cons will help you decide when it’s the right tool for the job—and when to look for more advanced alternatives.
Advantages
Model-Free Learning
Q-Learning doesn’t require a model of the environment’s dynamics. It learns purely from interaction, making it suitable for complex or unknown systems.
Simple and Intuitive
The algorithm is easy to understand and implement. With just a Q-table and a few parameters, you can train an agent to learn intelligent behaviour.
Convergence Guarantee
Given enough exploration and an appropriate learning rate decay, Q-Learning is mathematically proven to converge to the optimal policy.
Versatile Across Domains
It works well for a wide range of discrete, small-scale problems, such as maze navigation, basic games, and simple simulations.
Limitations
Doesn’t Scale Well
Q-learning relies on storing a Q-value for every state-action pair. In environments with large or continuous state spaces, the Q-table becomes too large to manage effectively.
Requires Discretization
For continuous states (like position or velocity), you must discretise the environment, which can introduce approximation errors and limit performance.
Slow Learning in Large Spaces
Q-learning can converge slowly, especially when the state-action space is large and the agent needs to explore many paths.
Exploration Strategy Needs Tuning
Balancing exploration (trying new actions) and exploitation (choosing the best-known actions) requires careful tuning of ε and decay rates. Poor choices can lead to suboptimal learning.
Doesn’t Handle Stochastic Policies Well
Q-learning assumes a deterministic policy based on maximum Q-values. This may not be ideal in highly uncertain or partially observable environments.
Q-learning is a great starting point for reinforcement learning, offering clarity and foundational insights. However, its limitations in scalability and performance on complex tasks often lead practitioners to more advanced methods like Deep Q-Networks (DQN) or Policy Gradient methods for real-world applications.
Q-Learning vs Deep Q-Learning (DQN)
While Q-Learning is robust for small, simple environments, it hits a wall when dealing with large or continuous state spaces. That’s where Deep Q-Learning (DQN) comes in—a modern extension that leverages neural networks to scale Q-Learning to complex, high-dimensional problems.
What Is Deep Q-Learning (DQN)?
Deep Q-Learning replaces the Q-table with a deep neural network that estimates Q-values. Instead of looking up Q-values in a table, the network approximates them by learning a function Q(s, a) that maps states and actions to expected rewards.
This approach enables reinforcement learning in environments where enumerating every state-action pair is impossible, such as video games, robotics, or continuous control problems.
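To make the contrast concrete, here is a minimal sketch of such a Q-network, assuming PyTorch; the layer sizes are arbitrary, and the training loop, replay buffer, and target network are omitted:

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per action, replacing the Q-table."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, state):
        return self.net(state)              # shape: (batch, n_actions)

q_net = QNetwork(state_dim=4, n_actions=2)
q_values = q_net(torch.randn(1, 4))         # Q-values for one 4-dimensional state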
Key Differences
| Feature | Q-Learning | Deep Q-Learning (DQN) |
|---|---|---|
| Q-value storage | Explicit table (Q-table) | Neural network function approximator |
| Scalability | Limited to small, discrete spaces | Handles large and continuous spaces |
| State representation | Must be discrete (e.g., grid) | Can use raw input (e.g., images, pixels) |
| Memory usage | Low | High (requires training and memory) |
| Exploration strategy | ε-greedy | ε-greedy + experience replay |
| Learning stability | Stable for small problems | Requires tricks to stabilize learning (e.g., target networks, replay buffers) |
DQN Innovations That Make It Work
Deep Q-learning uses several enhancements to improve performance and stability:
- Experience Replay: Stores past experiences and randomly samples them to break correlations between consecutive data points (see the sketch after this list).
- Target Network: A separate copy of the Q-network used to compute stable targets during training.
- Frame Stacking (in games): Combines several frames as input to give the agent a sense of motion or context.
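A minimal sketch of the experience replay buffer from the first bullet above, using only the standard library (this is illustrative, not a full DQN implementation):

import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) tuples and samples them at random."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

buffer = ReplayBuffer()
buffer.push(0, 1, -1.0, 2, False)           # store one transition during play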
When to Use Which?
- Use Q-Learning if:
- Your state space is small and discrete (e.g., a toy grid world, basic board games).
- You want to understand the core mechanics of reinforcement learning.
- Use DQN if:
- Your state space is large or continuous.
- You’re working with complex data like images, sensor readings, or unstructured environments.
Q-Learning and DQN share the same goal—learning the optimal action policy through rewards—but differ in their tools and scalability. Q-Learning teaches the core principles, while DQN applies those principles to the real world using deep learning.
Practical Tips
Getting Q-Learning to work effectively in real applications requires more than just plugging in the algorithm. Here are some practical tips to help you get better results and avoid common pitfalls.
Choose Good Hyperparameters
- Learning Rate (α): Start with a small value like 0.1. Too high, and the agent may forget past learning; too low, and learning becomes slow.
- Discount Factor (γ): A value close to 1 (like 0.9 or 0.99) encourages the agent to plan ahead.
- Exploration Rate (ε): Begin with a high value (e.g., 1.0) and decay over time to shift from exploration to exploitation gradually.
Use ε-Greedy with Decay
To balance exploration and exploitation:
- Start with a high ε to encourage trying new actions.
- Decay ε gradually (e.g., ε = max(0.01, ε * 0.995)) after each episode.
- This helps the agent explore early and exploit knowledge later.
Initialise Q-values Optimistically
Set initial Q-values to a higher number (like 1 instead of 0). This encourages the agent to explore different actions early on, as it expects higher rewards.
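For example (the sizes and value are arbitrary):

import numpy as np

# Optimistic initialization: every Q-value starts at 1.0 instead of 0.0
q_table = np.full((16, 4), 1.0)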
Limit Episode Lengths
Cap the number of steps per episode to avoid infinite loops. For example, limit episodes to 100 or 200 steps in environments with no terminal states.
Visualise the Policy
Plot arrows or heatmaps to show the agent’s learned policy across states. This helps you debug and see if the learning is progressing as expected.
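A quick sketch with matplotlib, using a random Q-table as a stand-in for a trained one shaped like the Gridworld example:

import numpy as np
import matplotlib.pyplot as plt

grid_size = 4
q_table = np.random.rand(grid_size * grid_size, 4)    # stand-in for a trained table

# Heatmap of the best Q-value in each state; brighter cells are more valuable
state_values = q_table.max(axis=1).reshape(grid_size, grid_size)
plt.imshow(state_values, cmap='viridis')
plt.colorbar(label='max Q-value')
plt.title('Learned state values')
plt.show()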
Track Reward Trends
Keep track of the total reward per episode. A rising trend over time usually indicates that the agent is learning. Sudden drops may mean bad parameter tuning or unstable updates.
Normalise or Discretise Effectively
If your environment has continuous state variables:
- Discretise them into bins for classic Q-learning.
- Too many bins? Consider switching to a function approximator like a neural network (i.e., use DQN).
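A minimal binning sketch for a single continuous variable (the range and bin count are arbitrary):

import numpy as np

# Discretise a continuous position in [0, 1] into 10 bins
bins = np.linspace(0.0, 1.0, 11)               # 11 edges -> 10 bins
position = 0.37
state = int(np.digitize(position, bins)) - 1   # bin index, usable as a Q-table row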
Keep It Simple First
Before tackling complex environments, test Q-Learning on small, simple problems like Gridworld or FrozenLake. It’s easier to tune and debug in a controlled setting.
Watch Out for Dead Ends
In some environments, poor exploration can trap the agent in low-reward zones. To avoid local optima, use exploration techniques or revisit learning rate/epsilon decay settings.
Save and Resume Training
Q-tables can be saved as simple files (e.g., CSV or pickle format). This allows you to resume training later or deploy a trained agent without retraining from scratch.
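For a NumPy Q-table, saving and loading can be as simple as (the file name is arbitrary):

import numpy as np

q_table = np.zeros((16, 4))

np.save('q_table.npy', q_table)      # save after training
q_table = np.load('q_table.npy')     # load later to resume training or deploy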
Conclusion
Q-Learning is a powerful yet approachable algorithm that introduces core concepts in reinforcement learning—like states, actions, rewards, and value estimation—through a simple trial-and-error framework. Its elegance lies in its simplicity: continually updating a Q-table allows an agent to learn optimal behaviour in various environments without knowing the rules beforehand.
While it excels in smaller, discrete settings, Q-learning lays the groundwork for more advanced methods like Deep Q-Learning, which uses neural networks to scale to complex, high-dimensional problems.
Whether you’re a beginner looking to understand the basics of reinforcement learning or a practitioner building toward deep RL models, mastering Q-Learning is essential. With careful tuning, experimentation, and the tips covered in this post, you’ll be well on your way to training intelligent agents that learn from their environment and make better decisions over time.