Backpropagation Made Easy With Examples And How To In Python With Keras

by Neri Van Otten | Feb 24, 2023 | Artificial Intelligence, Machine Learning

Why is backpropagation important in neural networks? How does it work, how is it calculated, and where is it used? With a Python tutorial in Keras.

Introduction to backpropagation in Machine Learning

Backpropagation is the algorithm used to train artificial neural networks in supervised learning. It computes the gradient of the error with respect to every weight and bias in the network.

Gradient descent then uses these gradients to update the weights and biases. The goal is to make the difference between the network’s predicted output and the true output as small as possible.

The backpropagation algorithm consists of two phases: a forward phase and a backward phase. In the forward phase, the input is propagated through the neural network, layer by layer, until the output is produced. The result is then compared to the true output, and the error between the two is calculated.

Using the chain rule of differentiation, the error is propagated back through the network layer by layer during the backward phase. Gradient descent then updates the weights and biases in each layer based on how the error changes with respect to them. This is repeated over the training examples, and the weights and biases are updated again and again until the error stops decreasing or is acceptably small.

Backpropagation is a robust algorithm that trains many neural network architectures, such as feedforward neural networks, recurrent neural networks, convolutional neural networks, and more. Networks trained this way perform well on complicated machine learning problems, such as classifying images, processing natural language, and recognising speech. But it’s essential to remember that backpropagation can be tricky to implement by hand and usually needs a lot of training data to work well.

Backpropagation is essential for image classification

What is backpropagation in neural networks?

Backpropagation is a widely used algorithm for training artificial neural networks. It is a supervised learning method that enables a neural network to learn from a dataset by adjusting its weights and biases.

In backpropagation, the network’s output is compared to the desired output, and the difference between the two is found. This error is then sent back through the network, layer by layer, using the chain rule to work out how much each weight and bias contributed to it. A process called “gradient descent” uses these gradients to change the weights and biases in each layer.

The goal of backpropagation is to minimise the error between the network’s output and the desired output by finding the set of weights and biases that produces the smallest error. This process is iterative and involves multiple rounds of forward and backward propagation until the network’s output reaches an acceptable level of accuracy.

Backpropagation is a very important part of the field of neural networks because it makes it possible to train deep neural networks with many layers.

Which neural networks use backpropagation?

Backpropagation is a common algorithm for training a wide range of neural network architectures. It is a standard method for updating the weights and biases in the network during the training process, and it can be used with many different types of neural networks, including:

  1. Feedforward Neural Networks: These are the simplest type of neural network, consisting of input, hidden, and output layers. The weights are updated using backpropagation during the training process.
  2. Convolutional Neural Networks (CNNs): These are often used to classify images and do other tasks related to computer vision. After using convolutional layers to extract features from images, they use pooling layers to reduce the dimensionality of the data. Backpropagation is used to update the weights in the network during the training process.
  3. Recurrent Neural Networks (RNNs): These are often used for processing natural language and other tasks involving sequential data. They use recurrent layers to capture temporal dependencies in the data. During the training process, backpropagation through time (BPTT) is used to update the weights in the network.
  4. Long Short-Term Memory (LSTM) Networks: These are RNNs designed to capture long-term dependencies in the data. They use a dedicated memory cell to store information over time. Backpropagation through time is used to update the weights in the network during the training process.
  5. Autoencoders: These are neural networks trained to reconstruct their input data. They consist of an encoder network that maps the input data to a lower-dimensional representation and a decoder network that maps the representation back to the original input. Backpropagation is used to update the weights in the network during the training process.

How does the algorithm work?

The backpropagation algorithm can be summarised in the following steps:

  1. Initialise the weights and biases in the neural network with random values.
  2. Feed the input data through the network to obtain the output.
  3. Calculate the error between the output and the expected output.
  4. Compute the gradient of the error with respect to the weights and biases in the network using the chain rule of differentiation. This involves propagating the error back through the network and calculating the partial derivatives of the error with respect to each weight and bias.
  5. Using the gradient descent algorithm, update the weights and biases in the network by subtracting the gradient times a learning rate from the current weight or bias value.
  6. Repeat steps 2-5 for each training example in the dataset for a specified number of epochs or until the error falls below a certain threshold.
  7. Use the trained network to make predictions on new, unseen data.

The backpropagation algorithm can take a lot of processing power, especially for large datasets and networks with many layers and neurons. Optimisation techniques such as mini-batch gradient descent, momentum, and adaptive learning rates can be used to speed up training.
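As a rough illustration, the seven steps listed above can be sketched in NumPy. The XOR dataset, the four hidden neurons, the learning rate, the number of epochs and the random seed are all assumptions made for this sketch; none of them come from the article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Toy dataset (XOR), assumed here purely for illustration
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0], [1.0], [1.0], [0.0]])

# Step 1: initialise the weights and biases with random values
W1, b1 = rng.normal(size=(2, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 1)), rng.normal(size=1)

def mean_error(X, Y):
    """Average of 0.5 * (y - o)^2 over the dataset, using the current weights."""
    H = sigmoid(X @ W1 + b1)
    O = sigmoid(H @ W2 + b2)
    return float(np.mean(0.5 * (Y - O) ** 2))

learning_rate = 0.5
initial_error = mean_error(X, Y)

for epoch in range(5000):                        # Step 6: repeat for a number of epochs
    for x, y in zip(X, Y):                       # ... and for each training example
        # Step 2: feed the input through the network
        h = sigmoid(x @ W1 + b1)                 # hidden activations, shape (4,)
        o = sigmoid(h @ W2 + b2)                 # network output, shape (1,)

        # Steps 3 and 4: error 0.5 * (y - o)^2 and its gradients via the chain rule
        delta_o = (o - y) * o * (1 - o)          # error signal at the output neuron
        delta_h = (W2 @ delta_o) * h * (1 - h)   # error signal at the hidden neurons

        # Step 5: gradient descent update
        W2 -= learning_rate * np.outer(h, delta_o)
        b2 -= learning_rate * delta_o
        W1 -= learning_rate * np.outer(x, delta_h)
        b1 -= learning_rate * delta_h

final_error = mean_error(X, Y)

# Step 7: use the trained network to make predictions
predictions = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print(f"error before training: {initial_error:.4f}, after: {final_error:.4f}")
print(predictions.round(2))
```

Updating after every single example, as done here, is the stochastic variant of gradient descent; averaging the gradients over a small batch before each update would give the mini-batch variant mentioned above.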

A simple backpropagation example

Let’s take a simple example to illustrate backpropagation:

Suppose we have a neural network with a single input layer, a hidden layer, and an output layer, as shown below:

Input Layer      Hidden Layer      Output Layer
 (1 neuron)      (2 neurons)        (1 neuron)

          w11          w21
        -------> h1 -------
       /                   \
    x1                      o1  (compared with y)
       \                   /
        -------> h2 -------
          w12          w22

where x1 is the input to the network, h1 and h2 are the hidden layer neurons, o1 is the output neuron, w11, w12, w21, and w22 are the weights connecting the neurons, and y is the desired output. Biases are left out to keep the example small.

To use backpropagation to train the network, we first give it the input x1 and compute the output o1:

h1 = sigmoid(x1 * w11)
h2 = sigmoid(x1 * w12)
o1 = sigmoid(h1 * w21 + h2 * w22)

Sigmoid is the activation function used by the neurons, which maps the neuron’s input to a value between 0 and 1.

We then calculate the error between the network’s output o1 and the desired output y:

error = 1/2 * (y - o1)^2

where the factor of 1/2 is included for convenience.
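As a minimal sketch, the forward pass and the error above can be written in plain Python. The input, target and starting weights are hypothetical values chosen only for illustration; they are not part of the example itself.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, w11, w12, w21, w22):
    """Forward pass through the hidden neurons h1, h2 and the output neuron o1."""
    h1 = sigmoid(x1 * w11)
    h2 = sigmoid(x1 * w12)
    o1 = sigmoid(h1 * w21 + h2 * w22)
    return h1, h2, o1

# Hypothetical input, target and weights (assumptions for this sketch)
x1, y = 1.0, 0.0
w11, w12, w21, w22 = 0.5, -0.5, 0.3, 0.8

h1, h2, o1 = forward(x1, w11, w12, w21, w22)
error = 0.5 * (y - o1) ** 2
print(f"h1={h1:.4f}  h2={h2:.4f}  o1={o1:.4f}  error={error:.4f}")
```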

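To update the weights in the network, we need to calculate the partial derivative of the error with respect to each weight using the chain rule of differentiation. Let z_o = h1 * w21 + h2 * w22 be the input to the output neuron, so that o1 = sigmoid(z_o), and recall that the derivative of sigmoid(z) is sigmoid(z) * (1 - sigmoid(z)). The weights w11 and w12 only affect the error through the hidden neurons, so their gradients pick up two extra factors:

d_error/d_w21 = d_error/d_o1 * d_o1/d_z_o * d_z_o/d_w21
              = (o1 - y) * o1 * (1 - o1) * h1
d_error/d_w22 = d_error/d_o1 * d_o1/d_z_o * d_z_o/d_w22
              = (o1 - y) * o1 * (1 - o1) * h2
d_error/d_w11 = d_error/d_o1 * d_o1/d_z_o * d_z_o/d_h1 * d_h1/d_w11
              = (o1 - y) * o1 * (1 - o1) * w21 * h1 * (1 - h1) * x1
d_error/d_w12 = d_error/d_o1 * d_o1/d_z_o * d_z_o/d_h2 * d_h2/d_w12
              = (o1 - y) * o1 * (1 - o1) * w22 * h2 * (1 - h2) * x1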
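One way to check chain-rule derivatives like these is to compare them with finite-difference estimates of the same gradients. The sketch below does this for the example network; the weights, input and target are hypothetical values. Note that the gradients for w11 and w12 pass through the hidden neurons, so they include the factors w21 * h1 * (1 - h1) and w22 * h2 * (1 - h2).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x1, y):
    """Squared error of the example network for w = [w11, w12, w21, w22]."""
    w11, w12, w21, w22 = w
    h1 = sigmoid(x1 * w11)
    h2 = sigmoid(x1 * w12)
    o1 = sigmoid(h1 * w21 + h2 * w22)
    return 0.5 * (y - o1) ** 2

def backprop_gradients(w, x1, y):
    """Analytic gradients from the chain rule, in the same order as w."""
    w11, w12, w21, w22 = w
    h1 = sigmoid(x1 * w11)
    h2 = sigmoid(x1 * w12)
    o1 = sigmoid(h1 * w21 + h2 * w22)
    delta_o = (o1 - y) * o1 * (1 - o1)         # d_error / d_z_o
    delta_h1 = delta_o * w21 * h1 * (1 - h1)   # error signal reaching h1
    delta_h2 = delta_o * w22 * h2 * (1 - h2)   # error signal reaching h2
    return [delta_h1 * x1, delta_h2 * x1, delta_o * h1, delta_o * h2]

def numerical_gradients(w, x1, y, eps=1e-6):
    """Central finite differences: (loss(w + eps) - loss(w - eps)) / (2 * eps)."""
    grads = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grads.append((loss(up, x1, y) - loss(down, x1, y)) / (2 * eps))
    return grads

# Hypothetical weights, input and target (assumptions for this sketch)
w = [0.5, -0.5, 0.3, 0.8]
x1, y = 1.0, 0.0

analytic = backprop_gradients(w, x1, y)
numeric = numerical_gradients(w, x1, y)
for name, a, n in zip(["w11", "w12", "w21", "w22"], analytic, numeric):
    print(f"d_error/d_{name}: backprop={a:.8f}  finite difference={n:.8f}")
```

If the two columns disagree beyond a small numerical tolerance, the derivation or its implementation contains a mistake.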

We can then update the weights using gradient descent:

w21 = w21 - learning_rate * d_error/d_w21
w22 = w22 - learning_rate * d_error/d_w22
w11 = w11 - learning_rate * d_error/d_w11
w12 = w12 - learning_rate * d_error/d_w12

where learning_rate is a hyperparameter that controls the size of the weight updates.

We repeat this process for multiple iterations, adjusting the weights each time until the network’s output reaches an acceptable level of accuracy.
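Putting the pieces together, a minimal training loop for this small network might look like the sketch below. The single training example, the starting weights, the learning rate of 0.5 and the 200 iterations are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical training example, starting weights and learning rate
x1, y = 1.0, 0.0
w11, w12, w21, w22 = 0.5, -0.5, 0.3, 0.8
learning_rate = 0.5

errors = []
for step in range(200):
    # Forward pass
    h1 = sigmoid(x1 * w11)
    h2 = sigmoid(x1 * w12)
    o1 = sigmoid(h1 * w21 + h2 * w22)
    errors.append(0.5 * (y - o1) ** 2)

    # Backward pass: all gradients are computed before any weight changes
    delta_o = (o1 - y) * o1 * (1 - o1)
    grad_w21 = delta_o * h1
    grad_w22 = delta_o * h2
    grad_w11 = delta_o * w21 * h1 * (1 - h1) * x1
    grad_w12 = delta_o * w22 * h2 * (1 - h2) * x1

    # Gradient descent update
    w21 -= learning_rate * grad_w21
    w22 -= learning_rate * grad_w22
    w11 -= learning_rate * grad_w11
    w12 -= learning_rate * grad_w12

print(f"error at first step: {errors[0]:.4f}, at last step: {errors[-1]:.4f}")
```

The gradients for the hidden-layer weights use the values of w21 and w22 from before the update, which is why every gradient is computed first and the weights are changed afterwards.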

Backpropagation vs gradient descent

Backpropagation and gradient descent are closely related but do different jobs. Backpropagation is used to calculate the gradients of the error with respect to the weights and biases in the neural network, and gradient descent is used to update the weights and biases based on those gradients.

The backpropagation algorithm uses the chain rule of differentiation to determine the error gradients for each weight and bias in the network. These gradients indicate how much the error changes as each weight and bias is adjusted and in what direction the change should be made to reduce the error.

Gradient descent

Once the gradients have been computed, the gradient descent algorithm is used to update the weights and biases in the network in the direction of steepest descent, i.e., the direction that reduces the error the most. This is achieved by subtracting the gradient multiplied by a learning rate from each weight and bias, as shown in the following update rule:

w_i = w_i - learning_rate * d_error/d_w_i
b_i = b_i - learning_rate * d_error/d_b_i

where w_i and b_i are the weight and bias of the i-th neuron in the network, learning_rate is a hyperparameter that controls the size of the weight and bias updates, and d_error/d_w_i and d_error/d_b_i are the gradients of the error with respect to w_i and b_i , respectively, computed by backpropagation.

The learning rate is a critical hyperparameter that determines the step size taken by the optimiser in the weight and bias space. A high learning rate can cause the optimiser to overshoot the optimal weights and biases, so the error oscillates or even diverges. In contrast, a low learning rate makes convergence slow and can leave the optimiser stuck in a suboptimal local minimum.
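The effect of the learning rate can be seen on a deliberately simple problem. The sketch below minimises the one-dimensional function f(w) = w^2, whose gradient is 2 * w and whose minimum is at w = 0; the function, the starting point and the three learning rates are assumptions chosen for illustration, not values from a real network.

```python
def gradient_descent(learning_rate, steps=50, w=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the final w."""
    for _ in range(steps):
        grad = 2.0 * w                   # derivative of w**2
        w = w - learning_rate * grad     # the update rule from above
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"learning_rate={lr}: w after 50 steps = {gradient_descent(lr):.6g}")
```

With the smallest rate, w has barely moved from its starting value after 50 steps; with the middle rate it is very close to the minimum; with the largest rate each step overshoots by more than the previous one, so w grows in magnitude instead of shrinking.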

In summary, backpropagation computes the gradients of the error with respect to the weights and biases in the neural network, and gradient descent uses these gradients to update the weights and biases in the direction of the steepest descent until the error is minimised.

How to use Keras to implement backpropagation

Keras is a high-level neural network library that makes building and training models straightforward. During training, Keras runs backpropagation and the optimiser’s gradient descent updates automatically.

To train a neural network in Keras using backpropagation and gradient descent, the following steps can be followed:

  1. Define the architecture of the neural network by using the Keras API to set the number of layers, the number of neurons in each layer, the activation functions, and other hyperparameters.
  2. Compile the model using a suitable optimiser, such as Stochastic Gradient Descent (SGD), Adam, or RMSprop, and a loss function that measures the error between predicted and true output.
  3. Fit the model to the training data by calling the fit() method and giving it the input data, the output data, the number of epochs, the batch size, and other training parameters. During training, the optimiser automatically computes the gradients of the loss function with respect to the weights and biases in the network using backpropagation and updates the weights and biases using gradient descent.
  4. Evaluate the performance of the trained model on a separate validation set using the evaluate() method, or make predictions on new, unseen data using the predict() method.

Code example

Here’s an example code snippet that demonstrates how to train a simple neural network in Keras using backpropagation and gradient descent:

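from keras.datasets import mnist
from keras.layers import Dense, Input
from keras.models import Sequential
from keras.optimizers import SGD
from keras.utils import to_categorical

# Assumption: the data is MNIST, which matches the 784 inputs and 10 classes used below
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(-1, 784).astype('float32') / 255
X_test = X_test.reshape(-1, 784).astype('float32') / 255
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

# Hold out the last 10,000 training examples for validation
X_val, y_val = X_train[-10000:], y_train[-10000:]
X_train, y_train = X_train[:-10000], y_train[:-10000]

# Define the architecture of the neural network
model = Sequential()
model.add(Input(shape=(784,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))

# Compile the model (recent Keras versions use learning_rate instead of the older lr argument)
sgd = SGD(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])

# Fit the model to the training data
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_val, y_val))

# Evaluate the performance of the trained model
loss, accuracy = model.evaluate(X_test, y_test)
print('Test loss:', loss)
print('Test accuracy:', accuracy)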

In this example, we define a neural network with two layers, a ReLU activation function for the hidden layer, and a softmax activation function for the output layer. We compile the model using SGD as the optimiser, categorical cross-entropy as the loss function, and accuracy as the evaluation metric. We then fit the model to the training data, specifying the number of epochs and batch size, and validating the model on a separate validation set. Finally, we evaluate the performance of the trained model on a test set.

Conclusion

Backpropagation is the algorithm used to train artificial neural networks in supervised learning. It computes the gradients of the error with respect to the weights and biases in the network. Gradient descent then uses these gradients to update the weights and biases.

Backpropagation is a powerful algorithm that trains many neural network architectures, such as feedforward neural networks, convolutional neural networks, and recurrent neural networks. It is a common algorithm in machine learning and has been a key part of the success of neural networks for solving hard machine learning problems.

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.
