Illustrated examples of overfitting and underfitting, as well as how to detect & overcome them
Overfitting and underfitting are two common problems in machine learning that arise when a model becomes too complex or too simple for the given dataset. This article illustrates both problems with simple examples and describes ways to detect and overcome them.
Finding the right balance between overfitting and underfitting is crucial for building a good machine learning model that can generalise well to new data.
Overfitting is a common problem in machine learning where a model performs exceptionally well on the training data but poorly on new, unseen data. Overfitting occurs when a model becomes too complex and learns noise or irrelevant patterns in the training data rather than the true underlying patterns that generalise well to new data.
In other words, overfitting happens when a model memorises the training data instead of learning the underlying relationship between the input features and the output variable. This can result in a model that performs very well on the training data but poorly on new, unseen data because it has become too specialised for the training data.
Overfitting can be addressed through regularisation, cross-validation, and early stopping.
This can help the model better generalise to new data by reducing its complexity and preventing it from memorising noise or irrelevant patterns in the training data.
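For instance, early stopping halts training as soon as performance on a validation set stops improving. Here is a minimal sketch of the idea, assuming scikit-learn is available and using hypothetical synthetic data (the patience value and data are illustrative, not from the article):

import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical noisy linear data for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + rng.normal(0, 2, size=200)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = SGDRegressor(random_state=0)
best_error = float("inf")
patience, stalled = 5, 0  # stop after 5 epochs without improvement

for epoch in range(200):
    model.partial_fit(X_train, y_train)  # one pass over the training data
    val_error = mean_squared_error(y_val, model.predict(X_val))
    if val_error < best_error:
        best_error, stalled = val_error, 0
    else:
        stalled += 1
    if stalled >= patience:
        print(f"Stopping early at epoch {epoch}")
        break

The validation set acts as a stand-in for unseen data, so training stops before the model starts fitting noise in the training set.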
Underfitting is a common problem in machine learning. It happens when a model is too simple to capture the underlying patterns in the data. As a result, the model performs poorly both on the training data and on new data it has never seen before.
In other words, underfitting occurs when a model is not complex enough to learn the underlying relationship between the input features and the output variable. This can result in a model that performs poorly on both the training data and new, unseen data because it has not learned enough from the data to generalise well.
Underfitting can be addressed through techniques such as increasing the complexity of the model, adding more features or increasing the number of hidden layers in a neural network, and increasing the training time.
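As an illustrative sketch of increasing model capacity with scikit-learn (the layer sizes and toy dataset here are hypothetical, not a recommendation):

from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# Hypothetical toy dataset with a curved, non-linear decision boundary
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

# A tiny network that may underfit the curved boundary
simple = MLPClassifier(hidden_layer_sizes=(2,), max_iter=2000, random_state=0)
# A larger network with more hidden layers and units, i.e. more capacity
larger = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)

for name, model in [("simple", simple), ("larger", larger)]:
    model.fit(X, y)
    print(name, model.score(X, y))  # accuracy on the training data

Comparing the training scores shows that the small network cannot even fit the training data well, which is the hallmark of underfitting.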
However, finding a balance between model complexity and generalisation is vital, as overfitting can also become a problem when the model grows too complex. Therefore, it is important to use techniques such as cross-validation to evaluate the model’s performance on both the training data and new, unseen data, and to choose a model that performs well on both.
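A quick way to do this is with scikit-learn’s cross-validation helpers; a minimal sketch on hypothetical data:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Hypothetical classification data for illustration
X, y = make_classification(n_samples=500, random_state=0)

model = DecisionTreeClassifier(max_depth=3, random_state=0)
# Each of the 5 folds is held out once and scored on data the model never saw
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())

If the cross-validation score is far below the score on the training data, the model is likely overfitting; if both are low, it is likely underfitting.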
Suppose you have a dataset with one input feature (e.g., the number of hours studied) and one output variable (e.g., the exam score) that looks like this:
| Hours Studied | Exam Score |
|---------------|------------|
| 1             | 20         |
| 1.5           | 20         |
| 2             | 60         |
| 3             | 75         |
| 4             | 85         |
| 5             | 90         |
| 6             | 95         |
Let’s say you want to teach a machine learning model to predict the score on an exam based on how many hours you studied.
Example of underfitting
If you want to recreate the plot with your data, you can adapt the Python code below that uses Matplotlib to generate the plots quickly.
import numpy
import matplotlib.pyplot as plt

# Hours studied and the corresponding exam scores from the table above
x = [1, 1.5, 2, 3, 4, 5, 6]
y = [20, 20, 60, 75, 85, 90, 95]

# Fit a polynomial of degree 1 (a straight line).
# Change the degree to 2 or 5 to generate the other plots further down.
mymodel = numpy.poly1d(numpy.polyfit(x, y, 1))

# Evaluate the fitted polynomial on a fine grid so the curve plots smoothly
myline = numpy.linspace(1, 6, 100)

plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
Example of overfitting
If we used this overfitted model to determine the optimum study time, we would incorrectly conclude that 5.5 hours of study is better than 6 or more.
This shows the danger of overfitting. The model doesn’t represent the actual pattern you are trying to predict.
Example of a function generalising well
In short, underfitting happens when the model is too simple to capture the underlying patterns in the data.
Conversely, overfitting occurs when the model is too complex and starts to learn noise or irrelevant patterns in the data.
A good fit is when the model can capture the true patterns in the data without overfitting or underfitting.
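One way to make the three plots concrete is to compare how well each polynomial fits the seven data points, for example with the R² score; a sketch, where the degree values mirror the plotting code above:

import numpy
from sklearn.metrics import r2_score

x = [1, 1.5, 2, 3, 4, 5, 6]
y = [20, 20, 60, 75, 85, 90, 95]

# Fit polynomials of degree 1 (underfit), 2 (good fit), and 5 (overfit)
for degree in (1, 2, 5):
    model = numpy.poly1d(numpy.polyfit(x, y, degree))
    print(degree, round(r2_score(y, model(numpy.array(x))), 3))

The degree-5 fit scores almost perfectly on these seven points, yet as the plots show, it wiggles between them: a high training score alone says nothing about how well the model generalises.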
Here are some real-world examples of underfitting and overfitting.
Suppose you have a dataset of images of handwritten digits, and you want to train a machine learning model to recognise the numbers. If you use a simple model, such as logistic regression, the model may underfit the data because it is too simple to capture the complex patterns in the images. As a result, the model may perform poorly on both the training data and new, unseen data.
To address underfitting, you can use more complex models, such as convolutional neural networks, which are better suited to capturing complex patterns in images.
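A minimal sketch of such a model, assuming TensorFlow/Keras is installed (the layer sizes are illustrative, not tuned):

import tensorflow as tf

# Load the MNIST handwritten-digit images (28x28 grayscale)
(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()
X_train = X_train[..., None] / 255.0  # add a channel axis and scale to [0, 1]
X_test = X_test[..., None] / 255.0

# A small convolutional network: convolutions detect local image patterns
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=1, validation_data=(X_test, y_test))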
Let’s say you have a set of customer data, such as age, income, gender, and buying habits, and you want to train a machine learning model to predict which customers are likely to make a purchase. Suppose you use a complex model, such as a deep neural network with many layers. In that case, the model may overfit the data because it is too complex and starts to learn noise or irrelevant patterns in the data. As a result, the model may perform very well on the training data but poorly on new, unseen data.
To address overfitting, you can use regularisation techniques, such as L1 or L2 regularisation, which add a penalty term to the loss function to keep the model’s weights small and its complexity in check.
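For example, with scikit-learn you can add an L2 penalty to the degree-5 polynomial fit from the study-time example; a sketch, where the alpha value is hypothetical:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[1], [1.5], [2], [3], [4], [5], [6]])  # hours studied
y = [20, 20, 60, 75, 85, 90, 95]                     # exam scores

# Degree-5 polynomial features with an L2 penalty on the coefficients;
# alpha controls how strongly large coefficients are punished
model = make_pipeline(PolynomialFeatures(degree=5), Ridge(alpha=1.0))
model.fit(X, y)
print(model.predict(np.array([[5.5]])))

Swapping Ridge for Lasso gives L1 regularisation, which can shrink some coefficients exactly to zero.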
There are several ways to detect over- or under-fitting in a machine learning model:

- Plot learning curves of the training and validation error: a large, persistent gap between the two curves suggests overfitting, while two curves that plateau at a poor score suggest underfitting (see the sketch after this list).
- Evaluate the model on a holdout set that was not used during training and compare the result with the training score.
- Use cross-validation to get a more robust estimate of how the model performs on unseen data.
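Here is a minimal sketch of the learning-curve approach with scikit-learn on hypothetical data:

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Hypothetical classification data for illustration
X, y = make_classification(n_samples=500, random_state=0)

# Score the model on growing subsets of the training data, with 5-fold CV
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y, cv=5)

plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, val_scores.mean(axis=1), label="validation score")
plt.xlabel("training set size")
plt.ylabel("accuracy")
plt.legend()
plt.show()

An unpruned decision tree typically scores near 1.0 on the training data while the validation curve lags behind, which is the classic overfitting signature.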
In general, it’s crucial to monitor the model’s performance during training and evaluation and to be aware of the trade-off between model complexity and generalisation performance.
Underfitting and overfitting are common problems in Natural Language Processing (NLP) tasks like text classification, sentiment analysis, and machine translation. For example, a simple bag-of-words classifier may underfit subtle sentiment cues, while a very large model fine-tuned on a small text dataset may overfit it and fail on new text.
You can detect underfitting and overfitting in NLP models using the same methods as for other machine learning tasks.
Overfitting and underfitting are common challenges in machine learning. Overfitting occurs when a model is too complex and learns noise or irrelevant patterns in the data, while underfitting occurs when a model is too simple and cannot capture the underlying patterns in the data. To detect overfitting and underfitting, you can use techniques such as plotting learning curves, evaluating the model on a holdout set, and using cross-validation.
To address overfitting and underfitting, you can use regularisation, switch to simpler or more complex models, or add more input features. Ultimately, the goal is to find the right balance between model complexity and data fit to achieve optimal model performance on new, unseen data.