What Is Overfitting & Underfitting [How To Detect & Overcome In Python]

by Neri Van Otten | Feb 28, 2023 | Machine Learning, Natural Language Processing

Illustrated examples of overfitting and underfitting, as well as how to detect & overcome them

Overfitting and underfitting are two common problems in machine learning where the model becomes too complex or too simple for the given dataset. This article illustrates both problems with simple examples and elaborates on ways to detect and overcome both challenges.

Finding the right balance between overfitting and underfitting is crucial for building a good machine learning model that can generalise well to new data.

What is overfitting in machine learning?

Overfitting is a common problem in machine learning where a model performs exceptionally well on the training data but poorly on new, unseen data. Overfitting occurs when a model becomes too complex and learns noise or irrelevant patterns in the training data rather than the true underlying patterns that generalise well to new data.


In other words, overfitting happens when a model memorises the training data instead of learning the underlying relationship between the input features and the output variable. This can result in a model that performs very well on the training data but poorly on new, unseen data because it has become too specialised for the training data.

Overfitting can be addressed through regularisation, cross-validation, and early stopping.

This can help the model better generalise to new data by reducing its complexity and preventing it from memorising noise or irrelevant patterns in the training data.
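As a small, concrete illustration, here is a minimal sketch of L2 regularisation using scikit-learn's Ridge regression (scikit-learn is assumed to be installed; the data and the alpha value are purely illustrative, not tuned):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative toy data: a noisy linear trend
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(30, 1))
y = 2 * X.ravel() + rng.normal(scale=0.2, size=30)

# A high-degree polynomial with no penalty is free to chase the noise;
# the Ridge (L2) penalty shrinks the coefficients and smooths the fit
unregularised = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
regularised = make_pipeline(PolynomialFeatures(degree=9), Ridge(alpha=1.0))

for name, model in [("no penalty", unregularised), ("ridge", regularised)]:
    model.fit(X, y)
    print(name, "training R^2:", round(model.score(X, y), 3))

The exact numbers do not matter; the point is the mechanism: the penalty constrains the weights, trading a little training accuracy for better generalisation.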

What is underfitting in machine learning?

Underfitting is a common problem in machine learning. It happens when a model is too simple to capture the underlying patterns in the data. As a result, the model performs poorly both on the training data and on new data it has never seen before.

In other words, underfitting occurs when a model is not complex enough to learn the underlying relationship between the input features and the output variable. This can result in a model that performs poorly on the training data and on new, unseen data because it has not learned enough from the data to generalise well.

Underfitting can be addressed through techniques such as increasing the complexity of the model, adding more features or increasing the number of hidden layers in a neural network, and increasing the training time.

However, finding a balance between model complexity and generalisation is vital, as overfitting can also be a problem when the model becomes too complex. Therefore, using techniques such as cross-validation to evaluate the model’s performance on the training data and new, unseen data and choosing a model that performs well on both is important.
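For example, a minimal sketch of k-fold cross-validation with scikit-learn (assumed installed; the dataset is synthetic and purely illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: a linear trend with noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + rng.normal(scale=2.0, size=100)

# 5-fold CV: each fold is held out once while the model trains on the rest;
# a large gap between training and validation scores suggests overfitting
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print("mean validation R^2:", round(scores.mean(), 3))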

Overfitting and underfitting simple example

Suppose you have a dataset with one input feature (e.g., the number of hours studied) and one output variable (e.g., the exam score) that looks like this:

Hours Studied   Exam Score
1               20
1.5             20
2               60
3               75
4               85
5               90
6               95

Let’s say you want to teach a machine learning model to predict the score on an exam based on how many hours you studied.

  • Suppose you train a linear regression model on this dataset. In this case, the model may underfit the data because the relationship between the input feature and the output variable may not be linear. In other words, a straight line may not capture the underlying patterns in the data. The model may predict poorly on both the training data and new, unseen data, as shown in the following plot:
[Plot: polynomial fit of degree 1, an example of underfitting]

If you want to recreate the plots with your own data, you can adapt the Python code below, which uses NumPy and Matplotlib.

import numpy as np
import matplotlib.pyplot as plt

# The hours-studied / exam-score data from the table above
x = [1, 1.5, 2, 3, 4, 5, 6]
y = [20, 20, 60, 75, 85, 90, 95]

# Fit a polynomial of degree 1 (a straight line); change the 1 to a 2
# or a 5 to generate the other plots further down
mymodel = np.poly1d(np.polyfit(x, y, 1))

# Evaluate the fitted polynomial on a fine grid to draw a smooth curve
myline = np.linspace(1, 6, 100)

plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()
  • Suppose you train a polynomial regression model on this dataset with a degree of 5. In this case, the model may overfit the data because it is flexible enough to pass through almost every training point, fitting noise rather than the underlying trend. The model may perform very well on the training data but poorly on new, unseen data, as shown in the following plot:
[Plot: polynomial fit of degree 5, an example of overfitting]

If we used this model to determine the optimal study time, we would incorrectly conclude that 5.5 hours of study is better than 6 or more.

This shows the danger of overfitting. The model doesn’t represent the actual pattern you are trying to predict.

  • If you train a polynomial regression model on this dataset with a degree of 2, the model may fit the data well and generalise well to new, unseen data, as shown in the following plot:
[Plot: polynomial fit of degree 2, an example of a model that generalises well]

In short, underfitting happens when the model is not complex enough to capture the underlying patterns in the data.

Conversely, overfitting occurs when the model is too complex and starts to learn noise or irrelevant patterns in the data.

A good fit is when the model can capture the true patterns in the data without overfitting or underfitting.
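You can make this comparison concrete by measuring each fit's error in code. The sketch below reuses the study-hours data; the extra point (7 hours, score 97) is hypothetical, added only to stand in for unseen data.

import numpy as np

x = np.array([1, 1.5, 2, 3, 4, 5, 6])
y = np.array([20, 20, 60, 75, 85, 90, 95])
x_new, y_new = 7, 97  # hypothetical unseen example

for degree in (1, 2, 5):
    model = np.poly1d(np.polyfit(x, y, degree))
    train_mse = np.mean((model(x) - y) ** 2)
    unseen_error = abs(model(x_new) - y_new)
    print(f"degree {degree}: training MSE {train_mse:7.1f}, "
          f"error at x=7 {unseen_error:7.1f}")

You should see the degree-1 fit with the largest training error (underfitting), the degree-5 fit with a tiny training error but a wildly wrong prediction at x=7 (overfitting), and the degree-2 fit striking a reasonable balance.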

Examples of underfitting and overfitting in real applications

Here are some real-world examples of underfitting and overfitting.

Underfitting

Suppose you have a dataset of images of handwritten digits, and you want to train a machine learning model to recognise the digits. If you use a simple model, such as logistic regression, the model may underfit the data because it is not complex enough to capture the complex patterns in the images. As a result, the model may perform poorly on both the training data and new, unseen data.

To address underfitting, you can use a more complex model, such as a convolutional neural network, which is better suited to capturing complex patterns in images.
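As a rough illustration of what such a model looks like, here is a minimal sketch of a small convolutional network in Keras (this assumes TensorFlow is installed; the layer sizes are illustrative rather than tuned).

import tensorflow as tf

# A small convolutional network for 28x28 grayscale digit images (e.g. MNIST);
# the conv/pool layers capture local spatial patterns that a plain logistic
# regression on raw pixels cannot
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),  # one output per digit
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()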

Overfitting

Let’s say you have a set of customer data, like their age, income, gender, and buying habits, and you want to train a machine learning model to predict which customers are likely to make a purchase. Suppose you use a complex model, such as a deep neural network with many layers. In that case, the model may overfit the data because it is far more complex than the problem requires and starts to learn noise or irrelevant patterns in the data. As a result, the model may perform very well on the training data but poorly on new, unseen data.

To address overfitting, you can use regularisation techniques, such as L1 or L2 regularisation, which add a penalty term to the loss function to discourage the model from fitting the training data too closely.
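In scikit-learn (assumed installed), for example, the penalty strength of a logistic regression is controlled by the parameter C, the inverse of the regularisation strength, so smaller values of C mean a stronger penalty. A minimal sketch on synthetic, purely illustrative data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for C in (100.0, 1.0, 0.01):  # weak to strong L2 regularisation
    clf = LogisticRegression(penalty="l2", C=C, max_iter=1000)
    clf.fit(X_train, y_train)
    print(f"C={C}: train {clf.score(X_train, y_train):.2f}, "
          f"test {clf.score(X_test, y_test):.2f}")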

How to check if the model is overfitting or underfitting

There are several ways to detect overfitting or underfitting in a machine learning model:

  1. Plot the learning curves: Learning curves show the model’s performance on the training and validation data over time as the model is being trained. If the model is overfitting, you will see the training error continue to decrease while the validation error starts to increase after a certain point. This indicates that the model is beginning to memorise the training data and is no longer generalising well to new, unseen data (see the sketch after this list).
  2. Evaluate the model on a holdout set: A holdout set is a subset of the data that is not used during training but is used to evaluate the model after training. If the model performs well on the training data but poorly on the holdout set, it may be overfitting the training data.
  3. Use cross-validation: Cross-validation is a technique where the data is divided into k folds, and the model is trained and evaluated on each fold. If the model performs well on the training data but poorly on the validation folds, it may be overfitting.
  4. Regularise the model: Regularisation adds a penalty term to the loss function to stop the model from fitting the training data too closely. By tuning the regularisation parameter, you can control the model’s complexity and keep it from becoming either too complex or too simple.
  5. Use simpler models: If your complex model is overfitting the data, you can switch to simpler models that are less prone to overfitting, such as linear models or decision trees with limited depth.
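The first of these techniques is easy to put into practice with scikit-learn (assumed installed). The sketch below plots learning curves for an intentionally deep decision tree on synthetic data, so the gap between the curves is clearly visible:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=200)

# Score the model on training and validation folds at growing training sizes
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeRegressor(max_depth=10), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 8))

# A large, persistent gap between the two curves is a classic sign of overfitting
plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, val_scores.mean(axis=1), label="validation score")
plt.xlabel("training set size")
plt.ylabel("R^2 score")
plt.legend()
plt.show()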

In general, it’s crucial to monitor the model’s performance during training and evaluation and to be aware of the trade-off between model complexity and generalisation performance.

Underfitting and overfitting in NLP

Underfitting and overfitting are common problems in tasks like text classification, sentiment analysis, and machine translation using Natural Language Processing (NLP). Here are some examples of underfitting and overfitting in NLP:

  • Underfitting in NLP: If your NLP model is too simple, it might not fit the data well enough and will miss the essential patterns in the text. For example, suppose you are building a sentiment analysis model and only use simple bag-of-words features without context or semantic information. In that case, the model may perform poorly on both the training and validation data (a minimal sketch of such a model follows this list).
  • Overfitting in NLP: If your NLP model is too complex and has too many parameters, it may overfit the data and start to learn noise or irrelevant patterns in the text. For example, suppose you build a machine translation model and use a very large neural network with too many layers. In that case, the model may overfit the training data and perform poorly on new, unseen data.
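For the underfitting case, a bare-bones bag-of-words sentiment classifier might look like the sketch below (scikit-learn assumed installed; the handful of labelled sentences is hypothetical):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy sentiment data: 1 = positive, 0 = negative
texts = ["great film, loved it", "terrible and boring",
         "what a wonderful story", "worst movie ever",
         "really enjoyable", "not good at all"]
labels = [1, 0, 1, 0, 1, 0]

# Bag-of-words ignores word order and context, so a model like this can
# underfit nuanced text (negation, sarcasm, long-range dependencies)
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["a wonderful film", "boring and terrible"]))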

You can detect underfitting and overfitting in NLP models with the same methods used for other machine learning tasks, such as learning curves, holdout sets, and cross-validation.

Conclusion

Overfitting and underfitting are common challenges in machine learning. Overfitting occurs when a model is too complex and learns noise or irrelevant patterns in the data.

Underfitting, by contrast, occurs when a model is too simple and cannot capture the underlying patterns in the data. To detect overfitting and underfitting, you can use techniques such as plotting learning curves, evaluating the model on a holdout set, and using cross-validation.

To address overfitting and underfitting, you can use regularisation, simpler or more complex models, or add more input features. Ultimately, the goal is to find the right balance between model complexity and data fit to achieve optimal model performance on new, unseen data.

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.
