Supervised Learning Explained And How To Choose The Right Model

Oct 20, 2022 | Machine Learning

Machine learning can be broadly classified into three major types: supervised, unsupervised, and reinforcement learning. We will go into supervised learning in more detail in this article and compare it to the other two types.

Supervised learning is a machine learning approach used for problems where the data consists of labelled examples: data points with features and a corresponding label.

Supervised learning is used where labelled data is available

The objective of supervised learning algorithms is to learn, from example input-output pairs, a function that maps inputs to outputs. This may sound rather technical, but bear with us. An example will quickly provide clarity.

A simple example of supervised learning

Let’s say we would like to find out the weight of a new client joining a gym so that we can adjust equipment accordingly during their first visit. However, the customer success team at the gym has found that clients feel uncomfortable disclosing their weight, and prospective clients tend to discontinue the onboarding process at this stage.

As a result, the gym no longer asks for this data. The trainer still needs to adjust the equipment based on the information provided, so he has decided to deduce a person’s weight from their height. He samples his current client base and plots the results. He can then draw a line of best fit, also known as a linear model. Now, whenever a new client joins, the trainer can “predict” or “estimate” their weight based on their height.

Fitted line plot for a linear model

Although this might sound like an easy example, all supervised machine learning techniques follow this same process. Firstly, we consider the problem. What output are we trying to achieve? Our output is going to be the weight. Then we start collecting data; in our case, personal data related to the current clients. Note that it is essential to use the existing client base and not just a larger population sample, as we want our data to represent the problem we are solving.

Note that the definition of “a representative sample” is often a contentious issue in every machine learning discussion.

Making predictions

We then determine the input features. In our example, this was the height, but the model could be further improved by adding other features already in the dataset. Age and gender come to mind. The features should generally contain enough information to predict the output accurately. Problems also occur when we add too many features, so more is not always better here.

Our final step is then to choose an algorithm and train the model. In our case, the “training” is finding the line of best fit through our scattered data points. The line of best fit, or the regression line, is our model.

To use the model, we can provide a height and get the corresponding weight on the regression line.
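To make this concrete, here is a minimal sketch of the trainer’s model in Python using scikit-learn’s LinearRegression. The heights and weights below are invented purely for illustration.

```python
# A minimal sketch of the gym example: fit a line of best fit and use it
# to estimate a new client's weight. All values are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: heights (cm) and weights (kg) of current clients
heights = np.array([[160], [165], [170], [175], [180], [185]])
weights = np.array([58, 63, 68, 74, 79, 85])

model = LinearRegression()
model.fit(heights, weights)  # "training" = finding the line of best fit

# Estimate the weight of a new client who is 172 cm tall
predicted = model.predict(np.array([[172]]))
print(f"Estimated weight: {predicted[0]:.1f} kg")
```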

Different types of supervised machine learning algorithms

The previous example involved hardly any actual training. More complicated models, however, are trained until they can recognise the underlying relationships and patterns between the input features and the output labels. This enables the model to produce accurate labelling results when presented with data it has never seen before.

Supervised learning algorithms can primarily generate two kinds of results: classification and regression.

Classification algorithms

A classification algorithm divides inputs into a certain number of categories or classes, based on the labelled data it was trained on. Classification algorithms can be used for binary classification, such as classifying customer feedback as positive or negative or classifying emails as spam or not spam. Another common classification problem solved by supervised learning is pattern recognition: identifying handwritten letters and digits, or classifying drugs into numerous categories.
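To illustrate, here is a minimal classification sketch using scikit-learn’s bundled handwritten-digits dataset, one of the recognition tasks mentioned above, paired with a support-vector classifier. This is just one reasonable model choice among many.

```python
# A minimal sketch of handwritten-digit classification with a
# support-vector classifier on scikit-learn's bundled digits dataset.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()  # 8x8 grayscale images of the digits 0-9

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

clf = SVC()
clf.fit(X_train, y_train)  # learn from labelled examples
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")  # unseen data
```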

Regression models

Regression tasks are distinct in that the model predicts a numerical output from the input data. These models can predict things like real estate prices based on zip codes, click rates on online ads based on the time of day, or a customer’s willingness to pay for a particular product based on their age.

The following are some algorithms frequently used in supervised learning programs:

  • Linear regression
  • Logistic regression
  • Decision trees
  • Naive Bayes
  • Support-vector machines (SVMs)
  • Nearest neighbour methods
  • Neural networks

When choosing a supervised learning algorithm, a few things should be considered. The first is the bias and variance within the algorithm, as there is a fine line between being flexible enough and too flexible. Another is the complexity of the model or function that the system is trying to learn. Finally, the data’s heterogeneity, accuracy, redundancy, and linearity should also be analysed before choosing an algorithm.

Choosing a machine learning algorithm

As we have seen, there are numerous supervised learning algorithms available, each with its advantages and disadvantages, and as such, there isn’t a single learning algorithm that solves all supervised learning problems. There are, however, four major points to consider when choosing an algorithm.

Bias-variance trade-off

We introduced the terms “bias” and “variance” earlier, and there is a trade-off between the two. Imagine that we have a variety of equally helpful training data sets at our disposal. A learning algorithm is biased for a given input if, when trained on each of these data sets, it consistently predicts the incorrect output. On the other hand, an algorithm has a high variance for a particular input if it predicts different output values when trained on different training sets.

The prediction error of a learned model is related to the sum of the bias and the variance of the learning algorithm, so there is always a trade-off between the two.

An algorithm with low bias is flexible and so can fit the data well. However, if it is too flexible, it will fit each training data set differently and hence have high variance. This trade-off matters for many problems, so most supervised learning algorithms let you adjust the balance between bias and variance.
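In practice, this adjustment often takes the form of a model hyperparameter. The sketch below, on synthetic data, uses the polynomial degree as exactly such a knob: degree 1 is rigid (higher bias), while a very high degree chases the noise in each training set (higher variance).

```python
# A sketch of a bias-variance "knob": the polynomial degree controls how
# flexible the model is. The data is synthetic and noisy on purpose.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)  # noisy sine curve

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    # A near-perfect training fit at degree 15 hints at high variance
    print(f"degree {degree}: training R^2 = {model.score(X, y):.2f}")
```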

Complexity and amount of training data

The second thing to consider is how much training data is available relative to the complexity of the “true” underlying function. If the true function is simple, an “inflexible” learning algorithm with high bias and low variance can learn it from a small amount of data. If the true function is complex, however, it can only be learned from a large amount of training data using a “flexible” learning algorithm with low bias and high variance.

The dimensionality of the input space

A third issue is the dimensionality of the input space. If the input feature vectors have many dimensions, learning the function can be difficult even when only a few of those features are actually needed, because the many “extra” dimensions can confuse the learning algorithm and increase its variance.

As a result, high-dimensional input data typically requires tuning the classifier to have high bias and low variance. In real-world applications, if you can manually remove unimportant features from the input data, the learned function’s accuracy will likely improve.

Numerous feature selection algorithms seek to separate the relevant features from the irrelevant ones.
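As a minimal sketch of this idea, scikit-learn’s SelectKBest scores each feature against the target and keeps only the top k. The dataset below is synthetic, with 5 informative features hidden among 20.

```python
# A minimal feature-selection sketch: keep the k features that score
# highest against the target, discarding the "extra" dimensions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: only 5 of the 20 features carry real signal
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)

print(X.shape, "->", X_reduced.shape)  # (200, 20) -> (200, 5)
print("kept feature indices:", selector.get_support(indices=True))
```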

Noise in the output values

The level of noise in the target variables (which represent the desired output values) is the fourth issue. If the desired output values are frequently incorrect (due to human or sensor error), the learning algorithm should avoid trying to find a function that matches the training examples exactly. When the data is fitted too precisely, overfitting occurs.

Overfitting occurs when you fit data too precisely.

Even with no measurement errors, overfitting can occur if the target function is too complex for your learning model. In this case, the portion of the target function that cannot be fitted “corrupts” your training data; this is referred to as deterministic noise. In practice, several techniques can reduce noise in the output values. For example, we can apply early stopping to prevent overfitting, or identify and remove noisy training examples before training the supervised learning algorithm. Removing potentially noisy training samples before training can produce a statistically significant reduction in generalisation error.
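As one concrete illustration of early stopping (a sketch, not the only way to do it), scikit-learn’s GradientBoostingClassifier can hold out part of the training data and stop adding boosting rounds once the validation score stops improving. The noisy labels below are simulated with the flip_y parameter.

```python
# A sketch of early stopping as a guard against fitting noisy labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data with ~20% of the labels deliberately flipped (noise)
X, y = make_classification(n_samples=500, flip_y=0.2, random_state=0)

clf = GradientBoostingClassifier(
    n_estimators=500,         # upper bound; early stopping usually ends sooner
    validation_fraction=0.2,  # hold out 20% of the training data
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
clf.fit(X, y)
print("boosting rounds actually used:", clf.n_estimators_)
```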

Other factors to consider

When selecting and using a learning algorithm, there are additional factors to take into account:

  • Data heterogeneity. Some algorithms are more straightforward to use than others when the feature vectors contain features of various types (discrete, discrete ordered, counts, and continuous values). Many algorithms, including support-vector machines, linear regression, logistic regression, neural networks, and nearest neighbour techniques, require the input features to be numerical and scaled to similar ranges (e.g., to the [-1,1] interval), as illustrated in the sketch after this list. Decision trees have the benefit of being able to handle heterogeneous data with ease.
  • Redundant data. Some learning algorithms, like linear regression, logistic regression, and distance-based methods, will perform poorly if the input features contain redundant information (for example, highly correlated features) because of numerical instabilities. Regularisation can frequently be imposed to address these issues.
  • Interactions and non-linearities. Algorithms based on linear functions (e.g., logistic regression, support vector machines, naive Bayes) and distance functions (e.g., nearest neighbour methods, support-vector machines with Gaussian kernels) typically perform well when each feature contributes independently to the output. However, because decision trees and neural networks are created specifically to find these interactions, they perform better when complex interactions exist between features. You can also use linear methods, but when doing so, the interactions must be manually specified.
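The sketch below illustrates the scaling point from the first bullet: a preprocessing step brings the numeric features onto similar ranges and one-hot encodes a categorical one before fitting a support-vector machine. The column names and values are hypothetical.

```python
# A sketch of preprocessing heterogeneous features for a scale-sensitive
# model (an SVM). Column names and data are invented for illustration.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

df = pd.DataFrame({
    "age": [25, 40, 31, 58],                           # continuous
    "visits": [3, 10, 7, 1],                           # count
    "plan": ["basic", "premium", "basic", "premium"],  # categorical
})
y = [0, 1, 1, 0]

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["age", "visits"]),  # similar numeric ranges
    ("encode", OneHotEncoder(), ["plan"]),           # categorical -> numeric
])
model = make_pipeline(preprocess, SVC())
model.fit(df, y)
print(model.predict(df.head(1)))
```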

When approaching a new application, you can compare various learning algorithms and experimentally determine which one performs best on the problem at hand. Tuning a learning algorithm’s performance can take a lot of time. Given fixed resources, it is often better to spend that time gathering more training data and more informative features than fine-tuning the learning algorithms.

Supervised vs unsupervised learning

What is unsupervised learning?

The algorithm’s learning process is the main distinction between supervised and unsupervised learning. In unsupervised learning, unlabelled data is provided to the algorithm as a training set. In contrast to supervised learning, there are no correct output values; the algorithm finds patterns and similarities within the data rather than relating it to some outside measurement. Unsupervised learning algorithms can be used to explore the data and discover intriguing or unexpected results that people weren’t looking for. Applications such as clustering (identifying groups within data) and association (discovering rules that describe the data) benefit greatly from unsupervised learning.
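For contrast with the supervised examples above, here is a minimal unsupervised sketch: k-means clustering receives no labels at all and discovers the groups itself. The data is synthetic.

```python
# A minimal clustering sketch: fit() is given no target labels at all.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)  # labels discarded

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)  # no y: the algorithm finds the groups on its own
print(kmeans.labels_[:10])  # cluster assignment for the first 10 points
```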

Benefits and limitations

Compared to the unsupervised method, supervised learning models have some benefits, but they also have drawbacks. Because humans have provided the basis for the decisions, supervised learning systems are more likely to reach conclusions that humans can understand.

However, supervised learning systems have trouble adjusting to new information. If a system trained with categories for cars and trucks were presented with a bicycle, the bicycle would have to be incorrectly placed in one of those categories. If the system were unsupervised, however, it might not be able to name the bicycle, but it would be able to recognise it as belonging to a different category.

To achieve acceptable performance levels, supervised learning also typically needs a significant amount of accurately labelled data. Labelled data is, unfortunately, not always available. This issue is not present in unsupervised learning, which can also use unlabeled data.

Semi-supervised learning

Semi-supervised learning might be the best learning strategy when supervised learning is required, but we lack high-quality data. This learning model, which sits between supervised and unsupervised learning, accepts partially labelled data. This means that most of the data is unlabeled.

Similar to unsupervised learning, semi-supervised learning first finds the correlations between the data points and then uses the labelled data to label those points. The entire model is then trained using the newly applied labels.

Given the limited amount of labelled data, supervised learning algorithms would not be able to perform as intended in many real-world problems. However, semi-supervised learning has been shown to produce accurate results. As a general rule, semi-supervised learning can be applied to data sets with at least 25% labelled data.
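A minimal sketch of this idea uses scikit-learn’s SelfTrainingClassifier: unlabelled points are marked with -1, and the model iteratively adds its own most confident predictions as labels. Here, roughly 25% of the points keep their labels, mirroring the rule of thumb above; the data is synthetic.

```python
# A sketch of semi-supervised learning: train with mostly unlabelled data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = make_classification(n_samples=400, random_state=0)

# Keep labels for roughly 25% of the points; mark the rest as unlabelled (-1)
rng = np.random.default_rng(0)
y_partial = np.where(rng.uniform(size=400) < 0.25, y, -1)

clf = SelfTrainingClassifier(LogisticRegression())
clf.fit(X, y_partial)  # self-labels the -1 points during training
print(f"accuracy against the true labels: {accuracy_score(y, clf.predict(X)):.2f}")
```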

How we used supervised machine learning

Natural language processing (NLP) lends itself well to many supervised learning problems, whether extracting company names, financial data points, or specific relevant paragraphs from much larger documents. Supervised algorithms can accurately extract this data as long as we provide some examples to the algorithm.

A concrete example would be spam filtering in an email inbox. As we go through our inboxes, we label some emails as spam and others as relevant. This labelled data can then be used to train a spam classification algorithm that predicts whether new incoming emails are spam. The output can then direct the incoming mail to the correct folder.

An email spam classification algorithm is a typical supervised learning problem.
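As a hedged sketch of such a filter, the pipeline below pairs a bag-of-words vectoriser with a naive Bayes classifier. The example emails and labels are invented; a production filter would train on far more data.

```python
# A minimal spam-filter sketch: bag-of-words features + naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "WIN a FREE prize, click now",
    "Meeting moved to 3pm tomorrow",
    "Cheap loans, limited offer, act now",
    "Here are the notes from today's call",
]
labels = ["spam", "relevant", "spam", "relevant"]

spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(emails, labels)  # learn word patterns from labelled emails

print(spam_filter.predict(["Free offer, click to win"]))  # -> ['spam']
```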

Classifying spam is a simple example, but in the real world, we often encounter more complicated scenarios. Are advertisements and promotions spam? What is relevant for one person might be irrelevant for the next. So, even simple examples quickly become complicated when implemented in the real world. The algorithm not only needs to be trained for every person but also continuously over time: what we find relevant today might become irrelevant to us in a month.

So even though supervised machine learning is widely used, a lot goes into the design, selection, and maintenance process. Understanding the trade-offs in this process will help you achieve better outcomes and manage expectations accordingly.

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.
