What are bias, variance and the bias-variance trade-off?
The bias-variance trade-off is a fundamental concept in supervised machine learning: it describes the tension between the error a model makes due to bias and the error it makes due to variance.
What is bias?
Bias is the systematic difference between a model’s predictions and the actual values; it measures how far the model’s average prediction deviates from the correct values. High bias indicates that the model is underfitting the data, meaning it cannot capture the underlying patterns.
What is variance?
Conversely, variance represents the variability of the model’s predictions for different training datasets. High variance indicates that the model is overfitting the data, meaning it memorises the training data and cannot generalise well to new, unseen data.
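We can make both definitions concrete with a small experiment. The sketch below (the data-generating function, sample sizes, and degrees are made up for illustration) repeatedly draws training sets from the same noisy process, refits a simple and a complex model each time, and estimates the squared bias and variance of their predictions at a single test point:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def true_f(x):
    return np.sin(x)

x0 = np.array([[1.5]])  # the test point where we measure bias and variance
preds_simple, preds_complex = [], []
for _ in range(200):
    # Draw a fresh training set from the same noisy process
    x = rng.uniform(-3, 3, size=(50, 1))
    y = true_f(x).ravel() + rng.normal(scale=0.3, size=50)
    # Simple model: straight line (high bias, low variance)
    preds_simple.append(LinearRegression().fit(x, y).predict(x0)[0])
    # Complex model: degree-9 polynomial (low bias, high variance)
    poly = PolynomialFeatures(degree=9)
    model = LinearRegression().fit(poly.fit_transform(x), y)
    preds_complex.append(model.predict(poly.transform(x0))[0])

for name, preds in [("linear", preds_simple), ("degree-9", preds_complex)]:
    preds = np.array(preds)
    bias_sq = (preds.mean() - true_f(1.5)) ** 2
    print(f"{name}: bias^2 = {bias_sq:.4f}, variance = {preds.var():.4f}")

The straight line is stable across training sets but consistently misses the curve, while the degree-9 polynomial tracks the curve on average but swings from one training set to the next.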
The trade-off
A good model aims to find a balance between bias and variance. A too-simple model will have a high bias and low variance, while a too-complex model will have a low bias and high variance.
To achieve a good balance between bias and variance, various techniques such as cross-validation, regularisation, and ensemble methods are used to control the complexity of the model and reduce the error due to bias and variance.
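Here is a minimal, illustrative sketch of the ensemble idea (the data and parameters are made up for this example): bagging averages many high-variance decision trees, which lowers the variance of the combined model without raising its bias much.

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score

# Illustrative noisy, non-linear data
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=300)

# A single fully grown tree memorises the noise (high variance)
single_tree = DecisionTreeRegressor(random_state=0)
# Averaging 100 trees fitted on bootstrap samples reduces that variance
bagged_trees = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100, random_state=0)

for name, model in [("single tree", single_tree), ("bagged trees", bagged_trees)]:
    mse = -cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"{name}: cross-validated MSE = {mse:.3f}")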
High bias, low variance
A model that has high bias and low variance is said to be underfitting the data. This means that the model is too simple and cannot capture the underlying patterns in the data. As a result, the model has high training and testing errors. In other words, the model cannot fit the training data well or generalise to new, unseen data.
Increasing the complexity of the model, adding more features, or using a more flexible model may help reduce the error due to bias and improve the model’s performance. Another approach to reducing bias is to improve the quality or quantity of the training data, for example, by collecting more data or improving the data preprocessing steps. However, it’s essential to remember that increasing the model’s complexity may also increase the error due to variance, so finding the right balance is crucial.
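Below is a minimal sketch of the first remedy on made-up data: replacing a linear model with a more flexible one reduces the bias on data with a non-linear pattern.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Illustrative data with a non-linear pattern
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(2 * x).ravel() + rng.normal(scale=0.2, size=300)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

# The straight line underfits; the shallow tree can follow the curve
for name, model in [("linear (high bias)", LinearRegression()),
                    ("tree, max_depth=5 (lower bias)", DecisionTreeRegressor(max_depth=5, random_state=1))]:
    model.fit(x_train, y_train)
    print(name, "- test MSE:", round(mean_squared_error(y_test, model.predict(x_test)), 3))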
Low bias, high variance
A model with low bias and high variance is said to be overfitting the data. This means the model is too complex and memorises the training data instead of learning the underlying patterns. As a result, the model has a low training error but a high testing error. In other words, the model can fit the training data well but not generalise to new, unseen data.
In such cases, reducing the complexity of the model, removing irrelevant features, or using regularisation techniques such as L1/L2 regularisation or dropout may help reduce the error due to variance and improve the model’s performance. Another approach is to increase the training data or use data augmentation techniques to generate more training samples.
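As a hedged sketch of the L2 (ridge) remedy on made-up data: increasing the regularisation strength alpha shrinks the coefficients of an over-complex polynomial model and lowers its variance.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Small, noisy dataset: ideal conditions for overfitting
rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.3, size=60)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=2)

# A degree-15 polynomial overfits unless the ridge penalty reins it in
for alpha in [1e-6, 1.0, 10.0]:
    model = make_pipeline(PolynomialFeatures(degree=15), StandardScaler(), Ridge(alpha=alpha))
    model.fit(x_train, y_train)
    print(f"alpha={alpha}: test MSE = {mean_squared_error(y_test, model.predict(x_test)):.3f}")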
It’s important to note that reducing the error due to variance may increase the error due to bias, so finding the right balance between bias and variance is essential to develop a good model. One way to achieve this balance is by using cross-validation techniques to evaluate the model’s performance on multiple data splits and choosing the model that performs well on both the training and testing data.
Bias-variance trade-off in machine learning
In machine learning, the bias-variance trade-off refers to the relationship between the complexity of a model and its ability to fit the data. A model with high bias is too simple and cannot capture the genuine relationship between the input and output variables. On the other hand, a model with high variance is too complex and captures the random noise in the data, resulting in poor generalisation of new data.
Machine learning aims to develop a model that can generalise well to new, unseen data. To achieve this, we need to find the balance between bias and variance described above.
The bias-variance trade-off is important for all machine learning models, including trading algorithms.
We must evaluate the model’s performance on the training and testing data to find the optimal balance between bias and variance. If the model has a high bias, we need to increase its complexity by adding more features, using a more complex algorithm, or increasing the number of iterations. Conversely, if the model has high variance, we must reduce its complexity by removing irrelevant features, using regularisation techniques, or increasing the training data size.
Cross-validation is a helpful technique to evaluate a model’s performance and select the best model to balance bias and variance. By splitting the data into training, validation, and testing sets, we can evaluate the model’s performance on multiple data splits and choose the model that performs well on both the training and testing data.
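A minimal sketch of that selection step with scikit-learn’s cross_val_score (the data and candidate degrees are illustrative):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Illustrative quadratic data
rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, size=(200, 1))
y = x.ravel() ** 2 + rng.normal(scale=0.5, size=200)

# Score each candidate complexity on 5-fold cross-validation
for degree in [1, 2, 6]:
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    mse = -cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error").mean()
    print(f"degree {degree}: mean cross-validated MSE = {mse:.3f}")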
Bias-variance trade-off example
Let’s consider an example of the bias-variance trade-off in the context of polynomial regression. Suppose we have a set of data points that follow a non-linear relationship and want to fit a model that can capture this relationship. Then, we can use polynomial regression to fit a polynomial function to the data.
If we fit a linear function (i.e., a straight line) to the data, the model will have high bias, as it is too simple to capture the non-linear relationship between the input and output variables. As a result, the model will have high training and testing errors, indicating that it is underfitting the data.
On the other hand, if we fit a high-degree polynomial function to the data, the model will have low bias but high variance. As a result, the model will have a low training error, as it can fit the data very well, but it will have a high testing error, as it overfits the data and fails to generalise to new, unseen data.
We can use cross-validation to evaluate the model’s performance on multiple data splits to find the optimal balance between bias and variance. We can train models with different degrees of polynomial functions and select the model that achieves the best balance between bias and variance, i.e., the model that has the lowest testing error.
For example, suppose a quadratic polynomial function (i.e., a second-degree polynomial) achieves the best balance between bias and variance. This model is more complex than a linear model, but it is still simple enough to avoid overfitting the data. By finding the right balance between bias and variance, we can develop a model that captures the non-linear relationship between the input and output variables and generalises well to new, unseen data.
Bias-variance trade-off Python code example
Here’s an example code snippet in Python to illustrate the bias-variance trade-off using polynomial regression:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate some synthetic data with a non-linear relationship
np.random.seed(0)
x = np.linspace(-5, 5, num=100)
y = x ** 3 + np.random.normal(size=100)

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Fit polynomial regression models with different degrees of polynomials
degrees = [1, 2, 3, 4, 5]
train_errors, test_errors = [], []
for degree in degrees:
    # Transform the features to polynomial features
    poly_features = PolynomialFeatures(degree=degree)
    x_poly_train = poly_features.fit_transform(x_train.reshape(-1, 1))
    x_poly_test = poly_features.transform(x_test.reshape(-1, 1))
    # Fit the linear regression model to the polynomial features
    model = LinearRegression()
    model.fit(x_poly_train, y_train)
    # Evaluate the model on the training and testing data
    y_pred_train = model.predict(x_poly_train)
    y_pred_test = model.predict(x_poly_test)
    train_error = mean_squared_error(y_train, y_pred_train)
    test_error = mean_squared_error(y_test, y_pred_test)
    train_errors.append(train_error)
    test_errors.append(test_error)

# Plot the training and testing errors as a function of the degree of polynomial
plt.plot(degrees, train_errors, label='Training error')
plt.plot(degrees, test_errors, label='Testing error')
plt.legend()
plt.xlabel('Degree of polynomial')
plt.ylabel('Mean squared error')
plt.show()
In this code, we generate synthetic data with a non-linear relationship and split it into training and testing sets. Then, we fit polynomial regression models with different degrees of polynomials and evaluate their performance on the training and testing data. Finally, we plot the training and testing errors as a function of the degree of polynomial to visualise the bias-variance trade-off.
The plot shows that the training error decreases as the degree of polynomial increases, indicating that the model becomes more complex and fits the training data better.
Because of the scale of the graph, we can’t see that the test error starts to increase again as the polynomial degree grows. Printing the test errors makes this visible:
print(test_errors)
[367.3606600042872, 367.89470510195736, 0.8264371039076602, 0.8460879311084801, 0.8399674514960231]
The testing error first drops sharply and then rises slightly as the degree of polynomial increases, indicating that the model first reaches a good balance between bias and variance and then begins to overfit. In this case, we would choose the third-degree polynomial, which has the lowest testing error (0.826). That is unsurprising, since the data was generated from a cubic function.
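We can also pick that degree programmatically from the errors collected in the loop above:

# Select the degree with the lowest testing error
best_degree = degrees[int(np.argmin(test_errors))]
print("Best degree:", best_degree)  # prints 3 for the run shown above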
Bias-variance trade-off in SVMs
Support Vector Machines (SVMs) are robust machine learning algorithms that can be used for classification and regression tasks. The bias-variance trade-off applies to SVMs as well.
In SVMs, the trade-off between bias and variance is controlled by the choice of the regularisation parameter C and the kernel function. The regularisation parameter C controls the penalty for misclassifying points in the training data. A higher value of C leads to a more complex model that fits the training data more closely but may overfit. Conversely, a lower value of C leads to a simpler model with higher bias but lower variance.
The choice of the kernel function also affects the bias-variance trade-off in SVMs. The linear kernel is a simple, low-variance option that works well when the data is linearly separable. On the other hand, non-linear kernels such as the polynomial or Gaussian (RBF) kernels are more complex and can capture more complex patterns in the data. However, they may lead to overfitting and higher variance if the regularisation parameter is not chosen carefully.
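As an illustrative sketch of this kernel effect (synthetic data, assumed parameters), we can compare a linear and an RBF kernel on data that is not linearly separable:

from sklearn.svm import SVC
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score

# Two concentric circles: no straight line can separate the classes
X, y = make_circles(n_samples=500, noise=0.1, factor=0.4, random_state=0)

for kernel in ["linear", "rbf"]:
    accuracy = cross_val_score(SVC(kernel=kernel, C=1.0), X, y, cv=5).mean()
    print(f"{kernel} kernel: mean cross-validated accuracy = {accuracy:.3f}")

The linear kernel’s high bias shows up as poor accuracy on this data, while the RBF kernel can bend its decision boundary around the inner circle.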
To find the optimal value of the regularisation parameter C and the kernel function, cross-validation is used to evaluate the model’s performance on multiple data splits. This helps select the hyperparameters that best balance bias and variance.
Keep the bias-variance trade-off in mind when working with SVMs or any other machine learning algorithm. Choosing an appropriate regularisation parameter and kernel function helps strike a good balance between underfitting and overfitting and leads to models that generalise well to new data.
Bias-variance trade-off in SVMs Python code example
Here’s an example code snippet in Python using scikit-learn to demonstrate the bias-variance trade-off in SVMs:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import make_classification
# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=0, random_state=42)
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define a range of values for the regularization parameter C to search over
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
# Create a grid search object to search over hyperparameters
grid_search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
# Fit the grid search object to the training data
grid_search.fit(X_train, y_train)
# Print the best hyperparameters found by the grid search
print("Best hyperparameters:", grid_search.best_params_)
# Evaluate the model on the test data
svm = grid_search.best_estimator_
print("Test accuracy:", svm.score(X_test, y_test))
Here, we first generate a synthetic dataset with 1000 samples and 10 features, then split the data into training and test sets using an 80/20 split.
Next, we define a range of values for the regularisation parameter C for a grid search object to search over, using 5-fold cross-validation to evaluate each candidate value.
We fit the grid search object to the training data and print the best hyperparameters it found. We then evaluate the SVM model with the best hyperparameters on the test data using the score method.
We can find an optimal balance between bias and variance in the SVM model by searching over a range of values for the regularisation parameter C and selecting the best hyperparameters based on cross-validation performance. This approach can help prevent overfitting and ensure the model generalises well to new, unseen data.
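If we also wanted the search to choose the kernel and its width, we could extend the grid. This is a sketch, assuming the same imports and data as above; the parameter ranges are illustrative:

# Search over the kernel and its parameters as well as C
param_grid = [
    {'kernel': ['linear'], 'C': [0.01, 0.1, 1, 10, 100]},
    {'kernel': ['rbf'], 'C': [0.01, 0.1, 1, 10, 100], 'gamma': [0.01, 0.1, 1]},
]
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
print("Best hyperparameters:", grid_search.best_params_)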
Conclusion
The bias-variance trade-off is a fundamental concept in machine learning that refers to the trade-off between a model’s ability to fit the training data (low bias) and its ability to generalise to new, unseen data (low variance). A model with high bias is too simple to capture the underlying pattern in the data, while a model with high variance is too complex and overfits the data.
The goal of a machine learning practitioner is to find the optimal balance between bias and variance by choosing an appropriate model complexity and regularisation technique. Then, cross-validation can be used to evaluate the model’s performance on multiple data splits and select the model that achieves the best balance between bias and variance.
By understanding the bias-variance trade-off, machine learning practitioners can develop models that can capture the underlying pattern in the data and generalise well to new, unseen data.
Now that you understand the bias-variance trade-off, make sure you also understand endogenous and exogenous variables and the problems they cause.