L1 And L2 Regularization Explained, When To Use Them & Practical How To Examples

by Neri Van Otten | May 26, 2023 | Data Science, Machine Learning

L1 and L2 regularization are techniques commonly used in machine learning and statistical modelling to prevent overfitting and improve the generalization ability of a model. Both add a penalty term to the loss function, encouraging the model to have smaller parameter values.

What is Regularization?

Regularization is a technique used in machine learning and statistical modelling to prevent overfitting and improve the generalization ability of models. When a model is overfitting, it has learned the training data too well and may not perform well on new, unseen data.

[Figure: underfitting vs overfitting vs optimised fit]

Regularization introduces additional constraints or penalties to the model during the training process, aiming to control the complexity of the model and avoid over-reliance on specific features or patterns in the training data. By doing so, regularization helps to strike a balance between fitting the training data well and generalizing well to new data.

The most common regularization techniques used are L1 regularization (Lasso), L2 regularization (Ridge), and Elastic Net regularization. L1 regularization adds the sum of the absolute values of the model’s coefficients to the loss function, encouraging sparsity and feature selection. L2 regularization adds the sum of the squared values of the model’s coefficients, which encourages smaller but non-zero coefficients. Finally, Elastic Net regularization combines both L1 and L2 regularization.

How does regularization work?

Regularization is typically achieved by adding a term to the loss function during training. The regularization term penalizes certain model parameters and adjusts them to minimize the total loss, which consists of both the original loss (such as mean squared error or cross-entropy) and the regularization term. The strength of regularization is controlled by a regularization parameter that determines the balance between fitting the data and reducing the impact of large coefficients.
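To make this concrete, here is a minimal NumPy sketch (the weights, targets and λ below are made-up values for illustration) showing how a penalty term is simply added on top of the original loss:

import numpy as np

# Made-up predictions, targets and coefficients, purely for illustration
y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.0])
w = np.array([0.8, -1.2, 0.0, 0.3])  # model coefficients
lam = 0.1                            # regularization parameter λ

mse = np.mean((y_pred - y_true) ** 2)    # original loss
l1_penalty = lam * np.sum(np.abs(w))     # L1 (Lasso) term: λ * Σ|wi|
l2_penalty = lam * np.sum(w ** 2)        # L2 (Ridge) term: λ * Σ(wi^2)

print("MSE only:        ", mse)
print("MSE + L1 penalty:", mse + l1_penalty)
print("MSE + L2 penalty:", mse + l2_penalty)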

Regularization helps to prevent overfitting by discouraging complex models that may fit noise or irrelevant patterns in the training data. Instead, it promotes simpler models that capture the underlying patterns and generalize well to new data. Regularization is particularly useful when dealing with limited data, high-dimensional datasets, or models with many parameters.

[Figure: L1 and L2 regularization promote simpler models that capture the underlying patterns and generalize well to new data]

It’s important to note that regularization is a form of bias introduced to the model. Therefore, the choice of regularization technique and the regularization parameter must be carefully selected and tuned based on the specific problem and dataset to strike the right balance between bias and variance in the model’s performance.

Types of regularization

1. L1 regularization

L1 regularization, also known as Lasso regularization, adds the sum of the absolute values of the model’s coefficients to the loss function. It encourages sparsity in the model by shrinking some coefficients to exactly zero. This has the effect of performing feature selection, as the model can effectively ignore irrelevant or less important features. L1 regularization is particularly useful when dealing with high-dimensional datasets where feature selection is desired.

Mathematically, the L1 regularization term can be written as:

L1 regularization = λ * Σ|wi|

Here, λ is the regularization parameter that controls the strength of regularization, wi represents the individual model coefficients and the sum is taken over all coefficients.
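As a quick illustration, the following is a minimal sketch using scikit-learn’s Lasso estimator on synthetic data; the dataset shape and the alpha value (scikit-learn’s name for λ here) are arbitrary choices for demonstration, not values prescribed by this article:

import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first three of twenty features actually matter
rng = np.random.RandomState(0)
X = rng.randn(200, 20)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + 0.1 * rng.randn(200)

# alpha plays the role of λ: a larger alpha means a stronger L1 penalty
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

# Most coefficients are driven exactly to zero (implicit feature selection)
print("Non-zero coefficients:", np.sum(lasso.coef_ != 0))
print(np.round(lasso.coef_, 3))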

2. L2 regularization

L2 regularization, also known as Ridge regularization, adds the sum of the squared values of the model’s coefficients to the loss function. Unlike L1 regularization, L2 regularization does not force the coefficients to be exactly zero but instead encourages them to be small. L2 regularization helps prevent overfitting by spreading weight across correlated features rather than letting any single feature dominate. It is advantageous when there are correlations between the input features.

Mathematically, the L2 regularization term can be written as:

L2 regularization = λ * Σ(wi^2)

Similar to L1 regularization, λ is the regularization parameter, and wi represents the model coefficients. The sum is taken over all coefficients, and the squares of the coefficients are summed.
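A matching sketch with scikit-learn’s Ridge estimator on the same kind of synthetic data; note that the coefficients shrink towards zero but, unlike Lasso, are typically not driven exactly to zero:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.randn(200, 20)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.5 * X[:, 2] + 0.1 * rng.randn(200)

# alpha is the λ in the formula above: a larger alpha means smaller weights
ridge = Ridge(alpha=1.0)
ridge.fit(X, y)

print("Coefficients exactly zero:", np.sum(ridge.coef_ == 0))  # typically 0
print(np.round(ridge.coef_, 3))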

The choice between L1 and L2 regularization depends on the specific problem and the characteristics of the data. For example, L1 regularization produces sparse models, which can be advantageous when feature selection is desired. L2 regularization, on the other hand, encourages small but non-zero coefficients and can be more suitable when there are strong correlations between features.

In practice, a combination of both L1 and L2 regularization, known as Elastic Net regularization, is often used to benefit from the strengths of both techniques. Elastic Net regularization adds a linear combination of L1 and L2 regularization terms to the loss function, each controlled by its own regularization parameter. This allows for simultaneous feature selection and coefficient shrinkage.

3. Elastic Net regularization

Elastic Net regularization is a technique that combines both L1 and L2 regularization to achieve a balance between feature selection and weight shrinkage. During model training, it incorporates both the L1 and L2 regularization terms in the loss function.

The Elastic Net regularization term is defined as:

Elastic Net regularization = λ1 * Σ|wi| + λ2 * Σ(wi^2)

Here, wi represents the individual model coefficients and the sums are taken over all coefficients. λ1 and λ2 are regularization parameters that control the strength of L1 and L2 regularization, respectively.

Elastic Net regularization combines the advantages of both L1 and L2 regularization. The L1 regularization term encourages sparsity and feature selection, driving some coefficients to exactly zero. This helps in selecting the most relevant features and reducing the complexity of the model. On the other hand, the L2 regularization term encourages smaller but non-zero coefficients, preventing any one feature from dominating the model’s predictions and improving the model’s stability.

The values of λ1 and λ2 control the balance between L1 and L2 regularization. A higher value of λ1 emphasizes sparsity and feature selection, while a higher value of λ2 emphasizes weight shrinkage and overall complexity control.

Elastic Net regularization is particularly useful when dealing with datasets that have high-dimensional features and strong feature correlations. It provides a flexible regularization approach that allows for a trade-off between feature selection and weight shrinkage based on the specific problem and the desired behaviour of the model.

Implementing Elastic Net regularization involves modifying the loss function and the weight update step during model training, similar to L1 and L2 regularization. In practice, libraries such as scikit-learn provide efficient, ready-made implementations.
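For example, here is a minimal sketch with scikit-learn’s ElasticNet estimator, which is parameterized by an overall penalty strength (alpha) and a mixing parameter (l1_ratio) rather than by λ1 and λ2 directly; the values below are arbitrary choices for illustration:

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.RandomState(0)
X = rng.randn(200, 20)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(200)

# alpha sets the overall penalty strength; l1_ratio sets the L1/L2 mix
# (l1_ratio=1.0 is pure Lasso, l1_ratio=0.0 is pure Ridge)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)

print("Non-zero coefficients:", np.sum(enet.coef_ != 0))
print(np.round(enet.coef_, 3))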

What is the difference between L1 and L2 regularization?

The main difference between L1 and L2 regularization lies in the penalty terms added to the loss function during training. Here are the key differences between L1 and L2 regularization:

  1. Penalty Terms:
  • L1 Regularization (Lasso): L1 regularization adds the sum of the absolute values of the model’s coefficients (weights) to the loss function. It can be represented as λ * Σ|wi|, where wi represents the individual coefficients and λ is the regularization parameter. The L1 penalty promotes sparsity by driving some coefficients to exactly zero, effectively performing feature selection.
  • L2 Regularization (Ridge): L2 regularization adds the sum of the squared values of the model’s coefficients to the loss function. It can be represented as λ * Σ(wi^2), where wi represents the individual coefficients and λ is the regularization parameter. The L2 penalty encourages smaller but non-zero coefficients, preventing any one feature from dominating the model’s predictions and promoting overall weight shrinkage.
  2. Effect on Coefficients:
  • L1 Regularization: L1 regularization tends to produce sparse models by driving some coefficients to zero. It performs automatic feature selection, as features with zero coefficients are effectively ignored by the model. This can be advantageous when dealing with high-dimensional datasets or when there is a need for feature selection and interpretability.
  • L2 Regularization: L2 regularization leads to smaller but non-zero coefficients for all features. It reduces the impact of individual features but does not drive coefficients to zero. The L2 penalty is effective when dealing with strong feature correlations or when there is no specific need for feature selection, as it allows all features to contribute to the model’s predictions.
  3. Complexity Control:
  • L1 Regularization: L1 regularization promotes sparsity and feature selection, which can help control model complexity by disregarding irrelevant or less important features. It leads to models with fewer nonzero coefficients and a simpler representation.
  • L2 Regularization: L2 regularization controls model complexity by shrinking the magnitudes of all coefficients. It provides more evenly distributed weight shrinkage among features, preventing any one feature from dominating the model’s predictions. L2 regularization effectively reduces overfitting and improves the model’s stability.
  4. Optimization:
  • L1 Regularization: The L1 regularization term is not differentiable at zero, which poses challenges in optimization. However, subgradient methods can effectively optimize the loss function with L1 regularization.
  • L2 Regularization: The L2 regularization term is smooth and differentiable, so it can be optimized efficiently with standard gradient-based optimization algorithms.

In practice, a combination of L1 and L2 regularization, known as Elastic Net regularization, is often used to leverage the strengths of both techniques and find a balance between sparsity and weight shrinkage. The choice between L1 and L2 regularization depends on the specific problem, the characteristics of the data, and the desired behaviour of the model.

Advantages and disadvantages

L1 (Lasso) Regularization
  • Advantages: performs feature selection by driving some coefficients to zero; helps in dealing with high-dimensional datasets; can handle irrelevant or less important features; useful for building sparse models.
  • Disadvantages: overly aggressive sparsity can discard useful information; not effective when there are strong correlations between features; computationally more expensive to optimize than L2 regularization.

L2 (Ridge) Regularization
  • Advantages: helps to prevent overfitting and improve generalization; effective when there are strong correlations between features; computes stable solutions; computationally efficient.
  • Disadvantages: does not perform feature selection like L1 regularization; the resulting model may still contain many small non-zero coefficients, which can be a drawback for very high-dimensional datasets where a sparse model is preferred.

When should L1 regularization be used over L2 regularization, and vice versa?

L1 and L2 regularization have different characteristics, and the choice between them depends on the specific problem and the desired behaviour of the model. Here are some guidelines for when to use L1 or L2 regularization:

Use L1 regularization (Lasso):

  1. Feature selection: When you have a high-dimensional dataset with many features, and you want to perform feature selection by driving some coefficients to precisely zero, L1 regularization is a suitable choice. It encourages sparsity, effectively selecting the most relevant features and disregarding the irrelevant or less important ones.
  2. Interpretable models: If interpretability is important, L1 regularization can be helpful as it produces sparse models with only a subset of features having non-zero coefficients. This makes it easier to see which features most influence the model’s predictions.

Use L2 regularization (Ridge):

  1. Strong feature correlations: When your dataset contains highly correlated features, L2 regularization is more effective than L1 regularization. L2 regularization distributes the impact of correlated features more evenly among the coefficients, preventing any one feature from dominating the model’s predictions.
  2. Generalization performance: L2 regularization is known to improve the generalization performance of models by reducing overfitting. It is generally a good choice when there is no specific need for feature selection, and you want to control the overall complexity of the model.

Sometimes, a combination of L1 and L2 regularization, Elastic Net regularization, can be used. Elastic Net regularization balances feature selection (L1 regularization) and weight shrinkage (L2 regularization). It is useful when dealing with datasets that have high-dimensional features and strong feature correlations.

It’s important to note that the choice between L1 and L2 regularization is not always clear-cut and may require experimentation and evaluation of the model’s performance using different regularization techniques. Additionally, the regularization parameter must be carefully tuned to find the right balance between bias and variance in the model.

L1 and L2 regularization in deep learning

L1 and L2 regularization can also be applied in deep learning to combat overfitting and improve the generalization of neural network models.

In deep learning, L1 and L2 regularization are typically incorporated into the training process by adding their corresponding penalty terms to the loss function. The regularization terms are multiplied by a regularization parameter (λ) to control the strength of regularization.

  • For L1 regularization in deep learning, the regularization term is the sum of the absolute values of all the weights in the neural network. This encourages sparsity in the model, effectively setting some weights to zero and performing feature selection. L1 regularization helps reduce the model’s complexity and improve its interpretability.
  • For L2 regularization in deep learning, the regularization term is the sum of the squared values of all the weights in the neural network. It penalizes large weight values and encourages smaller weights, preventing any one weight from dominating the model. L2 regularization helps to control the model’s capacity and reduce the impact of noise in the data.

The total loss function for deep learning models with regularization is the combination of the original loss function (such as cross-entropy or mean squared error) and the regularization term:

Total loss = Original Loss + λ * Regularization Term

The regularization parameter λ controls the amount of regularization applied. A larger value of λ increases the regularization strength, resulting in more shrinkage of the weights.
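As a hedged illustration, here is a minimal PyTorch-style training step (assuming PyTorch is available; the network, batch and λ values are made up) that adds both penalty terms to the original loss exactly as in the formula above:

import torch
import torch.nn as nn

# Tiny made-up network and batch, purely for illustration
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

lambda_l1, lambda_l2 = 1e-5, 1e-4   # regularization strengths (arbitrary)
x, y = torch.randn(64, 10), torch.randn(64, 1)

optimizer.zero_grad()
original_loss = criterion(model(x), y)

# Regularization terms summed over all weights in the network
l1_term = sum(p.abs().sum() for p in model.parameters())
l2_term = sum(p.pow(2).sum() for p in model.parameters())

total_loss = original_loss + lambda_l1 * l1_term + lambda_l2 * l2_term
total_loss.backward()
optimizer.step()

For pure L2 regularization, many practitioners instead pass weight_decay to the optimizer (for example torch.optim.SGD(..., weight_decay=1e-4)), which has an equivalent effect for plain stochastic gradient descent.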

In practice, a common approach is to use a combination of both L1 and L2 regularization, known as Elastic Net regularization. This balances feature selection (L1 regularization) and weight shrinkage (L2 regularization).

In deep learning, the choice between L1 and L2 regularization (or their combination) depends on the specific problem, the data’s characteristics, and the model’s desired behaviour. Experimentation and tuning of the regularization parameters are often required to achieve the best results.

Practical examples of regularization

Here are a few practical examples of how regularization techniques, such as L1 and L2 regularization, can be applied in different machine learning scenarios:

  1. Linear Regression: Regularization is commonly used in linear regression models to prevent overfitting. By adding L1 or L2 regularization to the loss function, the model can be controlled to have smaller coefficients or drive some coefficients to zero. This helps in improving the model’s generalization performance. Regularization is particularly useful when dealing with high-dimensional datasets or datasets with multicollinearity.
  2. Logistic Regression: Logistic regression models can also benefit from regularization to prevent overfitting and improve generalization. Like linear regression, L1 or L2 regularization can be applied to the logistic regression loss function to control the complexity of the model and shrink the coefficient values. This is especially important when the feature space is large or features may be correlated; a minimal scikit-learn sketch of both penalties is shown after this list.
  3. Neural Networks: Regularization techniques are vital in training neural networks, especially when dealing with complex models and large datasets. L2 regularization, known as weight decay in the context of neural networks, is commonly applied to the weights of the neural network layers. It helps prevent overfitting by shrinking the weights, making the network less sensitive to small changes in input data. Dropout, another regularization technique, randomly sets a fraction of the neuron outputs to zero during training, effectively reducing interdependencies and preventing co-adaptation of neurons.
  4. Support Vector Machines (SVM): In SVM, regularization is achieved using the regularization parameter C. A higher value of C results in less regularization, allowing the model to fit the training data more closely. Conversely, a lower value of C increases the regularization strength, promoting a wider margin and better generalization. L1 regularization can also be applied to SVMs to perform feature selection by driving some feature weights to zero.
  5. Image Classification: Regularization is widely used in image classification tasks, especially when training deep learning models like Convolutional Neural Networks (CNNs). L2 regularization is often employed to control the complexity of CNN models and prevent overfitting, improving generalization and performance on unseen images.
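Picking up the logistic regression example above, here is a minimal scikit-learn sketch on synthetic data (the dataset, C value and solver are illustrative assumptions) contrasting the L1 and L2 penalties; note that in scikit-learn, C is the inverse of the regularization strength:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic classification data with many uninformative features
X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=5, random_state=0)

# L1-penalized logistic regression (the liblinear solver supports the L1 penalty)
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
l1_model.fit(X, y)

# L2-penalized logistic regression (scikit-learn's default penalty)
l2_model = LogisticRegression(penalty="l2", C=0.5)
l2_model.fit(X, y)

print("L1 non-zero coefficients:", (l1_model.coef_ != 0).sum())
print("L2 non-zero coefficients:", (l2_model.coef_ != 0).sum())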

These are just a few examples, but regularization techniques apply to various machine learning algorithms and tasks. The specific choice and application of regularization depend on the problem’s nature, the data’s characteristics, and the model’s desired behaviour.

What are common mistakes or pitfalls to avoid when using regularization?

When using regularization techniques, it’s essential to be aware of potential mistakes and pitfalls that can affect the effectiveness and performance of the model. Here are some common mistakes to avoid when using regularization:

  1. Improper scaling of features: Regularization assumes that all features are on a similar scale. If the features are not correctly scaled, some may dominate the regularization term, leading to biased regularization effects. Make sure to scale your features before applying regularization techniques.
  2. Incorrect choice of regularization parameter: The regularization parameter (e.g., λ in L1 or L2 regularization) controls the strength of regularization. Choosing an inappropriate value can result in under- or over-regularization. It’s crucial to tune the regularization parameter using cross-validation or other validation techniques to find the optimal balance between bias and variance in the model (see the sketch after this list).
  3. Ignoring feature interactions: Regularization techniques like L1 or L2 regularization treat each feature independently. However, in some cases, the interaction between features may be significant for the model’s performance. Ignoring feature interactions can lead to suboptimal results. Consider feature engineering techniques or other models that capture feature interactions if needed.
  4. Inadequate feature selection: Regularization can help with feature selection, but it’s vital to correctly assess the relevance and importance of features. Blindly relying on regularization to select features without considering domain knowledge or thorough feature analysis may result in excluding important information and affecting the model’s performance.
  5. Ignoring other sources of regularization: Regularization is just one tool to prevent overfitting. Other techniques, such as early stopping, dropout, or data augmentation, may also be beneficial in improving the generalization ability of models. Consider using a combination of regularization techniques to enhance model performance.
  6. Over-regularization: Applying too much regularization can lead to underfitting, where the model becomes too simple and fails to capture the underlying patterns in the data. It’s vital to balance regularization and model complexity to ensure optimal performance.
  7. Over-reliance on sparsity: Regularization techniques like L1 regularization can drive some coefficients to exactly zero, resulting in sparse models. While sparsity is often desirable, driving too many coefficients to zero can discard useful information. Consider the trade-off between model simplicity and predictive performance when applying regularization.
  8. Not evaluating regularization performance: Regularization parameters and techniques should be considered and validated on appropriate validation or test data. Simply applying regularization without assessing its impact on model performance may lead to suboptimal results.
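To illustrate the first two points above, here is a hedged scikit-learn sketch (synthetic data and an arbitrary grid of alpha values) that scales the features inside a pipeline and tunes the regularization strength by cross-validation:

from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, purely for illustration
X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

# Scale the features and apply the Ridge penalty inside one pipeline so
# the scaler is re-fitted on each cross-validation split
pipeline = make_pipeline(StandardScaler(), Ridge())

# Tune the regularization strength (alpha, i.e. λ) by cross-validation
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print("Best alpha:", search.best_params_["ridge__alpha"])
print("Cross-validated R^2:", round(search.best_score_, 3))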

Overall, it’s crucial to approach regularization carefully, considering the specific problem, the data, and the desired model behaviour. Regularization should be part of a well-designed modelling pipeline with appropriate feature engineering, validation, and evaluation techniques to achieve the best possible performance.

L2 regularization in Python from scratch

To implement L2 regularization from scratch in Python, you must modify the loss function and weight update step during training. Here’s an example of how you can implement L2 regularization for a simple linear regression model:

import numpy as np

# Generate sample data
np.random.seed(42)
X = np.random.rand(100, 1)  # Input features
y = 3 * X + 2 + np.random.randn(100, 1)  # Output labels with noise

# Add bias term to input features
X_b = np.c_[np.ones((100, 1)), X]

# Define regularization parameter
lambd = 0.1

# Initialize random weights
np.random.seed(42)
theta = np.random.randn(2, 1)

# Training loop
epochs = 1000
learning_rate = 0.1

for epoch in range(epochs):
    # Compute predictions
    y_pred = X_b.dot(theta)

    # Compute mean squared error loss
    mse_loss = np.mean((y_pred - y) ** 2)
    
    # Compute L2 regularization term
    l2_regularization = 0.5 * lambd * np.sum(theta[1:]**2)

    # Compute total loss (MSE loss + L2 regularization)
    total_loss = mse_loss + l2_regularization

    # Compute gradients
    gradients = 2 / len(X_b) * X_b.T.dot(y_pred - y)
    
    # Add L2 regularization term to weight gradients
    gradients[1:] += lambd * theta[1:]
    
    # Update weights
    theta -= learning_rate * gradients

    if epoch % 100 == 0:
        print("Epoch:", epoch, "Total Loss:", total_loss)

# Print final weights
print("Final Weights:")
print(theta)

This example uses a simple linear regression model with one input feature. We initialize random weights and perform gradient descent to minimize the mean squared error loss with an additional L2 regularization term. The L2 regularization term is added to the gradients during the weight update step, where we add the weights themselves (excluding the bias term) multiplied by the regularization parameter.
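For comparison, switching the same loop to L1 regularization only changes the penalty term and its (sub)gradient; a sketch of the two lines that would differ, reusing lambd, theta and gradients from the loop above:

# L1 penalty on the non-bias weights: λ * Σ|wi|
l1_regularization = lambd * np.sum(np.abs(theta[1:]))

# Subgradient of the L1 term: the sign of the weights multiplied by λ
gradients[1:] += lambd * np.sign(theta[1:])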

Please note that this is a basic implementation of L2 regularization from scratch. In practice, it’s recommended to use machine learning libraries like scikit-learn or TensorFlow, which provide more optimized and efficient implementations of regularization techniques.

Key Takeaways

  • Regularization techniques such as L1 and L2 regularization are widely used in machine learning and statistical modelling to address the problem of overfitting and improve the generalization ability of models.
  • L1 regularization, also known as Lasso regularization, adds the sum of the absolute values of the model’s coefficients to the loss function, promoting sparsity and feature selection. It effectively deals with high-dimensional datasets and can help build sparse models. However, overly aggressive sparsity can discard useful information, and L1 can struggle when features are strongly correlated.
  • L2 regularization, also known as Ridge regularization, adds the sum of the squared values of the model’s coefficients to the loss function, encouraging smaller but non-zero coefficients. It helps prevent overfitting, handles strong correlations between features, and produces stable solutions. However, it does not perform feature selection like L1 regularization.
  • The choice between L1 and L2 regularization (or their combination, Elastic Net regularization) depends on the specific problem, the data’s characteristics, and the model’s desired behaviour. L1 regularization is suitable when feature selection is expected, while L2 regularization is effective when dealing with strong feature correlations. Elastic Net regularization combines the strengths of both techniques.
  • Implementing L1 and L2 regularization from scratch involves modifying the loss function and weight update step during training. However, it’s important to note that in practice, using machine learning libraries with optimized implementations is recommended for efficiency and stability.
  • Regularization is a valuable tool for improving model performance and reducing overfitting. Still, it requires careful selection and tuning of regularization parameters to strike the right balance between bias and variance in the model’s performance.

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.

