Hyperparameter tuning is critical to machine learning and deep learning model development. Machine learning algorithms typically have specific settings or configurations called hyperparameters that are not learned from the data but set by the user before training the model. These hyperparameters significantly impact the performance and behaviour of the model.
Hyperparameter tuning involves finding the optimal values for these hyperparameters to maximise the model’s performance. By selecting the right hyperparameter combination, we can enhance a model’s accuracy, generalisation capabilities, and convergence speed.
Hyperparameter tuning is not a one-size-fits-all approach and requires careful consideration of various factors such as the dataset, model architecture, and the problem being addressed. As a result, it often involves a combination of manual exploration, intuition, and systematic search methods to identify the best hyperparameters.
Hyperparameter tuning is critical in machine learning and deep learning model development. By finding the optimal hyperparameter values, we can improve model performance and achieve more accurate and reliable predictions.
There are a few common systematic search methods used for hyperparameter tuning, with grid search, random search, and Bayesian optimisation being the most common examples. Grid search exhaustively evaluates all possible combinations of hyperparameters from a predefined grid, while random search randomly samples hyperparameter values from a defined distribution. Bayesian optimisation, in turn, employs probabilistic models to explore the hyperparameter space intelligently based on previous evaluations.
Here is an ordered list of the hyperparameter tuning strategies most commonly used in machine learning projects, each covered in more detail below:
1. Grid search
2. Random search
3. Bayesian optimisation
It’s important to note that hyperparameter tuning should be performed using a separate validation set or cross-validation to avoid overfitting the hyperparameters to the training data.
Additionally, tuning hyperparameters depends on the specific machine learning algorithm, and not all hyperparameters may apply to every model.
Hyperparameter tuning is an iterative and computationally expensive process. Still, it can significantly improve a model’s performance and generalisation ability by finding the optimal set of hyperparameters for a given task.
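To make the role of cross-validation concrete, here is a minimal sketch of manually exploring a single hyperparameter with scikit-learn’s cross_val_score; the dataset, model, and candidate values are illustrative assumptions, not recommendations:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
# Load a small example dataset
X, y = load_iris(return_X_y=True)
# Manually explore a handful of candidate values for a single hyperparameter
for max_depth in [2, 4, 8, None]:
    model = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    # Cross-validation scores each setting on held-out folds rather than the training data
    scores = cross_val_score(model, X, y, cv=5)
    print(f"max_depth={max_depth}: mean accuracy={scores.mean():.3f}")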
Grid search is a technique for hyperparameter tuning in machine learning that involves defining a grid of hyperparameter values and systematically searching all possible combinations of these values. It is a brute-force approach that exhaustively evaluates the model’s performance for each combination of hyperparameters using cross-validation or a separate validation set.
Here are the steps involved in performing a grid search:
1. Choose the hyperparameters to tune and define a grid of candidate values for each one.
2. Select an evaluation strategy, typically k-fold cross-validation or a separate validation set, along with a performance metric.
3. Train and evaluate the model for every combination of hyperparameter values in the grid.
4. Select the combination that achieves the best score.
5. Retrain the final model on the full training set using the best hyperparameters and assess it on a held-out test set.
Grid search exhaustively searches the entire hyperparameter grid, evaluating all possible combinations. While it is guaranteed to find the best hyperparameters within the specified grid, it can be computationally expensive, especially when dealing with many hyperparameters or a wide range of values.
To mitigate the computational cost, techniques like randomised search and Bayesian optimisation can be used, which efficiently sample hyperparameter combinations. However, grid search is still valuable when the search space is small or you want to ensure a comprehensive search across all possible combinations.
Random search is a technique for hyperparameter tuning in machine learning that involves randomly sampling hyperparameter values from predefined ranges or distributions. Unlike grid search, which exhaustively evaluates all possible combinations, random search explores a smaller subset of the hyperparameter space through random sampling.
Here are the steps involved in performing a random search:
1. Define a range or distribution for each hyperparameter to sample from, for example:
param_dist = {
    'learning_rate': [0.01, 0.1, 1.0],
    'n_estimators': [100, 200, 500]
}
2. Choose the number of iterations, i.e. how many random combinations to evaluate.
3. For each iteration, randomly sample a value for each hyperparameter, then train and evaluate the model using cross-validation or a separate validation set.
4. Select the combination that achieves the best score and retrain the final model with it.
Random search offers several advantages over grid search. It can be more efficient when the search space is large, as it does not exhaustively evaluate all possible combinations. Randomly sampling hyperparameters can better explore the search space, potentially discovering better combinations. Additionally, random search is more flexible, as it allows you to define continuous or discrete distributions for hyperparameters.
However, it’s important to note that random search is not guaranteed to find the absolute best hyperparameters, as it samples randomly from the search space. It relies on the principle of stochastic optimisation and the expectation that good combinations will be found through random sampling. To improve the efficiency of random search, you can increase the number of iterations or use techniques like early stopping to terminate unpromising combinations.
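As a sketch of how this can be done in practice, the example below runs scikit-learn’s RandomizedSearchCV over a gradient boosting classifier using distributions from scipy.stats; the dataset, estimator, distributions, and iteration count are illustrative assumptions rather than recommendations:
from scipy.stats import randint, uniform
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
X, y = load_iris(return_X_y=True)
# Continuous and discrete distributions to sample hyperparameter values from
param_dist = {
    'learning_rate': uniform(0.01, 0.99),  # continuous values in [0.01, 1.0)
    'n_estimators': randint(100, 501),     # integers from 100 to 500
}
random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=20,  # number of random combinations to evaluate
    cv=5,
    random_state=42,
)
random_search.fit(X, y)
print("Best Hyperparameters: ", random_search.best_params_)
print("Best CV Accuracy: ", random_search.best_score_)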
Bayesian optimisation is a sequential model-based technique for hyperparameter tuning and black-box function optimisation. It combines probability models (often Gaussian processes) with acquisition functions to efficiently search for the optimal set of hyperparameters.
Here are the critical steps involved in Bayesian optimisation:
1. Define the search space and the objective function to optimise (typically a validation score as a function of the hyperparameters).
2. Evaluate the objective at a few initial hyperparameter configurations and fit a probabilistic surrogate model (often a Gaussian process) to the results.
3. Use an acquisition function, such as expected improvement, to choose the next configuration to evaluate, balancing exploration and exploitation.
4. Evaluate the objective at the chosen configuration and update the surrogate model with the new result.
5. Repeat steps 3 and 4 until the evaluation budget is exhausted, then return the best configuration found.
Bayesian optimisation provides several advantages over other hyperparameter tuning methods. First, it efficiently explores the search space by selecting hyperparameter configurations based on their expected performance. Additionally, it incorporates a probabilistic model that captures uncertainty, effectively balancing exploration and exploitation. This makes it particularly useful when the black-box function is expensive or time-consuming to evaluate.
Various libraries and frameworks, such as Optuna, Hyperopt, and GPyOpt, provide implementations of Bayesian optimisation that can be readily used in machine learning workflows.
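As a brief, hedged sketch of what this looks like in practice, the example below uses Optuna (whose default sampler is a tree-structured Parzen estimator, a form of Bayesian optimisation) to tune two hyperparameters of a random forest; the search space, trial count, and dataset are illustrative assumptions:
import optuna
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
def objective(trial):
    # Optuna proposes values for each hyperparameter based on previous trials
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    max_depth = trial.suggest_int('max_depth', 2, 16)
    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
    # The objective to maximise: mean cross-validated accuracy
    return cross_val_score(model, X, y, cv=5).mean()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=30)
print("Best Hyperparameters: ", study.best_params)
print("Best CV Accuracy: ", study.best_value)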
In machine learning models, several common hyperparameters often require tuning to optimise model performance. The specific hyperparameters to tune can vary depending on the algorithm and model architecture used. Here are some of the commonly tuned hyperparameters:
- Learning rate: controls the step size used by gradient-based algorithms when updating model parameters.
- Regularisation strength: controls the penalty applied to model complexity (for example, C in logistic regression and SVMs, or alpha in ridge and lasso regression).
- Number of estimators and tree depth: control the size and complexity of tree-based ensembles such as random forests and gradient boosting.
- Minimum samples per split or leaf: control how finely decision trees are allowed to partition the data.
- Number of neighbours (k): controls the smoothness of predictions in k-nearest neighbours models.
- Kernel type and kernel parameters: determine the shape of the decision boundary in kernel methods such as SVMs.
- Batch size, number of epochs, and network architecture: govern training dynamics and capacity in neural networks.
These are just a few examples of commonly tuned hyperparameters, and the specific set will depend on the algorithm and model being used. It’s important to understand the significance of each hyperparameter and how it impacts the model’s behaviour in order to tune them effectively for improved performance.
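To illustrate how some of these hyperparameters translate into a concrete search space, here is a small sketch for a support vector machine; the parameter values and dataset are illustrative assumptions:
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
X, y = load_iris(return_X_y=True)
# Commonly tuned SVM hyperparameters: regularisation strength, kernel, and kernel coefficient
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 0.01, 0.1],
}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X, y)
print("Best Hyperparameters: ", grid_search.best_params_)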
Here’s an overview of some of the methods available in scikit-learn for hyperparameter tuning:
- GridSearchCV: exhaustively evaluates every combination of hyperparameters in a predefined grid using cross-validation.
- RandomizedSearchCV: samples a fixed number of hyperparameter combinations from specified distributions or lists.
- HalvingGridSearchCV and HalvingRandomSearchCV: successive-halving variants that quickly discard poorly performing candidates by evaluating them with increasing amounts of data or other resources.
- cross_val_score and validation_curve: lower-level utilities that are useful for manually exploring individual hyperparameter values.
It’s important to note that scikit-learn’s hyperparameter tuning techniques are aimed at traditional machine learning models. For deep learning models or more advanced architectures, specialised tools such as Keras Tuner or Optuna provide hyperparameter tuning functionality on top of frameworks like Keras, PyTorch, or TensorFlow.
Overall, scikit-learn provides a versatile set of tools and techniques for hyperparameter tuning, allowing you to efficiently search for the optimal hyperparameters to improve your machine learning models.
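As an example of one of the methods listed above beyond plain grid search, here is a sketch using scikit-learn’s successive-halving search (HalvingGridSearchCV, which currently requires an experimental import); the estimator, grid, and synthetic dataset are illustrative assumptions:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 - enables the halving estimators
from sklearn.model_selection import HalvingGridSearchCV
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
}
# Successive halving evaluates all candidates on a small budget first,
# then keeps only the best performers for evaluation with a larger budget
halving_search = HalvingGridSearchCV(RandomForestClassifier(random_state=42), param_grid, factor=3, cv=5)
halving_search.fit(X, y)
print("Best Hyperparameters: ", halving_search.best_params_)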
To tune the hyperparameters of a Decision Tree Classifier in Python, you can use scikit-learn’s GridSearchCV to perform an exhaustive search over a predefined grid of hyperparameters, or RandomizedSearchCV to perform a randomised search over specified ranges. Here’s an example of how you can do this:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define the hyperparameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
# Create a Decision Tree Classifier
dt_classifier = DecisionTreeClassifier()
# Perform grid search to find the best hyperparameters
grid_search = GridSearchCV(dt_classifier, param_grid, cv=5)
grid_search.fit(X_train, y_train)
# Print the best hyperparameters and the corresponding accuracy score
print("Best Hyperparameters: ", grid_search.best_params_)
print("Best Accuracy: ", grid_search.best_score_)
In this example, we use the Iris dataset, split it into training and test sets, and define a grid of hyperparameters to search over. The hyperparameters we tune include the criterion (gini or entropy), maximum depth of the tree, minimum samples split, and minimum samples leaf.
We create a DecisionTreeClassifier instance and use GridSearchCV to perform cross-validation on all possible combinations of hyperparameters. The best hyperparameters and corresponding accuracy scores are then printed.
You can modify the hyperparameter grid to include other hyperparameters or change their ranges to suit your needs. Additionally, you can use RandomizedSearchCV instead of GridSearchCV for a randomised search over the hyperparameter space by providing a distribution for each hyperparameter in the param_distributions argument.
Remember to evaluate the performance of the tuned model on a separate test set or through nested cross-validation to obtain a more unbiased estimate of its performance.
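As a sketch of the nested cross-validation mentioned above, the snippet below continues from the example (it reuses X, y, param_grid, and the imports defined earlier) and wraps the same grid search in an outer cross-validation loop, so the reported score is not biased by the hyperparameter selection itself:
from sklearn.model_selection import cross_val_score
# Inner loop: GridSearchCV selects hyperparameters on each training split
# Outer loop: estimates how well the whole tuning procedure generalises
nested_scores = cross_val_score(
    GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5),
    X, y,
    cv=5,
)
print("Nested CV Accuracy: ", nested_scores.mean())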
Deep learning models have a wide range of hyperparameters that can be tuned to optimise their performance. Here are some common hyperparameters in deep learning:
- Learning rate and learning rate schedule
- Batch size
- Number of epochs
- Number of layers and number of units per layer
- Activation functions
- Dropout rate and other regularisation settings (e.g. weight decay)
- Choice of optimiser (e.g. SGD, Adam) and its parameters, such as momentum
It’s important to note that the choice of hyperparameters and their optimal values may vary depending on the specific problem, dataset, and architecture being used. However, hyperparameter tuning techniques like grid search, random search, or Bayesian optimisation can be employed to find the optimal combination of hyperparameters for a given deep learning task.
Here’s an example of hyperparameter tuning for a deep learning model using Keras and scikit-learn:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
# Note: recent TensorFlow releases have removed this wrapper; the scikeras package (scikeras.wrappers.KerasClassifier) is its successor
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV
# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Scale the input features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Create a function to build the Keras model
def create_model(units=32, dropout=0.2):
    model = Sequential()
    model.add(Dense(units, activation='relu', input_shape=(4,)))
    model.add(Dropout(dropout))  # apply the tunable dropout rate for regularisation
    model.add(Dense(units, activation='relu'))
    model.add(Dropout(dropout))
    model.add(Dense(3, activation='softmax'))
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
# Create a KerasClassifier based on the Keras model
model = KerasClassifier(build_fn=create_model, epochs=50, batch_size=8, verbose=0)  # fixed training settings; only units and dropout are tuned below
# Define the hyperparameter grid
param_grid = {
    'units': [16, 32, 64],
    'dropout': [0.1, 0.2, 0.3]
}
# Perform grid search to find the best hyperparameters
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)
grid_search.fit(X_train_scaled, y_train)
# Get the best model and evaluate on the test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print("Best Hyperparameters: ", grid_search.best_params_)
print("Test Accuracy: ", accuracy)
In this example, we use the Iris dataset, split it into training and test sets, and perform standard scaling on the input features. We define a function create_model that builds a simple, fully connected neural network with a configurable number of units and dropout rate. The model is compiled with the Adam optimiser, sparse categorical cross-entropy loss, and accuracy metric.
We then create a KerasClassifier instance based on the Keras model and define a grid of hyperparameters to search over. The hyperparameters we tune include the number of units in the hidden layers and the dropout rate. We use GridSearchCV to perform cross-validation on all possible combinations of hyperparameters.
After the grid search, we obtain the best model based on the best hyperparameters found. We evaluate the best model on the test set and calculate the accuracy score. Finally, we print the best hyperparameters and the test accuracy.
You can modify the hyperparameter grid, model architecture, or other aspects to suit your deep learning task. Additionally, you can explore more advanced techniques like random search or Bayesian optimisation for hyperparameter tuning using libraries such as Optuna or Keras Tuner.
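For instance, here is a rough sketch of the Keras Tuner route mentioned above, assuming the keras_tuner package is installed and continuing from the example (it reuses Sequential, Dense, Dropout, X_train_scaled, and y_train); the search space and trial count are illustrative assumptions:
import keras_tuner as kt
# Build function that declares the search space through the hp object
def build_model(hp):
    model = Sequential()
    model.add(Dense(hp.Int('units', min_value=16, max_value=64, step=16), activation='relu', input_shape=(4,)))
    model.add(Dropout(hp.Float('dropout', min_value=0.1, max_value=0.3, step=0.1)))
    model.add(Dense(3, activation='softmax'))
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    return model
# Randomly sample trials from the declared search space
tuner = kt.RandomSearch(build_model, objective='val_accuracy', max_trials=10, overwrite=True, directory='kt_dir', project_name='iris_tuning')
tuner.search(X_train_scaled, y_train, epochs=50, validation_split=0.2)
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
print("Best units: ", best_hps.get('units'))
print("Best dropout: ", best_hps.get('dropout'))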
To optimise model performance, hyperparameter tuning is crucial in machine learning and deep learning. Selecting the right combination of hyperparameters can improve your models’ accuracy, generalisation, and convergence.
Various techniques can be employed for hyperparameter tuning, such as grid search, random search, or Bayesian optimisation. These techniques allow you to search over a predefined grid or random combinations of hyperparameters to find the optimal values. Libraries like scikit-learn, Keras, and TensorFlow provide tools and functions to facilitate hyperparameter tuning.
When tuning hyperparameters, it’s important to consider your dataset’s specific requirements, characteristics, and model architecture. Different hyperparameters have different effects on model behaviour and performance, so it’s crucial to understand their impact and choose appropriate ranges or distributions for exploration.
Furthermore, it’s essential to evaluate the performance of the tuned models using appropriate validation strategies, such as cross-validation or separate test sets. This helps ensure the selected hyperparameters generalise well to unseen data and provide reliable performance estimates.
Remember that hyperparameter tuning is an iterative process, and it may require multiple rounds of experimentation to find the best hyperparameters. Patience, careful observation, and a systematic approach are key to achieving optimal performance and building robust machine learning or deep learning models.