Active learning is a machine learning technique in which the most informative examples from an unlabeled dataset are iteratively selected and labelled to improve the performance of a predictive model. The key idea behind active learning is to intelligently choose which examples to label rather than randomly selecting samples or labelling the entire dataset. This approach aims to reduce the amount of labelled data required to achieve a desired level of model accuracy.
In traditional machine learning, a model is trained using a fixed dataset with labelled examples. However, in many scenarios, labelling data can be expensive or time-consuming. Active learning addresses this challenge by focusing labelling effort on the most valuable instances, which can lead to better model performance with fewer labelled examples. A typical active learning workflow proceeds as follows:
1. Initialization: A small random subset of the unlabeled data is initially selected and labelled.
2. Model Training: A machine learning model is trained using the labelled data.
3. Query Strategy: A query strategy selects the most informative instances from the remaining unlabeled data. The goal is to choose examples that are difficult or uncertain for the current model, as these instances are likely to provide the most valuable information. Common query strategies include uncertainty sampling, entropy sampling, and query by committee, all described in more detail below.
4. Labelling: The selected instances are labelled by an oracle (human annotator or domain expert).
5. Model Update: The labelled instances are added to the training set, and the model is retrained using the updated dataset.
6. Repeat: Steps 3 to 5 are repeated iteratively, with the model becoming progressively more accurate and requiring fewer labelled examples over time.
Active learning is thus an iterative process, with model performance improving as more informative examples are added to the training set.
Active learning is instrumental in scenarios with limited labelled data, such as medical diagnosis, text classification, and image recognition. By selecting examples strategically, active learning can achieve performance comparable to traditional methods that require much larger labelled datasets.
However, active learning is not a one-size-fits-all solution, and its effectiveness depends on factors such as the choice of query strategy, the nature of the dataset, and the type of model being used. It’s essential to carefully design the active learning process to achieve the best results for a specific problem.
Active learning can also be applied to deep learning, extending the principles of selecting informative instances for labelling to neural networks and other deep architectures. The fundamental idea remains the same: to iteratively identify the most instructive examples to label and to train a deep learning model using a smaller amount of labelled data while maintaining or improving its performance.
Applying active learning to deep learning has the potential to achieve similar benefits as in other machine learning paradigms, including faster convergence to high performance and the ability to significantly reduce the labelled dataset size. However, there are challenges specific to active learning in deep learning, such as the computational cost of retraining large networks after every query round and the difficulty of obtaining well-calibrated uncertainty estimates from neural networks.
Incorporating active learning into the deep learning workflow requires carefully considering the query strategy, model architecture, and available resources. While active learning might not always lead to massive reductions in labelled data, it can still provide valuable efficiency gains and help improve model generalization.
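To illustrate the uncertainty-estimation challenge, the snippet below sketches how predictive uncertainty can be approximated for a neural network with Monte Carlo dropout, a common technique in deep active learning. This is a minimal sketch, assuming a PyTorch classifier that contains dropout layers; the function name and the number of forward passes are illustrative choices, not part of any particular library's API.

import torch

def mc_dropout_uncertainty(model, inputs, n_passes=20):
    """Score unlabeled instances by the entropy of the mean predictive
    distribution over several stochastic forward passes (MC dropout)."""
    model.train()  # keep dropout layers active at inference time
    probs = []
    with torch.no_grad():
        for _ in range(n_passes):
            probs.append(torch.softmax(model(inputs), dim=1))
    mean_probs = torch.stack(probs).mean(dim=0)
    # Higher entropy means the stochastic passes agree less on the label
    return -(mean_probs * torch.log(mean_probs + 1e-12)).sum(dim=1)

Instances with the highest scores would then be sent to the oracle for labelling.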
Query strategies are methods used in active learning to select the most informative instances from the unlabeled dataset for labelling. These strategies determine which examples will likely provide the most valuable information to the model’s learning process. The effectiveness of a query strategy directly impacts the efficiency and performance gains achieved through active learning. Here are some commonly used query strategies:
1. Uncertainty Sampling: Select the instances the model is least confident about, for example those whose highest predicted class probability is lowest (see the sketch after this list).
2. Entropy Sampling: Select the instances whose predicted class distribution has the highest entropy, i.e. where the probability mass is spread most evenly across classes.
3. Query by Committee: Train a committee of models and select the instances on which the committee members disagree the most.
4. Expected Model Change: Select the instances that would cause the largest change to the current model, for example the largest gradient update, if their labels were known.
5. Density-Based Sampling: Favour instances that lie in dense regions of the input space, so that queries are representative of the underlying data distribution.
6. Core-Set Selection: Choose a subset of instances that best covers the feature space of the full dataset, so the labelled set approximates the whole.
7. Bayesian Methods: Use posterior uncertainty estimates, for example from Bayesian neural networks or Monte Carlo dropout, to score candidate instances.
8. Variation Ratios: Measure the fraction of stochastic predictions that disagree with the most frequent (modal) prediction for an instance.
9. Information Density: Weight an informativeness score, such as uncertainty, by how representative the instance is of the unlabeled pool.
10. Cluster-Based Strategies: Cluster the unlabeled data and sample from different clusters to ensure diverse, representative queries.
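To make a few of these strategies concrete, here is a minimal sketch of three scoring functions in NumPy, assuming the model exposes class probabilities (as Scikit-learn's predict_proba does); the function names are illustrative, not part of any library.

import numpy as np

def least_confidence(probs):
    """Uncertainty sampling: one minus the top class probability."""
    return 1.0 - probs.max(axis=1)

def entropy_score(probs):
    """Entropy sampling: entropy of the predicted class distribution."""
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def information_density(probs, representativeness):
    """Information density: uncertainty weighted by representativeness,
    where representativeness[i] is, e.g., the mean similarity of
    instance i to the rest of the unlabeled pool."""
    return entropy_score(probs) * representativeness

# Example usage: pick the k highest-scoring unlabeled instances
# probs = model.predict_proba(X_unlabeled)   # shape (n_samples, n_classes)
# query_indices = np.argsort(entropy_score(probs))[-k:]

In all three cases, higher scores indicate more informative instances.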
The choice of query strategy depends on factors such as the dataset characteristics, the model architecture, and the available computational resources. It’s common to experiment with multiple query strategies and evaluate their performance on a validation set to determine the most effective method for a specific problem. Also, hybrid approaches that combine various techniques or adaptively switch between them can be practical in active learning scenarios.
While active learning offers numerous benefits for optimizing data labelling efforts and improving model performance, its limitations and challenges must also be considered. Among the most important are the cost and reliability of the labelling oracle, the sampling bias a query strategy can introduce, sensitivity to the quality of the initial labelled set, and the computational overhead of repeatedly retraining the model.
Despite these limitations, active learning remains valuable in the machine learning and deep learning toolbox. By understanding these challenges and carefully designing the active learning process, you can effectively harness its benefits and overcome its limitations.
Implementing active learning in machine learning using Python involves integrating active learning strategies into your workflow. Here’s a high-level guide on how to get started:
1. Data Preparation: Load your dataset and split it into a small initial labelled set, a large unlabeled pool, and a held-out validation set.
2. Model Selection and Training: Choose a model (for example, a Scikit-learn classifier) and train it on the initial labelled set.
3. Implement Query Strategies: Write a scoring function that ranks the unlabeled instances by how informative they are expected to be.
4. Active Learning Loop: Repeatedly select a batch of high-scoring instances, obtain their labels from the oracle, and add them to the training set.
5. Model Update and Evaluation: Retrain the model on the enlarged labelled set and track its performance on the validation set.
Here’s a simplified code example demonstrating how to implement active learning using Scikit-learn for a binary classification problem:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np
# Load your dataset and split into initial labeled and unlabeled sets
X, y = load_data()
X_labeled, X_unlabeled, y_labeled, _ = train_test_split(X, y, test_size=0.8, stratify=y)
# Initialize model and train on initial labeled data
model = LogisticRegression()
model.fit(X_labeled, y_labeled)
# Define the active learning loop
num_iterations = 10
batch_size = 10
for iteration in range(num_iterations):
    # Query strategy: uncertainty sampling (least confident instances first)
    probabilities = model.predict_proba(X_unlabeled)
    confidence_scores = np.max(probabilities, axis=1)
    query_indices = np.argsort(confidence_scores)[:batch_size]

    # Label the selected instances via the oracle
    # (get_labels_for_instances is a placeholder for your annotation step)
    labeled_instances = X_unlabeled[query_indices]
    labeled_labels = get_labels_for_instances(labeled_instances)

    # Update the labeled and unlabeled datasets
    X_labeled = np.concatenate((X_labeled, labeled_instances), axis=0)
    y_labeled = np.concatenate((y_labeled, labeled_labels), axis=0)
    X_unlabeled = np.delete(X_unlabeled, query_indices, axis=0)

    # Retrain the model on the updated labeled dataset
    model.fit(X_labeled, y_labeled)

    # Evaluate on a held-out validation set (X_validation, y_validation not shown)
    validation_accuracy = accuracy_score(y_validation, model.predict(X_validation))
    print(f"Iteration {iteration+1}, Validation Accuracy: {validation_accuracy:.4f}")
Remember that this is a basic example, and you can make many variations and enhancements based on your specific problem, dataset, and model choice. To fine-tune your active learning implementation, you can experiment with different query strategies, model architectures, and evaluation metrics.
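For instance, the uncertainty-sampling step above could be swapped for a query-by-committee strategy. The following is one possible sketch, assuming a small committee of Scikit-learn classifiers; vote entropy is used as the disagreement measure, and the helper name is illustrative.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def committee_disagreement(X_labeled, y_labeled, X_unlabeled):
    """Query by committee: score unlabeled instances by the vote entropy
    of a committee of models trained on the current labelled set."""
    committee = [LogisticRegression(max_iter=1000),
                 RandomForestClassifier(n_estimators=50)]
    votes = []
    for member in committee:
        member.fit(X_labeled, y_labeled)
        votes.append(member.predict(X_unlabeled))
    votes = np.stack(votes)                # shape (n_members, n_samples)
    scores = np.zeros(votes.shape[1])
    for c in np.unique(y_labeled):
        frac = (votes == c).mean(axis=0)   # fraction of votes for class c
        scores -= frac * np.log(frac + 1e-12)
    return scores                          # higher = more disagreement

# Drop-in replacement for the query step in the loop above:
# scores = committee_disagreement(X_labeled, y_labeled, X_unlabeled)
# query_indices = np.argsort(scores)[-batch_size:]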
Implementing active learning using PyTorch involves steps similar to the previous example but tailored to the PyTorch library. Here’s a general outline of how to implement active learning using PyTorch for a binary classification problem:
1. Data Preparation:
2. Model Selection and Training:
3. Implement Query Strategies:
4. Active Learning Loop:
5. Model Update and Evaluation:
Here’s a simplified code example demonstrating how to implement active learning using PyTorch:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Define your PyTorch model class
class SimpleClassifier(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleClassifier, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x
# Load and preprocess your data, split into labeled and unlabeled sets
# (load_data and get_labels_for_instances are placeholders for your own code)
X, y = load_data()
X_labeled, X_unlabeled, y_labeled, _ = train_test_split(X, y, test_size=0.8, stratify=y)

# Convert data to PyTorch tensors
X_labeled_tensor = torch.tensor(X_labeled, dtype=torch.float32)
y_labeled_tensor = torch.tensor(y_labeled, dtype=torch.long)
X_unlabeled_tensor = torch.tensor(X_unlabeled, dtype=torch.float32)

# Initialize model, loss function, and optimizer
# (input_size, hidden_size, and output_size must match your data)
model = SimpleClassifier(input_size, hidden_size, output_size)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Active learning loop
num_iterations = 10
batch_size = 10
num_epochs = 5

for iteration in range(num_iterations):
    # Query strategy: uncertainty sampling (lowest top-class probability).
    # Note that the first query uses the untrained model; you may wish to
    # train on the initial labelled set before the loop.
    model.eval()
    with torch.no_grad():
        probabilities = torch.softmax(model(X_unlabeled_tensor), dim=1)
    confidence_scores = torch.max(probabilities, dim=1)[0]
    query_indices = confidence_scores.argsort()[:batch_size]

    # Label the selected instances via the oracle
    # (should return a tensor of dtype torch.long)
    labeled_instances = X_unlabeled_tensor[query_indices]
    labeled_labels = get_labels_for_instances(labeled_instances)

    # Update the labeled and unlabeled datasets
    X_labeled_tensor = torch.cat((X_labeled_tensor, labeled_instances), dim=0)
    y_labeled_tensor = torch.cat((y_labeled_tensor, labeled_labels), dim=0)
    mask = torch.ones(X_unlabeled_tensor.size(0), dtype=torch.bool)
    mask[query_indices] = False
    X_unlabeled_tensor = X_unlabeled_tensor[mask]

    # Rebuild the DataLoader so it includes the newly labeled data
    labeled_dataset = TensorDataset(X_labeled_tensor, y_labeled_tensor)
    labeled_dataloader = DataLoader(labeled_dataset, batch_size=64, shuffle=True)

    # Retrain the model on the updated labeled dataset
    model.train()
    for epoch in range(num_epochs):
        for batch_x, batch_y in labeled_dataloader:
            optimizer.zero_grad()
            outputs = model(batch_x)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()

    # Evaluate the model on a validation set
    validation_accuracy = evaluate_model(model, X_validation_tensor, y_validation_tensor)
    print(f"Iteration {iteration+1}, Validation Accuracy: {validation_accuracy:.4f}")
This code provides the basic structure for implementing active learning using PyTorch. You’ll need to replace the placeholders with your actual data, implement your query strategy, and define the evaluate_model function based on your evaluation needs.
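As an example, evaluate_model might be defined along the following lines; this is a minimal sketch for the accuracy metric printed above, assuming the validation tensors fit in memory.

def evaluate_model(model, X_val, y_val):
    """Compute classification accuracy on a held-out validation set."""
    model.eval()
    with torch.no_grad():
        predictions = model(X_val).argmax(dim=1)
    return (predictions == y_val).float().mean().item()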
Remember that the above code is meant to serve as a starting point, and you may need to adapt it to your specific problem and dataset characteristics.
Active learning is a powerful technique that can significantly improve the efficiency of machine learning and deep learning workflows by strategically selecting the most informative instances for labelling. By iteratively querying the most valuable examples, active learning reduces the amount of labelled data required to achieve high model performance. It is advantageous in scenarios where labelling data is expensive or time-consuming.
The key takeaways from this discussion are that active learning reduces labelling costs by querying only the most informative instances, that the choice of query strategy largely determines how effective it is, and that the same iterative loop applies to both classical machine learning models and deep neural networks.
In summary, active learning presents a practical solution to optimizing data labelling efforts, allowing you to build accurate models with fewer labelled examples. However, successful implementation requires a good understanding of the problem domain, the available data, and the specific machine learning techniques you’re using.