Top 7 Feature Selection Techniques In ML & How To Guides In Python (sklearn)

What is feature selection in machine learning?

Feature selection is a crucial step in machine learning that involves choosing a subset of relevant features (variables or attributes) from the original set of features to improve model performance and reduce the risk of overfitting. Proper feature selection can lead to more efficient models, faster training times, and better generalization.

Top 7 methods and techniques for feature selection

Here are some common methods and techniques for feature selection in machine learning:

1. Filter Methods:

Correlation: Identify and remove highly correlated features, which often provide redundant information. You can use correlation coefficients like Pearson’s or Spearman’s rank correlation.
Information Gain and Mutual Information: Measure the information gained from each feature in relation to the target variable. Features with low information gain can be removed.

2. Wrapper Methods:

Forward Selection: Start with an empty set of features and iteratively add one feature at a time, selecting the one that most improves model performance.
Backward Elimination: Begin with all features and iteratively remove the one with the most negligible impact on model performance.
Recursive Feature Elimination (RFE): It recursively removes the most minor essential feature until a desired number of features is reached.

3. Embedded Methods:

Many machine learning algorithms, such as decision trees and L1-regularized models like Lasso, perform feature selection as part of their training process. They automatically assign importance scores to features, which you can use for selection.

4. Regularization Techniques:

L1 Regularization (Lasso): Encourages sparsity by penalizing the absolute values of feature coefficients, leading some coefficients to become precisely zero.
L2 Regularization (Ridge): Tends to shrink feature coefficients toward zero without making them precisely zero. It can still help in feature selection indirectly.

5. Feature Importance:

Random Forests, Gradient Boosting Machines, and other ensemble methods provide feature importance scores based on how often features are used in building decision trees or boosting iterations. You can use these scores for feature selection.

5. Dimensionality Reduction:

Principal Component Analysis (PCA) and other dimensionality reduction techniques can reduce the number of features while retaining most information. However, this often comes at the cost of interpretability.

6. Domain Knowledge:

Sometimes, domain expertise can help you select relevant features based on your problem understanding. This approach is especially valuable when dealing with domain-specific data.

7. Feature Engineering:

Creating new features that are combinations of existing features or derived from domain knowledge can sometimes be more informative and practical than selecting from the original features.

When performing feature selection, it’s essential to consider the trade-off between reducing dimensionality and preserving important information. It would be best to use appropriate evaluation metrics and cross-validation to assess the impact of feature selection on your model’s performance and avoid overfitting. Experimenting with different feature selection techniques and combinations is often necessary to find the best approach for your specific machine learning problem.

Correlation-based feature selection

Correlation-based feature selection is a technique for selecting relevant features from a dataset based on their correlation with the target variable or each other. It aims to identify and retain the most informative features while removing redundant or irrelevant ones. This method is particularly useful for improving model performance and reducing dimensionality in situations where you have many features.

Correlation-based feature selection aims to identify and retain the most informative features while removing redundant or irrelevant ones.

Here’s how correlation-based feature selection works:

1. Compute Correlation with the Target Variable:

Calculate the correlation between each feature and the target variable. Common correlation coefficients include Pearson’s correlation (for continuous target variables) or point-biserial correlation (for binary target variables).
Features with high absolute correlation values (positive or negative) are considered more relevant to the target variable.

2. Thresholding:

Set a correlation threshold below which features are considered less relevant and above which are considered more suitable.
Features with correlation values below the threshold are candidates for removal.

3. Remove Redundant Features:

Calculate the correlation between pairs of features (feature-feature correlation).
Identify feature pairs with high correlation values and remove one of the features from each highly correlated pair. This step helps eliminate redundancy and multicollinearity.

4. Repeat as Needed:

You can iteratively adjust the correlation threshold and repeat the process until you achieve the desired features or correlation levels.

5. Evaluate Model Performance:

After feature selection, train your machine learning model using the selected features and evaluate its performance using appropriate metrics and cross-validation techniques.

6. Interpretability Considerations:

Remember that while correlation-based feature selection can help improve model performance and reduce dimensionality, it may not always preserve the interpretability of the model, as some context about the features’ meanings may be lost.

Here’s a basic example in Python using the pandas library to perform correlation-based feature selection:

import pandas as pd

# Load your dataset
data = pd.read_csv('your_dataset.csv')

# Compute correlations with the target variable (assuming 'target' is the target variable column)
correlations = data.corr()['target'].abs()

# Set a correlation threshold (e.g., 0.3)
threshold = 0.3

# Select features with correlations above the threshold
selected_features = correlations[correlations > threshold].index.tolist()

# Optionally, remove redundant features based on feature-feature correlations
selected_data = data[selected_features]

# Train your machine learning model using selected_data

Remember that choosing a correlation threshold is essential and depends on your problem and dataset. You may need to experiment with different thresholds to find the right balance between feature retention and dimensionality reduction.

Information Gain and Mutual Information

Information Gain and Mutual Information are essential concepts in information theory and feature selection, mainly when dealing with categorical data or discrete variables. These measures help quantify a feature’s information about the target variable or the relationship between two variables. Let’s explore each concept:

Information Gain

Information Gain is used in decision tree-based algorithms, such as ID3 and C4.5, to select the best feature for splitting a dataset based on how much it reduces uncertainty (entropy) in predicting the target variable. It’s particularly useful for classification problems with categorical features.

Entropy (H(S)): Entropy measures the impurity or uncertainty in a dataset before any split. The formula for entropy is:

H(S) = -Σ(p_i * log2(p_i))

Where p_i is the proportion of instances in class i within the dataset.

Conditional Entropy (H(S|X)): Conditional entropy measures the remaining uncertainty after a dataset is split by a specific feature X. It is calculated as follows:

H(S|X) = Σ((|S_v|/|S|) * H(S_v))

Where S_v is the subset of data for which feature X has the value v.

Information Gain: Information Gain (IG) is the reduction in entropy achieved by splitting the data on a particular feature X. It is calculated as:

IG(X) = H(S) - H(S|X)

Mutual Information

Mutual Information is a broader concept that measures the dependency or information shared between two random variables. It can be used for both categorical and continuous data. In feature selection, Mutual Information quantifies how much information one feature provides about another feature or the target variable.

Mutual Information (MI): The Mutual Information between two random variables, X and Y, is given by:

MI(X, Y) = ΣΣ(p(x, y) * log2(p(x, y) / (p(x) * p(y))))

Normalized Mutual Information (NMI): NMI is often used to scale the MI values between 0 and 1 for easier interpretation. It is calculated as:

NMI(X, Y) = MI(X, Y) / sqrt(H(X) * H(Y))

In feature selection, you can use Information Gain or Mutual Information to rank or score features based on their relevance to the target variable. Features with higher scores are considered more informative and are often selected for inclusion in the model.

Python libraries like scikit-learn provide functions for calculating Information Gain (e.g., mutual_info_classif for classification and mutual_info_regression for regression) and Mutual Information (e.g., mutual_info_score). These can be handy when performing feature selection on your datasets.

Minimum Redundancy Maximum Relevance (mRMR)

Minimum Redundancy Maximum Relevance (mRMR) is a feature selection method used in machine learning and data analysis to select a subset of highly relevant features to the target variable and minimally redundant. The goal is to improve the efficiency and effectiveness of machine learning models by reducing dimensionality while retaining critical information.

The fundamental principles of mRMR feature selection are:

Maximum Relevance: This part of mRMR focuses on selecting highly correlated features with the target variable. Features that provide the most information about predicting the target variable are considered the most relevant.
Minimum Redundancy: While selecting relevant features is essential, avoiding redundancy among the selected features is equally important. Redundant features provide similar information and can lead to overfitting. mRMR aims to choose diverse and complementary features in their information content.

The mRMR feature selection process typically involves the following steps:

Compute Relevance Scores: Calculate the relevance score for each feature by measuring its association with the target variable. Common metrics used for relevance include mutual information, information gain, or correlation coefficients.
Compute Redundancy Scores: Calculate the redundancy score for each feature by measuring the pairwise correlations or similarities between features. Features that are highly correlated with each other are considered redundant.
Combine Relevance and Redundancy Scores: Typically, mRMR uses a formula that combines the relevance and redundancy scores to compute an overall mRMR score for each feature. The formula varies depending on the specific variant of mRMR being used.
Select Features: Rank features based on their mRMR scores. The top-ranked features are selected for inclusion in the final feature subset.
Iterative approach: mRMR can be used iteratively. After selecting the first feature, you can repeat the process, considering the previously selected feature(s) and the remaining features to find the next most relevant and non-redundant feature.
Stopping Criteria: You can decide on a stopping criterion, such as a fixed number of selected features or a predefined threshold for the mRMR score.

mRMR is particularly useful when dealing with high-dimensional datasets, such as gene expression data, text data, or any domain where feature selection is essential for building efficient and interpretable models. It helps strike a balance between selecting informative features and avoiding multicollinearity.

Several software packages and libraries, such as mRMR Toolbox in MATLAB and Python libraries like PyMRMR, provide implementations of mRMR algorithms that you can use for feature selection in your machine learning projects.

Wrapper methods

Wrapper methods for feature selection are a class of techniques that involve selecting subsets of features based on how well they contribute to the performance of a specific machine learning model. Unlike filter methods, which rely on statistical measures or feature ranking, wrapper methods use a predictive model’s performance as a criterion to evaluate feature subsets. While wrapper methods can be computationally expensive due to the need to train and evaluate multiple models, they can lead to highly relevant feature subsets for specific modelling tasks. Here are some common wrapper methods:

1. Forward Selection:

Start with an empty set of features.
Iteratively add one feature at a time, selecting the feature that improves model performance the most.
Continue until a predefined stopping criterion is met (e.g., a specific number of features are selected or performance no longer improves).

2. Backward Elimination:

Start with all available features.
Iteratively remove one feature at a time, eliminating the feature with the least impact on model performance.
Continue until a predefined stopping criterion is met.

3. Recursive Feature Elimination (RFE):

Begin with all features and train the model.
Rank features based on their importance scores (e.g., feature coefficients in linear models or feature importances in tree-based models).
Remove the least essential feature and retrain the model.
Repeat until the desired number of features is reached, or performance no longer improves.

4. Bidirectional Search:

Combines forward selection and backward elimination to explore feature subsets in both directions.
Start with an empty set and add features that improve performance.
After adding a feature, evaluate whether removing current features can improve performance.

5. Stepwise Selection:

A combination of forward selection and backward elimination.
Similar to bidirectional search, it includes forward and backwards steps within the same iteration.

6. Genetic Algorithms:

It uses evolutionary algorithms to search for the best feature subset.
Operates by creating a population of feature subsets, evaluating their performance, and evolving the population over generations through processes like mutation and crossover.

7. Cross-Validation in Wrappers:

Cross-validation is often used within wrapper methods to obtain robust estimates of model performance during feature selection.
The model is trained and evaluated on different subsets of the data in each cross-validation fold, and the average performance is used to assess the feature subset’s quality.

Wrapper methods are beneficial when you have a specific machine learning model or algorithm in mind and you want to select the most relevant features for that model. However, they can be computationally intensive, especially when dealing with many features, as they require training and evaluating multiple models with different feature subsets.

Feature selection in Python with Scikit-learn (sklearn)

Scikit-learn (or sklearn) is a popular Python library for machine learning, and it provides various tools and methods for feature selection. You can use scikit-learn’s feature selection techniques to preprocess your data and improve the performance of your machine learning models. Here are some common feature selection methods available in scikit-learn:

1. VarianceThreshold:

This filter method removes features with low variance, assuming features with slight variation don’t provide much information.
Usage:

from sklearn.feature_selection import VarianceThreshold 

selector = VarianceThreshold(threshold=0.1) # Set an appropriate threshold 
X_new = selector.fit_transform(X)

2. SelectKBest:

This filter method selects the top k features based on statistical tests, such as chi-squared for classification tasks or ANOVA for regression tasks.
Usage:

from sklearn.feature_selection import SelectKBest, chi2 

selector = SelectKBest(score_func=chi2, k=5) # Choose an appropriate score function and k 
X_new = selector.fit_transform(X, y) # For classification tasks, provide the target variable y

3. SelectPercentile:

Like SelectKBest, it selects the top features based on a specified percentile of the highest-scoring features.
Usage:

from sklearn.feature_selection import SelectPercentile, chi2 

selector = SelectPercentile(score_func=chi2, percentile=10) # Choose an appropriate score function and percentile 
X_new = selector.fit_transform(X, y) # For classification tasks, provide the target variable y

4. SelectFromModel:

This method selects features based on the feature importances computed by a machine learning model. It’s often used with tree-based models like Random Forest or gradient boosting.
Usage:

from sklearn.feature_selection import SelectFromModel 
from sklearn.ensemble import RandomForestClassifier 

# Use an appropriate model 
selector = SelectFromModel(RandomForestClassifier()) # Specify the model 
X_new = selector.fit_transform(X, y) # For classification tasks, provide the target variable y

5. RFE (Recursive Feature Elimination):

This wrapper method recursively removes the least essential features based on a specified estimator’s coefficients or feature importances.
Usage:

from sklearn.feature_selection import RFE 
from sklearn.linear_model import LogisticRegression 

# Use an appropriate estimator 
selector = RFE(estimator=LogisticRegression(), n_features_to_select=5) # Specify the estimator and desired number of features 
X_new = selector.fit_transform(X, y) # For classification tasks, provide the target variable y

6. SelectByModel:

This method allows you to specify a model (e.g., L1-regularized linear regression) to select features based on their coefficients.
Usage:

from sklearn.feature_selection import SelectByModel 
from sklearn.linear_model import Lasso 

# Use an appropriate model 
selector = SelectByModel(estimator=Lasso(alpha=0.01)) # Specify the model and hyperparameters 
X_new = selector.fit_transform(X, y) # For classification tasks, provide the target variable y

These are some of the feature selection methods available in scikit-learn. Depending on your problem and dataset, you can choose the most suitable method for your feature selection needs. Consider each method’s specific requirements and assumptions and tune the hyperparameters accordingly.

Conclusion

Feature selection is crucial in machine learning and data analysis. Selecting the correct subset of features can profoundly impact your models’ performance, efficiency, and interpretability. Here are some key takeaways:

Feature selection aims to reduce dimensionality: By choosing a subset of relevant features, you can simplify your data and potentially improve your machine learning models’ generalization and predictive power.
Different feature selection methods: There are various feature selection methods, including filter methods, wrapper methods, and embedded methods. Each method has its strengths and weaknesses, and the choice of method depends on your dataset, problem, and modelling goals.
Relevance and redundancy: The balance between selecting relevant features for your target variable and avoiding redundant features is essential. Methods like mRMR focus on this balance by maximizing relevance and minimizing redundancy.
Domain knowledge matters: In many cases, domain knowledge and a deep understanding of the problem can guide effective feature selection. Combining domain expertise with automated methods can lead to better results.
Scikit-learn offers powerful tools: The scikit-learn library provides a wide range of feature selection techniques, making it easier to implement and experiment with different methods in Python.
Evaluation is critical: Regardless of the feature selection method you choose, it’s essential to evaluate the impact of feature selection on your model’s performance using appropriate metrics and cross-validation techniques.

In practice, feature selection is not a one-size-fits-all process. It often involves experimentation and a deep understanding of your data and problem domain. Careful feature selection can lead to more interpretable models, shorter training times, and improved model performance, ultimately contributing to better decision-making and insights from your data.

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.