Dimensionality Reduction: Top 5 Techniques & How To Tutorial

What is dimensionality reduction in machine learning?

Dimensionality reduction is a technique used in machine learning and data analysis to reduce the number of features or variables in a dataset while preserving as much relevant information as possible. High-dimensional data can be challenging due to increased computational complexity, overfitting risk, and visualization challenges. Dimensionality reduction methods address these issues by transforming the data into a lower-dimensional representation.

There are two main approaches to dimensionality reduction:

Feature Selection: In feature selection, you choose a subset of the original features to keep and discard the rest. The goal is to retain the most informative features while eliminating redundant or irrelevant ones. Feature selection methods include techniques like mutual information, correlation-based methods, and recursive feature elimination.
Feature Extraction: Feature extraction involves creating new features that are combinations or transformations of the original features. This is typically achieved through linear or non-linear transformations. Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) are popular examples of feature extraction techniques.
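To make the distinction concrete, here is a minimal sketch using scikit-learn. The choice of the Iris dataset, mutual information as the selection score, and two output features are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)  # 150 samples, 4 original features

# Feature selection: keep 2 of the 4 original columns, values unchanged.
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept_columns = selector.get_support(indices=True)

# Feature extraction: build 2 new columns, each a linear combination of all 4.
X_extracted = PCA(n_components=2).fit_transform(X)

print("Indices of the selected columns:", kept_columns)
print("Shape after selection:", X_selected.shape)
print("Shape after extraction:", X_extracted.shape)
```

Both results have two columns, but the selected columns are still original measurements, while the extracted columns are new variables that no longer map to a single original feature.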

Top 5 dimensionality reduction techniques

Several methods stand out for their effectiveness and widespread use in the vast landscape of dimensionality reduction techniques. Each technique has strengths and weaknesses that suit it to particular data characteristics and problem domains. This section will explore five prominent dimensionality reduction techniques:

1. Principal Component Analysis (PCA)

Principal Component Analysis, commonly called PCA, is a linear technique that transforms the data into a new set of uncorrelated variables called principal components. The components are ordered by variance: the first captures the direction of greatest variance in the data, the second captures the greatest remaining variance, and so on.

How the algorithm works:

Mean Centering: Subtract the mean from each feature to centre the data.
Covariance Matrix: Compute the covariance matrix of the centred data.
Eigendecomposition: Calculate the eigenvectors and eigenvalues of the covariance matrix.
Selecting Components: Sort the eigenvectors by their corresponding eigenvalues in decreasing order and keep the top k. These k eigenvectors are the principal components you retain.
Projection: Project the original data onto the selected principal components to obtain the reduced-dimensional representation.
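The five steps above can be sketched directly in NumPy. The toy data, the random seed, and the choice of k = 2 are assumptions made for illustration; in practice a library implementation such as scikit-learn's PCA is usually preferable.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples, 3 features, two of them strongly correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# 1. Mean centring: subtract the per-feature mean.
X_centred = X - X.mean(axis=0)

# 2. Covariance matrix of the centred data (features x features).
cov = np.cov(X_centred, rowvar=False)

# 3. Eigendecomposition. eigh is meant for symmetric matrices and
#    returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort in decreasing order of eigenvalue and keep the top k eigenvectors.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]
k = 2
components = eigenvectors[:, :k]

# 5. Projection onto the selected components.
X_reduced = X_centred @ components

explained_ratio = eigenvalues[:k] / eigenvalues.sum()
print("Reduced shape:", X_reduced.shape)
print("Explained variance ratio:", explained_ratio)
```

The variance of each column of X_reduced equals the corresponding eigenvalue, which is why sorting by eigenvalue is the same as sorting by captured variance.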

PCA is widely used for feature compression, noise reduction, and data visualization. It simplifies complex data while retaining its essential structure, making it particularly valuable for exploratory analysis.

2. t-Distributed Stochastic Neighbor Embedding (t-SNE)

Unlike PCA, t-Distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear technique primarily used for visualization. It focuses on preserving pairwise similarities: points that are similar in the high-dimensional space should end up close together in the low-dimensional map.

How the algorithm works:

Similarities: Calculate pairwise similarities between data points in the high-dimensional space by converting distances into probabilities with a Gaussian kernel, so that similar points receive a higher probability.
Student’s t-Distribution: In the low-dimensional space, measure pairwise similarities with a Student’s t-distribution with one degree of freedom. Its heavy tails allow dissimilar points to sit far apart in the map, which reduces crowding.
Low-Dimensional Map: Initialize a low-dimensional map (usually 2D or 3D) and define the t-distribution-based probabilities for the same data points in that space.
Minimizing Divergence: Optimize the positions of data points in the low-dimensional space, typically by gradient descent, to reduce the Kullback-Leibler divergence between the two probability distributions.
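In practice these steps are rarely coded by hand. The following is a minimal sketch using scikit-learn's TSNE class; the digits dataset, the 500-sample subset, the perplexity of 30, and the PCA initialization are all illustrative assumptions that you would tune for your own data.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 8x8 digit images flattened to 64 features; a subset keeps the run short.
X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]

# perplexity roughly controls how many neighbours each point attends to.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
embedding = tsne.fit_transform(X)

print("Embedding shape:", embedding.shape)
print("Final KL divergence:", tsne.kl_divergence_)
```

Note that TSNE has no separate transform method for new points: the map is learned for the samples passed to fit_transform, which is one reason the technique is used mainly for visualization.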

t-SNE is exceptional at revealing patterns, clusters, and structures in data that might be difficult to discern in higher dimensions. It’s commonly used for visualizing high-dimensional datasets.

t-SNE dimension reduction of the MNIST data set. Source: Wikipedia

3. Autoencoders

Autoencoders are a type of neural network architecture used for unsupervised learning. They consist of an encoder network that compresses the input data into a lower-dimensional representation and a decoder network that reconstructs the original data from the compressed representation.

The Architecture:

Encoder: The encoder maps the input data to a lower-dimensional latent space representation.
Bottleneck Layer: The bottleneck layer is the narrowest layer of the network, sitting between the encoder and the decoder. Its activations are the compressed representation.
Decoder: The decoder reconstructs the data from the compressed representation.
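Autoencoders are normally built with a deep learning framework. To keep the illustration dependency-light, the sketch below instead uses scikit-learn's MLPRegressor trained to reproduce its own input, which is an assumption of this example rather than a standard recipe. The single hidden layer of 8 tanh units plays the role of the bottleneck, and the names bottleneck_size, codes and reconstruction are hypothetical.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPRegressor

X, _ = load_digits(return_X_y=True)
X = X / 16.0  # pixel values range from 0 to 16; rescale to [0, 1]

bottleneck_size = 8  # assumed size of the compressed representation

# Training target equals the input, so the network must pass the data
# through the narrow hidden layer and reconstruct it on the other side.
autoencoder = MLPRegressor(
    hidden_layer_sizes=(bottleneck_size,),
    activation="tanh",
    max_iter=500,
    random_state=0,
)
autoencoder.fit(X, X)

# Encoder: input layer -> bottleneck activations (64 features -> 8).
codes = np.tanh(X @ autoencoder.coefs_[0] + autoencoder.intercepts_[0])

# Decoder: bottleneck -> reconstruction (8 -> 64 features).
reconstruction = autoencoder.predict(X)

reconstruction_mse = np.mean((X - reconstruction) ** 2)
baseline_mse = np.mean((X - X.mean(axis=0)) ** 2)  # always predict the mean image
print("Code shape:", codes.shape)
print("Reconstruction MSE:", reconstruction_mse, "baseline MSE:", baseline_mse)
```

The codes array is the reduced-dimensional representation. Comparing the reconstruction error against the mean-image baseline gives a rough sense of how much information survives the bottleneck.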

Autoencoders are versatile tools for dimensionality reduction and feature learning. They can capture complex relationships in data and are often used for denoising data, generating novel samples, and reducing dimensionality.

4. Linear Discriminant Analysis (LDA)

Linear Discriminant Analysis (LDA) is a supervised technique primarily used for classification tasks. Unlike PCA, which ignores class labels, LDA uses them to find a projection that maximizes the separation between different classes in the dataset.

How the algorithm works:

Compute Class Means and Scatter Matrices: Calculate the mean vectors of each class and the scatter matrices (within-class and between-class scatter matrices).
Eigenvalue Decomposition: Perform eigenvalue decomposition on the inverse of the within-class scatter matrix multiplied by the between-class scatter matrix.
Selecting Discriminant Directions: Choose the eigenvectors corresponding to the largest eigenvalues to form the discriminant directions. With C classes, at most C - 1 directions carry discriminative information.
Projection: Project the data onto the selected discriminant directions to create the reduced-dimensional representation.
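As a brief sketch of these steps in practice, scikit-learn's LinearDiscriminantAnalysis can be used as a supervised transformer. The Iris dataset is an assumption chosen for illustration; it has three classes, so at most two discriminant directions are available.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 4 features, 3 classes

# Unlike PCA, fitting requires the class labels y.
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print("Reduced shape:", X_lda.shape)
print("Share of between-class variance per direction:", lda.explained_variance_ratio_)
```

Asking for more than two components here would raise an error, because the number of discriminant directions is bounded by the number of classes minus one.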

LDA is particularly beneficial when aiming to improve classification performance while reducing dimensionality. It can enhance class separability in the reduced space.

5. Kernel PCA

Kernel Principal Component Analysis (Kernel PCA) extends traditional PCA by applying the kernel trick, allowing it to capture non-linear relationships in the data.

How the algorithm works:

Choose Kernel Function: Select a suitable kernel function (e.g., polynomial, radial basis function) to map the data into a higher-dimensional space implicitly.
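Apply PCA: Compute the kernel matrix of pairwise kernel values, centre it, and calculate its eigenvalues and eigenvectors. This is equivalent to performing PCA in the higher-dimensional space without ever computing coordinates there.
Selecting Components: Choose the principal components based on the eigenvalues.
Projection: Project each data point onto the selected components using kernel evaluations, yielding the reduced-dimensional representation. Unlike linear PCA, there is no exact mapping back to the original space; reconstructing inputs from the reduced representation requires an approximation.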
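The sketch below compares linear PCA with scikit-learn's KernelPCA on two concentric circles, a standard example of structure that no linear projection can untangle. The RBF kernel and gamma=2 are assumptions picked for this toy dataset and would need tuning elsewhere; the logistic regression is only a probe for how linearly separable each representation is.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric circles: the classes are not linearly separable in 2D.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=2).fit_transform(X)

# Fit a linear classifier on each representation and compare training accuracy.
acc_pca = LogisticRegression().fit(X_pca, y).score(X_pca, y)
acc_kpca = LogisticRegression().fit(X_kpca, y).score(X_kpca, y)

print("Linear classifier accuracy after PCA:", acc_pca)
print("Linear classifier accuracy after Kernel PCA:", acc_kpca)
```

Linear PCA can only rotate the circles, so a linear classifier does little better than chance, whereas the kernel projection tends to place the inner and outer circle in different regions of the new space.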

Kernel PCA is effective when dealing with data that exhibits complex non-linear patterns. It’s valuable for retaining the advantages of PCA while accounting for non-linearities.

Each of these dimensionality reduction techniques offers a unique approach to addressing the challenges of high-dimensional data. Principal Component Analysis, t-Distributed Stochastic Neighbor Embedding, Autoencoders, Linear Discriminant Analysis, and Kernel PCA cater to different objectives, from variance capture to data visualization and non-linear pattern recognition. By understanding the strengths and applications of these techniques, you can make informed decisions on which to employ based on your data’s characteristics and the goals of your analysis.

In the next section, we’ll delve into the factors that should be considered when selecting a dimensionality reduction technique, along with practical tips for applying them effectively to real-world datasets.

What factors should you consider when applying dimensionality reduction?

As you delve into dimensionality reduction, it’s essential to approach the task with a clear understanding of various factors that can influence your choice of technique. Each technique comes with its own set of characteristics, advantages, and limitations. Here are some key factors to consider when selecting the most suitable dimensionality reduction approach for your data and objectives:

1. Data Characteristics

Data Distribution: Is the structure in your data approximately linear, or non-linear? Some techniques, like PCA, assume linearity, while others, like Kernel PCA, can capture non-linear relationships.

Noise Levels: The presence of noise in the data might impact the performance of specific techniques. PCA can suppress noise by discarding low-variance components, and denoising autoencoders are trained explicitly to reconstruct clean inputs from corrupted ones.

Data Type: Are your features numerical, categorical, or mixed? Different techniques have different requirements and assumptions about the data types they work best with.

2. Preservation of Information

Variance Preservation: If retaining as much variance as possible is crucial, techniques like PCA might be suitable as they focus on capturing the maximum variance in the data.

Local vs. Global Structure: Decide whether local or global structure matters more for your task. Methods like t-SNE are adept at preserving local relationships (which points are close to each other), while techniques like PCA focus on global patterns such as the overall directions of variance.

3. Computational Complexity

Scalability: Consider the size of your dataset. Some techniques might become computationally expensive as the dataset grows. Random Projection is known for its efficiency in handling large datasets.
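Random projection has not been introduced above, so here is a small sketch of the idea using scikit-learn's GaussianRandomProjection: the data is multiplied by a random matrix, with no fitting to the data beyond its shape. The data sizes and the target of 500 dimensions are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))  # hypothetical data: 200 samples, 5000 features

# Project from 5000 down to 500 dimensions with a random Gaussian matrix.
projector = GaussianRandomProjection(n_components=500, random_state=0)
X_small = projector.fit_transform(X)

# Compare pairwise distances before and after, ignoring the zero diagonal.
d_orig = pairwise_distances(X)
d_small = pairwise_distances(X_small)
off_diagonal = ~np.eye(len(X), dtype=bool)
ratio = d_small[off_diagonal] / d_orig[off_diagonal]

print("Reduced shape:", X_small.shape)
print("Distance ratio range:", ratio.min(), "to", ratio.max())
```

The ratios stay close to 1, meaning pairwise distances are approximately preserved even though no eigendecomposition or iterative optimization was needed.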

4. Interpretability and Visualization

Interpretability: Depending on your goals, you might prioritize techniques that offer more interpretable components or dimensions.

Visualization: If your primary objective is visualization, techniques like t-SNE and PCA can help you map high-dimensional data into a 2D or 3D space.

5. Overfitting and Generalization

Overfitting: Techniques that focus on feature selection or extraction can help mitigate the risk of overfitting by reducing the complexity of the model.

Generalization: Some techniques, like Autoencoders, can learn representations that generalize well to unseen data.

6. Domain Knowledge

Domain Understanding: Your familiarity with the data and the problem domain can guide your choice of technique. Specific techniques might align better with the inherent characteristics of the data.

7. Experimentation and Iteration

Experimentation: It’s often beneficial to experiment with multiple techniques to see which performs best for your task.

Hyperparameters: Many techniques have hyperparameters that can impact their performance. Experimenting with these parameters can lead to better results.

8. Trade-offs

Keep in mind that every dimensionality reduction technique involves trade-offs. While these methods can simplify data and enhance analysis, they can also result in information loss to varying degrees. It’s essential to strike a balance between dimensionality reduction and the preservation of crucial information, aligning with your project’s objectives.

Applications of dimensionality reduction

Dimensionality reduction techniques find applications across a wide range of domains, each benefiting from the ability to distil complex, high-dimensional data into more manageable and informative representations. Let’s explore some practical applications where dimensionality reduction plays a pivotal role:

1. Image and Video Analysis

In computer vision, images and videos are often represented as high-dimensional pixel arrays. Dimensionality reduction techniques allow us to extract essential features and patterns from these images, aiding tasks such as object recognition, facial expression analysis, and image clustering. PCA and Autoencoders are commonly used to reduce the dimensionality of image data, making it easier to train models and recognize objects efficiently.

2. Natural Language Processing (NLP)

In text analysis, documents, sentences, or words are often represented in high-dimensional vector spaces. Dimensionality reduction can help uncover hidden semantic relationships between words, topics, and documents. For instance, techniques like Latent Semantic Analysis (LSA) and t-SNE are employed to visualize text data structure, making it easier to analyze and understand textual information.
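As a rough sketch of Latent Semantic Analysis, TF-IDF vectors can be reduced with a truncated SVD in scikit-learn. The four-sentence corpus and the choice of two latent dimensions are made up for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: two documents about cats, two about finance.
documents = [
    "the cat sat on the mat",
    "a cat and a kitten played on the mat",
    "stock markets fell as interest rates rose",
    "investors watched interest rates and stock prices",
]

# One sparse, high-dimensional row per document (one column per word).
tfidf = TfidfVectorizer().fit_transform(documents)

# LSA: truncated SVD works directly on the sparse matrix.
lsa = TruncatedSVD(n_components=2, random_state=0)
doc_topics = lsa.fit_transform(tfidf)

print("TF-IDF shape:", tfidf.shape)
print("LSA shape:", doc_topics.shape)
```

TruncatedSVD is used here rather than PCA because it does not centre the data, which keeps the TF-IDF matrix sparse.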

3. Genomics and Bioinformatics

Biological data, such as gene expression profiles, often involve many features. Dimensionality reduction aids in identifying essential genes or features that contribute to specific biological phenomena. By reducing the dimensionality of gene expression data, researchers can pinpoint relevant genes and gain insights into genetic patterns associated with diseases or conditions.

4. Recommender Systems

Recommender systems aim to provide personalized recommendations to users based on their preferences. These systems often operate in high-dimensional spaces of user-item interactions. Dimensionality reduction helps uncover latent factors that influence user preferences and item characteristics. Matrix factorization techniques, including Non-negative Matrix Factorization (NMF) and Singular Value Decomposition (SVD), are commonly used to create meaningful user and item representations for recommendation tasks.
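A small sketch of the matrix factorization idea, using scikit-learn's NMF on a made-up ratings matrix. Treating unrated items as zeros is a simplifying assumption of this example; production recommenders usually handle missing ratings explicitly.

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 4],
    [1, 0, 4, 5, 5],
], dtype=float)

# Factor the matrix into 2 latent factors per user and per item.
model = NMF(n_components=2, init="nndsvda", max_iter=1000, random_state=0)
user_factors = model.fit_transform(ratings)  # shape (4 users, 2 factors)
item_factors = model.components_             # shape (2 factors, 5 items)

# The product approximates the original matrix and also fills in
# scores for the unrated cells.
approx = user_factors @ item_factors
print(np.round(approx, 1))
```

Each user and each item is now described by two non-negative numbers instead of a full row or column of the ratings matrix.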

5. Finance and Economics

In finance, analyzing market data with a high number of variables can be challenging. Dimensionality reduction techniques enable traders and analysts to identify relevant market factors and reduce the complexity of financial models. These methods contribute to risk assessment, portfolio optimization, and anomaly detection.

6. Healthcare and Medical Imaging

Medical imaging data, such as MRI scans, is inherently high-dimensional and often requires complex processing. Dimensionality reduction assists in detecting anomalies, segmenting tissues, and even identifying potential disease markers. Techniques like PCA and manifold learning help medical professionals visualize and understand complex image data.

7. Anomaly Detection

In various industries, detecting anomalies or outliers in data is critical for quality control and security. Dimensionality reduction helps create compact representations of normal data, making deviations from the norm more noticeable. Anomalies stand out in the reduced-dimensional space, facilitating efficient detection.
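One common way to apply this idea is to use PCA reconstruction error as an anomaly score. The sketch below is a minimal version under stated assumptions: the synthetic data, the 99th-percentile threshold, and the helper reconstruction_error are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Normal data lies close to a 2D plane embedded in 10 dimensions.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X_normal = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

# Anomalies are scattered in all 10 dimensions, away from that plane.
X_anomalies = rng.normal(scale=3.0, size=(5, 10))

# Learn the compact representation from normal data only.
pca = PCA(n_components=2).fit(X_normal)

def reconstruction_error(X):
    """Distance between each sample and its projection onto the PCA subspace."""
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.linalg.norm(X - X_hat, axis=1)

# Flag anything that reconstructs worse than 99% of the normal samples.
threshold = np.percentile(reconstruction_error(X_normal), 99)
flags = reconstruction_error(X_anomalies) > threshold

print("Threshold:", threshold)
print("Anomalies flagged:", flags)
```

Points that fit the learned subspace reconstruct almost perfectly, while points that break the usual correlations between features leave a large residual.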

8. Enhancing Visualization

One of the most immediate benefits of dimensionality reduction is improved data visualization. Techniques like t-SNE and PCA project high-dimensional data into lower-dimensional spaces, making it possible to visualize clusters, patterns, and relationships that might not be apparent in the original high-dimensional space.

Dimensionality reduction example

Let’s look at a simple example of dimensionality reduction using Principal Component Analysis (PCA):

Let’s say you have a dataset of images, each represented as a vector of pixel values. Each image is 100×100 pixels, so you have 10,000 dimensions (features) for each image. However, working with such high-dimensional data can be computationally expensive and challenging for visualization.

You want to reduce the dimensionality of the dataset while retaining the most critical information. PCA can help with this.

This is how PCA would be applied to the image data:

Data Preparation: You have a dataset with 1000 images, each represented by a 10,000-dimensional vector (100×100 pixels).
Mean Centering: Subtract the mean from each feature across all images. This step ensures that the data is centred around the origin.
Calculate Covariance Matrix: Compute the covariance matrix of the mean-centred data.
Compute Eigenvalues and Eigenvectors: Calculate the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors represent the directions (principal components) with the highest variance in the data.
Select Principal Components: Sort the eigenvectors in decreasing order based on their corresponding eigenvalues. The eigenvectors with the highest eigenvalues capture the most variance. You choose a certain number of top eigenvectors to retain, effectively reducing the dimensionality.
Project Data onto Lower-Dimensional Space: Multiply the original mean-centred data by the selected eigenvectors to project the data onto the lower-dimensional space defined by these eigenvectors. The result is a new dataset with reduced dimensions.

PCA helps identify the directions (principal components) in which the data varies the most. By projecting the data onto these components, you create a new representation of the data that captures the most essential variations while using fewer dimensions. This can make subsequent tasks like visualization, clustering, or classification more manageable and efficient.

It’s important to note that the number of principal components to retain (say, 50 for the 10,000-dimensional image vectors above) is a hyperparameter that can be determined based on factors like the desired amount of variance to preserve or the specific task you’re working on.
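One way to choose the number of components is to look at the cumulative explained variance. The sketch below assumes the scikit-learn digits dataset (64 pixel features) as a small stand-in for the image data described above and a target of 95% of the variance; both are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)  # 1797 images, 8x8 = 64 features each

# Fit PCA with all components and accumulate the explained variance.
full_pca = PCA().fit(X)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%.
n_needed = int(np.argmax(cumulative >= 0.95)) + 1
print("Components needed for 95% of the variance:", n_needed)

# Shortcut: a float between 0 and 1 asks scikit-learn to pick that number.
pca_95 = PCA(n_components=0.95).fit(X)
print("Components chosen by PCA(n_components=0.95):", pca_95.n_components_)
```

Plotting the cumulative curve is also useful: a sharp bend often suggests a natural cut-off, while a slowly rising curve indicates that the variance is spread across many directions.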

How to implement dimensionality reduction in Python

We can now implement the above example to perform dimensionality reduction using Principal Component Analysis (PCA) in Python using the popular machine learning library, scikit-learn. Instead of images, we generate random data to make the example easier to replicate.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Generating some fake data for this demonstration
np.random.seed(42)
num_samples = 100

# Create correlated data with a positive correlation
mean = [5, 7]
cov = [[2, 1.5], [1.5, 2]]
data = np.random.multivariate_normal(mean, cov, num_samples)

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

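# Apply PCA

# Instantiate PCA with the number of components you want to retain.
# The data has only 2 features, so keeping 2 components rotates the data
# onto its principal axes without discarding anything; use 1 to actually reduce.
num_components = 2
pca = PCA(n_components=num_components)

# Fit PCA to the scaled data and project the data onto the components
pca_data = pca.fit_transform(scaled_data)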

# Explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_
print("Explained Variance Ratio:", explained_variance_ratio)

# Visualize the original and PCA-transformed data
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.scatter(data[:, 0], data[:, 1])
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Original Data')

plt.subplot(1, 2, 2)
plt.scatter(pca_data[:, 0], pca_data[:, 1])
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA Transformed Data')

plt.tight_layout()
plt.show()

Figure: the original data (left) and the PCA-transformed data (right), both in 2 dimensions.

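In this example, we first generate synthetic data: 100 samples with two positively correlated features. We then standardize the data and apply PCA with 2 components. Because the data has only two features, keeping both components rotates the data onto its principal axes rather than reducing it; setting num_components to 1 would keep only the direction of greatest variance. The explained_variance_ratio_ attribute tells us the proportion of total variance explained by each of the selected principal components. Finally, we plot the original and the PCA-transformed data side by side.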

Remember that in a real-world scenario, you would replace the synthetic data with your dataset and adjust the parameters accordingly. Also, don’t forget to scale your data before applying PCA to ensure all features are on the same scale.
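When the reduced data feeds a downstream model, one convenient pattern is to chain scaling, PCA, and the model in a scikit-learn pipeline, so the scaler and PCA are fitted on the training data only and then reused on new data. The digits dataset, the 20 components, and the logistic regression classifier below are illustrative assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)  # 64 features per sample
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale, reduce 64 features to 20 components, then classify.
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)  # scaler and PCA see only the training data

# model[:-1] is the pipeline without the final classifier.
X_test_reduced = model[:-1].transform(X_test)
test_accuracy = model.score(X_test, y_test)

print("Reduced test shape:", X_test_reduced.shape)
print("Test accuracy:", test_accuracy)
```

Fitting the scaler or PCA on the full dataset before splitting would leak information from the test set into the transformation, which the pipeline avoids.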

Dimensionality reduction conclusion

Dimensionality reduction is a crucial technique in machine learning and data analysis. By transforming high-dimensional data into more manageable representations, we overcome the challenges of computational complexity, overfitting, and visualization. In this journey, we’ve explored several dimensionality reduction techniques:

Principal Component Analysis (PCA): Captures variance, simplifies data, and aids visualization.
t-Distributed Stochastic Neighbor Embedding (t-SNE): Visualizes complex data patterns and clusters.
Autoencoders: Learns intricate relationships and compresses data using neural networks.
Linear Discriminant Analysis (LDA): Enhances class separability for classification tasks.
Kernel PCA: Extends PCA to capture non-linear relationships in higher dimensions.

When selecting a technique, consider factors such as your data’s characteristics, preservation of information, computational efficiency, and interpretability. Dimensionality reduction finds applications across diverse domains, from image analysis and natural language processing to finance, healthcare, and beyond. These techniques empower us to explore data in new ways, extract meaningful insights, and make informed decisions.

As you apply dimensionality reduction to your projects, remember that the journey involves experimentation and iteration. By understanding the techniques’ strengths and limitations and aligning them with your objectives, you can uncover hidden patterns, simplify complex data, and unlock the potential for innovation in your data-driven endeavours.