The Curse Of Dimensionality, When It Occurs And How To Overcome It

by | Nov 29, 2022 | artificial intelligence, Data Science, Machine Learning, Natural Language Processing

When working with high-dimensional data, several problems arise that are collectively known as the “Curse of Dimensionality.” A dataset’s dimension is its number of attributes or features. High-dimensional data refers to a dataset with a large number of features, typically on the order of 100 or more. The problem with high-dimensional data is that it is hard to draw correct conclusions from it. As the number of dimensions increases, the error in the data grows, and it becomes easier to mistake noise for real correlations.

A typical example of high-dimensional data is text data. When converting text to a numerical representation (a vector), we give each distinct word a number; see our article on tf-idf for how to do this. As a result, every word becomes a feature. We quickly end up with more than a hundred features and are therefore forced to deal with high-dimensional data. Consequently, we need to be aware of the curse of dimensionality before further processing the data or making any decisions based on it.
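As a minimal sketch of this, the snippet below vectorizes a tiny made-up corpus with scikit-learn's CountVectorizer; even a handful of short sentences already produces one feature per distinct word:

```python
# Minimal sketch: vectorizing a few sentences quickly yields many features.
# Uses scikit-learn's CountVectorizer; the corpus is made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the curse of dimensionality affects text data",
    "every distinct word becomes a feature",
    "sparse vectors grow with the vocabulary",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Each column is one word; even a tiny corpus produces a wide matrix.
print(X.shape)
```

With a realistic corpus, the vocabulary (and therefore the dimension) easily grows into the thousands.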


High-dimensional data typically has more than a hundred features.

What is the curse of dimensionality?

High-dimensional data presents several challenges when analyzing or visualizing the data to find patterns and develop machine learning models.

The “curse of dimensionality” says that:

The error grows with the number of features.

In practice, machine learning algorithms for high-dimensional data are more challenging to design because patterns in the data are hard to distinguish from the noise.

These algorithms also frequently have running times that grow exponentially with the number of dimensions.
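The noise problem can be illustrated with a small experiment on synthetic random data (the point counts and dimensions below are arbitrary choices): as the dimension grows, the distances from a query point to its nearest and farthest neighbours become almost equal, so "close" and "far" lose their meaning.

```python
# Illustration of distance concentration in high dimensions using
# uniformly random synthetic points.
import numpy as np

rng = np.random.default_rng(0)
contrasts = {}

for dim in (2, 10, 100, 1000):
    points = rng.random((500, dim))
    query = rng.random(dim)
    dists = np.linalg.norm(points - query, axis=1)
    # Relative gap between the farthest and nearest neighbour:
    # this shrinks as the dimension grows.
    contrasts[dim] = (dists.max() - dists.min()) / dists.min()
    print(f"dim={dim:5d}  relative contrast={contrasts[dim]:.2f}")
```

With a vanishing contrast, nearest-neighbour style reasoning, and any pattern that relies on it, degrades badly.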

What domains are affected by the curse of dimensionality?

Many domains are directly affected by the curse of dimensionality; any field that works with data containing many attributes faces this issue.

Natural Language Processing (NLP)

When working with textual data, we often turn text into vectors to obtain a numerical input that can be passed to a machine learning model. Unfortunately, turning text into numbers produces sparse datasets with complicated patterns, so the “curse of dimensionality” is an issue in most NLP solutions. To combat this, we often spend more time on feature engineering to reduce the number of features, or we gather more data to increase the size of the dataset.
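One common way to shrink such sparse text features is a matrix factorisation such as TruncatedSVD applied to a tf-idf matrix; the toy corpus and the choice of 2 components below are purely illustrative:

```python
# Sketch: shrinking a sparse tf-idf matrix with TruncatedSVD, a common
# dimensionality-reduction step in NLP pipelines.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "text becomes numbers",
    "numbers feed the model",
    "the model learns patterns",
    "patterns hide in noise",
    "noise grows with features",
    "features come from words",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # sparse (6, vocabulary_size)
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(tfidf)

print(tfidf.shape, "->", reduced.shape)
```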

Anomaly Detection

Anomaly detection is the task of finding unexpected elements or events in a dataset. In high-dimensional data, anomalies frequently display numerous attributes that are irrelevant to their actual nature.

For example, network traffic is monitored for threats and unusual activity in cyber security. But with so much activity originating from so many different sources, it is hard to distinguish “normal” activity from that which is a threat to the system.

Machine Learning

To maintain the same level of performance in a machine learning model, a slight increase in dimensionality necessitates a significant increase in data volume. The opposite is also true: if we can reduce the number of features in our dataset, we can train our models on far less data. So when working on feature selection, it is crucial to keep the number of features low enough to avoid the curse of dimensionality.

How to combat the curse of dimensionality?

Dimensionality reduction allows us to cut the number of features and thereby counter the curse of dimensionality. It transforms a high-dimensional space into a lower-dimensional one while preserving the essential properties of the data, reducing the number of input variables in a dataset. Removing the superfluous variables makes the data easier for analysts to work with and helps algorithms produce faster and better results.


Feature selection decides which features to keep and which to discard.

Dimensionality reduction algorithms broadly fall into two categories: “feature selection” and “feature extraction” techniques.

Feature selection techniques

In feature selection techniques, the attributes are tested to determine their value before being chosen or rejected. The most frequently used feature selection methods are discussed below.

Low Variance Filter

This method compares the variance of each feature’s distribution across the dataset and discards attributes with very low variance. Features that barely vary are close to constant and therefore add little to the model’s predictive power.
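A low variance filter can be sketched with scikit-learn's VarianceThreshold; the small dataset and the 0.1 cut-off below are illustrative choices:

```python
# Sketch of a low-variance filter using scikit-learn's VarianceThreshold.
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [0.0, 1.0, 10.0],
    [0.1, 2.0, 20.0],
    [0.0, 3.0, 30.0],
    [0.1, 4.0, 40.0],
])  # column 0 is nearly constant

selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)

print(X_reduced.shape)  # the near-constant column is dropped
```

Note that the right threshold depends on the scale of the features, so it is common to standardise or inspect the variances first.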

High Correlation Filter

This method computes the pair-wise correlation between attributes. For pairs with a very high correlation, one feature is dropped and the other is kept, since the retained feature captures most of the variation in the eliminated one.


Sometimes no single pair of attributes shows a high correlation, yet if each attribute is regressed as a function of the others, we find that the variability of some features is entirely captured by the rest. This situation is called multicollinearity, and the variance inflation factor (VIF) is widely used to detect it. Attributes with high VIF values, generally greater than 10, are eliminated.
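Both ideas can be sketched in a few lines of NumPy: the correlation matrix flags near-duplicate pairs, and a VIF computed by regressing each feature on the others flags multicollinearity. The synthetic data and the 10 cut-off below are illustrative:

```python
# Sketch: high-correlation filter plus variance inflation factors (VIF)
# on synthetic data where feature b is nearly a copy of feature a.
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.05, size=200)  # nearly a copy of a
c = rng.normal(size=200)                  # independent
X = np.column_stack([a, b, c])

corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))  # corr[0, 1] is close to 1, so drop a or b

def vif(X, i):
    """VIF_i = 1 / (1 - R^2) from regressing column i on the others."""
    y = X[:, i]
    others = np.column_stack([np.ones(len(y)), np.delete(X, i, axis=1)])
    beta, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ beta
    r2 = 1 - resid.var() / y.var()
    return 1 / (1 - r2)

print([round(vif(X, i), 1) for i in range(X.shape[1])])  # a and b far above 10
```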

Feature Ranking

Decision tree models such as CART can rank the attributes according to their significance or contribution to the model’s predictability. Lower-ranked variables can then be removed to reduce the dimensions of high-dimensional data.
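A rough sketch with scikit-learn's DecisionTreeClassifier, on synthetic data where one feature carries the signal and the other is pure noise:

```python
# Sketch of feature ranking with a CART-style decision tree in scikit-learn.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
signal = rng.normal(size=300)
noise = rng.normal(size=300)
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)  # the label depends only on feature 0

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Higher importance = bigger contribution to the splits; low-ranked
# features are candidates for removal.
print(tree.feature_importances_)
```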

Feature Extraction Techniques 

In feature extraction techniques, the high-dimensional attributes are combined into low-dimensional components (PCA or ICA) or factored into low-dimensional latent factors (FA).

Principal Component Analysis (PCA)

Principal component analysis (PCA) is a dimensionality-reduction technique that transforms highly correlated, high-dimensional data into a set of uncorrelated, lower-dimensional components known as principal components. These lower-dimensional principal components capture the majority of the information in the high-dimensional dataset. After the n-dimensional data is transformed into n principal components, a subset of them is chosen based on the percentage of variance in the data they are intended to capture. For example, after transforming 10-dimensional data into 10 principal components, only 3 of them may be needed to account for 90% of the variance in the data, so the 10-dimensional dataset can be condensed into just 3 dimensions.
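A sketch of this with scikit-learn's PCA on synthetic 10-dimensional data generated from 3 underlying factors (the sizes are arbitrary); passing a float to n_components keeps just enough components to explain that fraction of the variance:

```python
# Sketch: PCA on correlated 10-dimensional synthetic data, keeping enough
# components to explain 90% of the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))   # 3 underlying factors
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 10))

pca = PCA(n_components=0.90)  # keep components explaining 90% of variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape[1])                   # only a handful of components
print(pca.explained_variance_ratio_.sum())  # at least 0.90 by construction
```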

Factor Analysis (FA)

Factor analysis assumes that every observed attribute in a dataset can be represented as a weighted linear combination of latent factors. The method’s underlying premise is that n dimensions of data can be represented by m factors (m < n). The primary distinction between PCA and FA is that, whereas PCA builds components from the underlying attributes, FA decomposes the attributes into latent factors.
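A minimal sketch with scikit-learn's FactorAnalysis, on synthetic data built from m = 2 latent factors driving n = 6 observed attributes:

```python
# Sketch of factor analysis: m = 2 latent factors explain n = 6 attributes.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
factors = rng.normal(size=(400, 2))   # m = 2 latent factors
loadings = rng.normal(size=(2, 6))    # weights onto n = 6 attributes
X = factors @ loadings + 0.1 * rng.normal(size=(400, 6))

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X)          # per-sample factor scores

print(scores.shape)          # (400, 2)
print(fa.components_.shape)  # estimated loadings, (2, 6)
```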

Independent Component Analysis (ICA)

ICA assumes that all observed attributes are mixtures of separate, independent components and resolves the variables into a combination of these components. ICA is thought to be more robust than PCA and is typically used when PCA and FA fail.
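A classic illustration with scikit-learn's FastICA: two observed signals that are mixtures of two independent sources (a sine and a square wave, chosen here purely for illustration) are unmixed back into their components:

```python
# Sketch: FastICA unmixing two independent signals from two observed
# mixtures (the "cocktail party" setting). Signals are synthetic.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)               # independent source 1 (sine)
s2 = np.sign(np.sin(3 * t))      # independent source 2 (square wave)
S = np.column_stack([s1, s2])

A = np.array([[1.0, 0.5], [0.5, 2.0]])  # mixing matrix
X = S @ A.T                              # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)             # recovered independent components

print(S_est.shape)  # (2000, 2)
```

ICA recovers the sources only up to sign, scale, and ordering, which is usually acceptable in practice.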

Key Takeaways – Curse of Dimensionality

  • When dealing with high-dimensional data, there are several issues known as the “Curse of Dimensionality.” First, the error grows with the number of features. Second, high-dimensional machine learning algorithms are more challenging to design because patterns are hard to distinguish from the noise in the data.
  • Natural language processing, anomaly detection, and more general machine learning problems are the three main areas affected by the curse of dimensionality.
  • Dimensionality reduction transforms a high-dimensional dataset into a lower-dimensional one, reducing the number of input variables. This makes the data easier for analysts to analyze and for algorithms to work with, and the removal of superfluous variables makes analysis faster and more effective.
  • There are two types of dimensionality reduction algorithms: “feature selection” and “feature extraction.” The main feature selection algorithms are the low variance filter, the high correlation filter, the multicollinearity (VIF) filter and feature ranking. The most prominent feature extraction algorithms are Principal Component Analysis (PCA), Factor Analysis (FA) and Independent Component Analysis (ICA).
  • Feature selection is the process by which attributes are tested to determine their value before they are chosen or rejected. In contrast, feature extraction techniques combine multiple features into a smaller, richer set of new features.

The curse of dimensionality at Spot Intelligence

The curse of dimensionality is a genuine problem that needs to be carefully considered when developing machine learning models or doing an analysis. Without accounting for it, you can find all sorts of correlations in your data that aren’t significant or representative, which leads to inaccurate results and decisions based on flawed analysis.

At Spot Intelligence, we process text and use many natural language processing techniques. As we often transform text into vectors, we create a lot of high-dimensional data. This high-dimensional data suffers from the “curse of dimensionality.” So we, too, need to be very careful when processing our data.

A good pre-processing pipeline that optimises the number of features for a given problem and dataset helps us manage this problem. Working with data whose feature space has been reduced helps remove the noise from our predictions and extractions. However, we also need to be careful with what we remove, as we don’t want to discard features with predictive power.

Have you faced the curse of dimensionality in your projects? Have you heard of the curse of variability? What are your favourite techniques to combat the problem? We would love to hear about them in the comment section below.

About the Author

Neri Van Otten


Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.
