When dealing with high-dimensional data, there are several issues known as the “Curse of Dimensionality.” A dataset’s quantity of attributes or features is referred to as the dataset’s dimension. High-dimensional data refers to a dataset with a lot of details, typically on the order of 100 or more. The problem with high-dimensional data is that it’s hard to draw the correct conclusions. As the number of dimensions increases, it becomes easier to confuse noise for real correlations because the data error increases.
Table of Contents
A typical example of high-dimensional data is text data. When converting text to a numerical representation (or a vector), we give each particular word a number. See our article on tf-idf as to how to do this. As a result, every word becomes a feature. We quickly have more than one hundred features, so we are forced to deal with high-dimensional data. And as a result, we need to be aware of the curse of dimensionality before further processing or making any form of decisions based on the data.
High-dimensional data typically has more than a hundred data points.
What is the curse of dimensionality?
High-dimensional data presents several challenges when analyzing or visualizing the data to find patterns and develop machine learning models.
The “curse of dimensionality” says that:
The error grows with the number of features.
This alludes to the fact that high-dimensional machine learning algorithms are more challenging to design as patterns in the data are hard to distinguish from the noise in the data.
These algorithms frequently have running times that are exponentially related to the dimensions.
What domains are affected by the curse of dimensionality?
There are a lot of domains directly affected by the curse of dimensionality. For example, any field with data with many attributes would face this issue.
Natural Language Processing (NLP)
When working with textual data, we often turn text into vectors to get numerical input that can then be passed to a machine learning model. Unfortunately, turning text into numbers results in sparse datasets with complicated patterns. This results in the “curse of dimensionality”, which is an issue in most NLP solutions. To combat this, we often spend more time on feature engineering to reduce the number of features or use more data to increase the size of the data set.
Finding unexpected elements or events in a dataset requires anomaly detection. Anomalies in high-dimensional data frequently display numerous attributes unrelated to their actual nature.
For example, network traffic is monitored for threats and unusual activity in cyber security. But with so much activity originating from so many different sources, it is hard to distinguish “normal” activity from that which is a threat to the system.
To maintain the same level of performance in a machine learning model, a slight increase in dimensionality necessitates a significant increase in data volume. The opposite is also true. If we can reduce the number of features in our data set, we need to train our models on much fewer data. So when working on feature selection, it is crucial to stick to a good number of features that don’t lead to the curse of dimensionality.
How to combat the curse of dimensionality?
Dimensionality reduction allows us to cut the number of features and, therefore, also solve the curse of dimensionality. Dimensionality reduction transforms a high-dimensional space into a lower-dimensional space without changing its properties. As a result, this process reduces the number of input variables in a dataset. This process removes the additional variables making it easy for analysts to analyze the data, which helps algorithms produce faster and better results.
Dimensionality reduction selects which features to keep and which to discard.
Many dimensionality reduction algorithms broadly fall into two categories: “feature selection” or “feature extraction” techniques.
Feature selection techniques
In feature selection techniques, the attributes are tested to determine their value before being chosen or rejected. The methods for feature selection that are most frequently used are discussed below.
Low Variance Filter
This method disregards attributes with a very low variance after comparing the variance in the dataset’s distribution of all the features. As a result, fewer variable attributes will be assumed to have a nearly constant value and will not improve the model’s predictability.
High Correlation Filter
The pair-wise correlation between attributes is found using this method. One feature is dropped in the pairs with a very high correlation while the other is kept. As a result, the retained feature captures the variation in the eliminated attributes.
If each attribute is regressed as a function of the others, we may see that the others entirely capture the variability of some features. Sometimes, a high correlation may not be found for pairs of attributes. Multicollinearity is the term for this feature, and the variance inflation factor (VIF) is widely used to identify multicollinearity. High VIF values—generally greater than 10—eliminate attributes.
The attributes can be ranked according to their significance or contribution to the model’s predictability using decision tree models like CART. Some lower-rated variables in high-dimensional data may be removed to reduce the dimensions.
Feature Extraction Techniques
The high dimensional attributes are combined into low dimensional components (PCA or ICA) or factored into low dimensional factors in feature extraction techniques (FA).
Principal Component Analysis (PCA)
A dimensionality-reduction technique known as principal component analysis (PCA) transforms highly correlated, high-dimensional data into a set of uncorrelated, lower-dimensional components known as principal components. The lower-dimensional principal components capture the majority of the data in the high-dimensional dataset. A subset of these principal components is chosen based on the percentage of variance in the data intended to be captured through the principle components after n-dimensional data is transformed into n-principal components. A straightforward example of transforming 10-dimensional data into 10 principal components is when only 3 principal components are required to account for 90% of the variance in the data. As a result, it is possible to condense a 10-dimensional dataset into just 3.
Factor Analysis (FA)
A dataset’s observed attributes are all assumed to be able to be represented as a weighted linear combination of latent factors in factor analysis. This method’s underlying premise is that n dimensions of data can be represented by m factors (mn). The primary distinction between PCA and FA is that, whereas PCA builds components from the fundamental attributes, FA breaks down the attributes into latent factors.
Independent Component Analysis (ICA)
ICA resolves the variables into a combination of these independent components. It does this under the assumption that all attributes are a mixture of separate components. ICA is typically used when PCA and FA fail because it is thought to be more reliable than PCA.
Key Takeaways – Curse of Dimensionality
- When dealing with high-dimensional data, there are several issues known as the “Curse of Dimensionality.” First, the error grows with the number of features. Second, high-dimensional machine learning algorithms are more challenging to design because patterns are hard to distinguish from the noise in the data.
- Natural language processing, abnormality detection, and more general machine learning problems are the three main areas that are affected by the curse of dimensionality.
- Dimensionality reduction is transforming a high-dimensional data set into a lower-dimensional one. Dimensionality reduction reduces the number of input variables in a dataset. This makes it easier for analysts to analyze and more intuitive for algorithms to work with. In addition, the lack of additional variables makes analysis faster and more effective.
- There are two types of dimensionality reduction algorithms: “feature selection” and “feature extraction.” The main feature selection algorithms are; low variance filter, high correlation filter, Multicollinearity and feature ranking. While the most prominent feature extraction algorithms are; Principal Component Analysis (PCA), Factor Analysis (FA) and Independent Component Analysis (ICA).
- Feature selection is the process by which attributes are tested to determine their value before they are chosen or rejected. In contrast, feature extraction techniques focus on combining multiple different features back into a different, more rich set of more minor features.
The curse of dimensionality at Spot Intelligence
The curse of dimensionality is a genuine problem that needs to be carefully considered when developing machine learning models or doing an analysis. Without it, you can find all sorts of correlations in your data that aren’t significant or representative of your data. This will lead to inaccurate results or decisions being made on incorrect analysis.
At Spot Intelligence, we process text and use many natural language processing techniques. As we often transform text into vectors, we create a lot of high-dimensional data. This high-dimensional data suffers from the “curse of dimensionality.” So we, too, need to be very careful when processing our data.
A good pre-processing pipeline that optimizes the number of features for a given problem and data set helps us manage this problem. Working with data that has its feature space reduced helps remove the noise in our predictions and extractions. However, we also need to be careful with what we remove, as we don’t want to remove those features with predictive power.
Have you faced the curse of dimensionality in your projects? Have you heard of the curse of variability? What are your favourite techniques to combat the problem? We would love to hear about them in the comment section below.