Local Sensitive Hashing (LSH)— Complete Guide & How To Get Started In Python

by | Jan 16, 2023 | Machine Learning, Natural Language Processing

What is local sensitive hashing?

A technique for performing a rough nearest neighbour search in high-dimensional spaces is called local sensitive hashing (LSH). It operates by mapping high-dimensional data points to a lower-dimensional space where the distance between the points is roughly preserved using hash functions.

What is hahing?

Hashing is a method of generating a fixed-size string (the hash) from a variable-size input (the message or data). The same input will always produce the same hash, but even a small change to the input will produce a vastly different hash. Hashing is often used for data structures such as hash tables or as a way to check the integrity of data by comparing a stored hash value with a newly generated hash of the same data.

Additionally, it can be used for indexing, digital signature, password storage and more.

Comparing the hash values of data points rather than their original high-dimensional coordinates enables an adequate search for nearest neighbours.

Information retrieval and recommendation systems are two applications that frequently use LSH.

local sensitive hashing on a global scale

LSH is frequently used for recommendation systems.

local sensitive hashing example

A recommendation engine for a music streaming service can illustrate how Local Sensitive Hashing (LSH) is used. There are a lot of songs in the music library. Each song is a high-dimensional feature vector with information about the artist, genre, tempo, and other things.

We can use LSH to map the high-dimensional feature vectors to a lower-dimensional space where the distance between songs is roughly preserved to find songs similar to a given song.

Here’s how it operates:

  1. The first step would be to select a set of hash functions that can translate the high-dimensional feature vectors into a lower-dimensional space. For example, we could use random projection functions that take a feature vector as an input and return a binary value.
  2. Then, we would use these hash functions to create a set of hash values for each song in the library.
  3. We would then create a table with each hash value as the key and the corresponding songs as the value.
  4. We would use the same hash functions on the feature vector of the given song and look up the corresponding hash values in the hash table to find similar songs. Songs that share a hash value are regarded as being similar.
  5. We can repeat the procedure with various hash functions to obtain multiple hash values for each song, increasing the likelihood of discovering related songs. This is possible by repeating the hash functions and using multiple hash tables.

Given that LSH is an approximative nearest neighbour search algorithm, the returned results will likely be similar to or close to the query point rather than the exact nearest neighbours.

Advantages of local sensitive hashing

Using Local Sensitive Hashing (LSH) for an approximative nearest neighbour search has many benefits:

  1. Efficiency: LSH can significantly lower the computational cost of a nearest neighbour search in high-dimensional spaces. LSH enables effective to search by comparing the hash values of data points rather than their original high-dimensional coordinates and mapping high-dimensional data points to a lower-dimensional space.
  2. Scalability: LSH can handle high-dimensional feature vectors and massive datasets with millions of data points. It is also easy to run in parallel, making it a good choice for distributed computing environments.
  3. Robustness to Noise: LSH is resistant to noise and data errors because it relies on the approximate matching of hash values rather than the exact matching of high-dimensional coordinates.
  4. Handling High-dimensional Data: Text, images, and videos are examples of high-dimensional data that LSH excels at handling because they can be challenging to work with using conventional techniques.
  5. Handling Sparse Data: LSH can also work with sparse data, which is common in applications like text classification, where the feature vectors often have zero entries.
  6. Flexibility: LSH is flexible and straightforward to modify to suit various requirements and applications. It can be used with multiple distance units.
  7. Memory-efficient: Due to the small memory requirements of the hash table and hash function coefficients, LSH is memory-efficient.
  8. Easy to use and understand: LSH is an intuitive and straightforward algorithm. This implies that researchers and professionals who might need to become more familiar with sophisticated mathematical techniques can use them.

Disadvantages of local sensitive hashing

Although Local Sensitive Hashing (LSH) has a lot of benefits, there are some drawbacks and restrictions to take into account:

  1. Approximate results: Being an approximative nearest neighbour search algorithm, LSH may not always return the exact nearest neighbours, only those that are similar to or near the query point. Using multiple hash tables and hash functions can increase the accuracy of the results, but doing so also raises the computation cost.
  2. Sensitivity to Hyperparameters: The number of hash tables, the number of hash operations, and the size of the hash buckets all affect how LSH works.
  3. Limited to specific data types: LSH may not work well when data points are spread randomly in high-dimensional space. It works best when the data points are on or near a low-dimensional subspace of the high-dimensional space.
  4. Collision Problem: Collisions, when two distinct data points are mapped to the same hash value, can cause false positives in LSH.
  5. Not suitable for all queries: LSH is intended for nearest neighbour searches but is inappropriate for range or k-nearest neighbour queries.
  6. Not suitable for all types of distance measures: LSH is based on the not-always-true presumption that the probability distribution of the distance measure is stable.
  7. It may not always be the best approach: Although LSH is an effective technique, other methods, such as tree-based indexing or random projection, might be better suited for a given problem.

LSH time complexity

The time complexity of Local Sensitive Hashing (LSH) depends on several factors, including the number of data points, the dimensionality of the feature vectors, the number of hash functions and hash tables used, and the size of the hash buckets.

Hash function mapping usually takes O(n * d) time, where n is the number of data points and d is the number of dimensions in the feature vectors.

The lookup step of a hash table usually takes O(1) time, assuming that the hash table is made with a data structure like a hash map, which has a constant lookup time complexity.

So, the overall LSH algorithm usually takes O(n * d + k) time, where k is the number of hash tables and hash functions used.

However, to improve the recall of the algorithm, multiple hash tables and hash functions are used, which increases the time complexity to O(n * d * k) .

It’s also important to note that the time complexity of LSH depends a lot on how the data is organised and which hash functions are used. This means that it can be different depending on the application and dataset.

local sensitive hashing vs clustering

Both clustering and local sensitive hashing (LSH) are techniques for assembling related data points, but they differ significantly in several important ways.

  1. Purpose: While clustering collects related data points into groups, LSH performs an approximate nearest neighbour search. While clustering compiles similar items and assigns them to a cluster, LSH is used to find comparable items and return them.
  2. Time complexity: LSH has a time complexity of O(n * d * k) where n is the number of data points, d is the dimensionality of the feature vectors, and k is the number of hash tables and hash functions used. Clustering algorithms, on the other hand, have varying time complexities depending on the specific algorithm used. For example, k-means clustering has a time complexity of O(n * k * I * d) , where I is the number of iterations and k is the number of clusters.
  3. Quality of results: Clustering algorithms return a set of clusters that meet many criteria, such as minimising the within-cluster variance. Being an approximative nearest neighbour search algorithm, LSH may not always yield the exact nearest neighbours, only those that are similar to or near the query point.
  4. Types of data and distance measures: Based on the presumption that the data points are located on or close to a low-dimensional subspace of the high-dimensional space, LSH uses a stable probability distribution as the distance metric. On the other hand, clustering can be used with a wide range of data types and distance measures.
  5. Scalability: With millions of data points and high-dimensional feature vectors, LSH is comparatively scalable and can handle large datasets. It is also simple to parallelise. Clustering algorithms may not scale, especially for large and high-dimensional data.

In conclusion, LSH and clustering are complementary techniques; LSH can locate similar items quickly, while clustering can collect similar items and assign them to a cluster.

LSH in Python

Local Sensitive Hashing (LSH) can be implemented in Python using the following steps:

  1. Install the necessary libraries: You will need to instal the following libraries: NumPy, Scipy, and Datasketch. These libraries provide functions for working with arrays, matrices, and hash tables.
  2. Choose a set of hash functions: You will need to choose a set of hash functions that can map the high-dimensional feature vectors to a lower-dimensional space. The datasketch library has several hash functions, like MinHash and LSHForest, that can be used for this.
  3. Create the hash tables: You will need to create one or more hash tables where the keys are the hash values, and the values are the corresponding data points. The datasketch library provides a HashTable class that can be used to create and manage hash tables.
  4. Hash the data points: You will need to apply the chosen hash functions to the feature vectors of the data points and add the resulting hash values to the suitable hash tables.
  5. Perform a nearest neighbour search: To find similar data points to a given query, you will need to apply the same hash functions to the query’s feature vector and look up the corresponding hash values in the hash tables. Any data points that have the same hash value are considered to be similar.

Code example

Here is an example of how to use the datasketch library in Python to create a MinHash LSH for finding similar items,

from datasketch import MinHash
from datasketch import MinHashLSH

# Create an LSH index
minhash = MinHash(num_perm=128)
lsh = MinHashLSH(threshold=0.5, num_perm=128)

# Add items to the index
minhash.update("item1".encode('utf8'))
lsh.insert("item1", minhash)

minhash.update("item2".encode('utf8'))
lsh.insert("item2", minhash)

# Find similar items
result = lsh.query(minhash)
print(result)

This is just a simple example. In real-world scenarios, you need to create a minhash for each item and update the minhash with the corresponding feature vector.

Conclusion

Local sensitive hashing (LSH) is a robust method for performing a rough nearest neighbour search in high-dimensional space. It uses hash functions to turn high-dimensional data points into points in a lower-dimensional space with roughly the same distance between them.

Comparing the hash values of data points rather than their original high-dimensional coordinates enables an adequate search for nearest neighbours.

Efficiency, scalability, robustness to noise, handling high-dimensional data, handling sparse data, and flexibility are just a few benefits of LSH.

However, it has some drawbacks, including approximate results, hyperparameter sensitivity, data type restrictions, collision problems, unsuitability for all queries, and distance measures.

With the help of libraries like NumPy, scipy, and datasketch, it can be implemented in Python.

LSH and clustering are complementary techniques; LSH helps find similar items quickly, while clustering helps assemble similar items and assign them to a cluster.

About the Author

Neri Van Otten

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.

Recent Articles

ROC curve

ROC And AUC Curves In Machine Learning Made Simple & How To Tutorial In Python

What are ROC and AUC Curves in Machine Learning? The ROC Curve The ROC (Receiver Operating Characteristic) curve is a graphical representation used to evaluate the...

decision boundaries for naive bayes

Naive Bayes Classification Made Simple & How To Tutorial In Python

What is Naive Bayes? Naive Bayes classifiers are a group of supervised learning algorithms based on applying Bayes' Theorem with a strong (naive) assumption that every...

One class SVM anomaly detection plot

How To Implement Anomaly Detection With One-Class SVM In Python

What is One-Class SVM? One-class SVM (Support Vector Machine) is a specialised form of the standard SVM tailored for unsupervised learning tasks, particularly anomaly...

decision tree example of weather to play tennis

Decision Trees In ML Complete Guide [How To Tutorial, Examples, 5 Types & Alternatives]

What are Decision Trees? Decision trees are versatile and intuitive machine learning models for classification and regression tasks. It represents decisions and their...

graphical representation of an isolation forest

Isolation Forest For Anomaly Detection Made Easy & How To Tutorial

What is an Isolation Forest? Isolation Forest, often abbreviated as iForest, is a powerful and efficient algorithm designed explicitly for anomaly detection. Introduced...

Illustration of batch gradient descent

Batch Gradient Descent In Machine Learning Made Simple & How To Tutorial In Python

What is Batch Gradient Descent? Batch gradient descent is a fundamental optimization algorithm in machine learning and numerical optimisation tasks. It is a variation...

Techniques for bias detection in machine learning

Bias Mitigation in Machine Learning [Practical How-To Guide & 12 Strategies]

In machine learning (ML), bias is not just a technical concern—it's a pressing ethical issue with profound implications. As AI systems become increasingly integrated...

text similarity python

Full-Text Search Explained, How To Implement & 6 Powerful Tools

What is Full-Text Search? Full-text search is a technique for efficiently and accurately retrieving textual data from large datasets. Unlike traditional search methods...

the hyperplane in a support vector regression (SVR)

Support Vector Regression (SVR) Simplified & How To Tutorial In Python

What is Support Vector Regression (SVR)? Support Vector Regression (SVR) is a machine learning technique for regression tasks. It extends the principles of Support...

0 Comments

Submit a Comment

Your email address will not be published. Required fields are marked *

nlp trends

2024 NLP Expert Trend Predictions

Get a FREE PDF with expert predictions for 2024. How will natural language processing (NLP) impact businesses? What can we expect from the state-of-the-art models?

Find out this and more by subscribing* to our NLP newsletter.

You have Successfully Subscribed!