Top 7 Ways To Implement Document & Text Similarity In Python: NLTK, Scikit-learn, BERT, RoBERTa, FastText and PyTorch

by | Dec 19, 2022 | Machine Learning, Natural Language Processing

Text similarity is a really useful natural language processing (NLP) tool. It allows you to find similar pieces of text and has many real-world use cases. This article discusses text similarity, its use, and the most common algorithms that implement it. We then deep dive into the six most popular machine learning and deep learning packages that implement text similarity in Python. The code examples will get you started implementing text similarity right away.

What is text similarity?

Text similarity measures how close two pieces of text are in meaning or content. Some measures capture surface overlap, such as shared characters or words, while others capture the degree to which two texts are semantically related. There are many ways to measure text similarity, including techniques such as cosine similarity, Levenshtein distance, and the Jaccard index. These techniques can be used to compare texts in different languages or within the same language.

Text similarity can be used to find similar information.

Text similarity can be used in many ways to find information, process natural language, and translate between languages automatically. For example, a search engine might use text similarity to rank search results based on their relevance to the query. In natural language processing, text similarity can be used to identify synonyms or generate text similar in style or meaning to a given text. Text similarity can be used in machine translation to find similar translations to the source text.

What is text similarity used for?

Some common use cases include:

  1. Plagiarism detection: Text similarity can be used to identify instances of plagiarism by comparing the similarity of a piece of text to other known texts.
  2. Document classification: Text similarity can be used to classify documents based on their content. For example, a document classification system might use text similarity to determine whether a document is relevant to a particular topic.
  3. Information retrieval: Text similarity can be used to identify relevant documents in a search engine or database. For example, a search query for “car” might return documents that contain similar words or phrases, such as “automobile” or “vehicle.”
  4. Language translation: Text similarity can be used to improve the accuracy of machine translation systems by comparing the similarity of a translated text to the original text.
  5. Sentiment analysis: Text similarity can be used to identify the sentiment of a piece of text by comparing it to a set of pre-classified texts with known sentiments.
  6. Summarization: Text similarity can be used to summarise a document by identifying the most important sentences and phrases within the document.

Different text similarity algorithms

Many different algorithms can be used to measure text similarity. Some common ones include:

1. Cosine similarity

This measures the similarity between two texts based on the angle between their word vectors. It is often used with term frequency-inverse document frequency (TF-IDF) vectors, representing each word’s importance in a document.

Cosine similarity measures the similarity between two non-zero vectors of an inner product space. In the context of document similarity, it is often used to measure the similarity between two documents represented as vectors of word frequencies. The cosine similarity between two vectors is calculated as the cosine of the angle between them.

To compute the cosine similarity between two documents, first construct a vector representation of each document, where each dimension corresponds to a word in the combined vocabulary of the two documents and the value of the dimension is the frequency of that word in the document. The cosine similarity is then the dot product of the two vectors divided by the product of their lengths (L2 norms). If the vectors are first normalized to unit length, the cosine similarity is simply their dot product.

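In general, cosine similarity ranges from -1 to 1: a value of 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. Because word-frequency and TF-IDF vectors contain no negative values, the cosine similarity between two documents represented this way falls between 0 and 1, where 1 indicates documents with the same relative word frequencies and 0 indicates documents with no words in common.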

Cosine similarity is widely used in natural language processing and information retrieval, particularly in document clustering, classification, and recommendation systems.
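
As a rough illustration of the calculation described above, the following sketch builds word-count vectors with the standard library and computes their cosine similarity. The function name cosine_similarity_counts, the whitespace tokenization, and the choice to return 0.0 for empty texts are assumptions made for this example, not part of any library.

```python
import math
from collections import Counter


def cosine_similarity_counts(text1, text2):
    """Cosine similarity between two texts using raw word counts."""
    # Naive tokenization (an assumption for this sketch): lowercase and split on whitespace
    counts1 = Counter(text1.lower().split())
    counts2 = Counter(text2.lower().split())

    # Dot product: only words present in both texts contribute
    shared_words = counts1.keys() & counts2.keys()
    dot = sum(counts1[word] * counts2[word] for word in shared_words)

    # Length (L2 norm) of each count vector
    norm1 = math.sqrt(sum(count * count for count in counts1.values()))
    norm2 = math.sqrt(sum(count * count for count in counts2.values()))

    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for empty texts
    return dot / (norm1 * norm2)


# The two texts share "the" (twice each), "cat", and "on", so the score is about 0.75
print(cosine_similarity_counts("the cat sat on the mat", "the cat slept on the bed"))
```

Raw counts keep the example short; in practice, TF-IDF weights are often used instead so that very common words contribute less.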

2. Levenshtein distance

Levenshtein distance, or edit distance, measures the difference between two strings. It is the minimum number of single-character insertions, deletions, or substitutions required to transform one string into another.

For example, the Levenshtein distance between “kitten” and “sitting” is 3, since three single-character edits are required to transform “kitten” into “sitting”: substitute “s” for “k”, substitute “i” for “e”, and insert “g” at the end.

Levenshtein distance is used in various applications such as spell-checking, string matching, and DNA analysis.
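
The definition above can be sketched with the classic dynamic-programming approach, which fills in the edit distance between every pair of prefixes one row at a time. The function name levenshtein is chosen for this example; dedicated libraries offer faster implementations.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] is the distance between the processed prefix of a and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning the first i characters of a into "" takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Keeping only two rows at a time means the memory used grows with the length of one string rather than with the product of both lengths.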

3. Jaccard index

The Jaccard index, or the Jaccard similarity coefficient, measures the similarity between two sets. It is defined as the ratio of the size of the intersection of the sets to the size of the union of the sets. In other words, it is the proportion of common elements between two sets.

The Jaccard index is particularly useful when the presence or absence of elements in the sets is more important than their frequency or order. For example, it can be used to compare the similarity of two documents by considering the sets of words that appear in each document.

The Jaccard index is calculated as follows:

J(A,B) = |A ∩ B| / |A ∪ B|

where A and B are sets, and |A| and |B| represent the cardinality or size of the sets.

The resulting value of the Jaccard index ranges from 0 to 1, where 0 indicates no common elements between the sets, and 1 indicates that the sets are identical.

The Jaccard index is widely used in various applications such as information retrieval, data mining, and pattern recognition. It is particularly useful when dealing with sparse or high-dimensional data, where the presence or absence of features is more important than their actual values.
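
A minimal sketch of the formula applied to two documents, treating each document as the set of words it contains. The function name jaccard_index, the whitespace tokenization, and the convention of returning 1.0 for two empty texts are assumptions for this example.

```python
def jaccard_index(text1, text2):
    """Jaccard index between the sets of words in two texts."""
    # Sets ignore word order and how often each word occurs
    set1 = set(text1.lower().split())
    set2 = set(text2.lower().split())

    union = set1 | set2
    if not union:
        return 1.0  # convention chosen here: two empty texts count as identical
    return len(set1 & set2) / len(union)


# 3 shared words ("the", "cat", "on") out of 7 distinct words, so about 0.43
print(jaccard_index("the cat sat on the mat", "the cat slept on the bed"))
```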

4. Euclidean distance

Euclidean distance is a measure of the distance between two points in a Euclidean space. It is calculated as the square root of the sum of the squares of the differences between the corresponding coordinates of the two points.

For example, the Euclidean distance between two points (x1, y1) and (x2, y2) in a two-dimensional space is given by:

euclidean distance = sqrt((x2 - x1)^2 + (y2 - y1)^2)

The Euclidean distance can be extended to spaces of any dimension. It is commonly used in machine learning and data analysis to measure the similarity between two vectors in a high-dimensional space.

In the context of document similarity, the Euclidean distance can be used to compare the frequency of words in two documents represented as vectors of word frequencies. In this case, the Euclidean distance between the two vectors is calculated as the square root of the sum of the squared differences between the corresponding frequency values in the two vectors.

The resulting value of Euclidean distance ranges from 0 to infinity, where 0 indicates identical vectors and larger values indicate greater dissimilarity between the vectors.

Euclidean distance is widely used in various applications such as clustering, classification, and anomaly detection. It is particularly useful when dealing with continuous variables or data that can be represented as vectors in a high-dimensional space.
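
To make the document-similarity case concrete, here is a small sketch that computes the Euclidean distance between the word-count vectors of two texts. The function name euclidean_distance and the whitespace tokenization are assumptions for this example.

```python
import math
from collections import Counter


def euclidean_distance(text1, text2):
    """Euclidean distance between the word-count vectors of two texts."""
    counts1 = Counter(text1.lower().split())
    counts2 = Counter(text2.lower().split())

    # Every word seen in either text is one dimension; a Counter returns 0 for missing words
    vocabulary = counts1.keys() | counts2.keys()
    squared_differences = sum((counts1[word] - counts2[word]) ** 2 for word in vocabulary)
    return math.sqrt(squared_differences)


# The texts differ in four words ("sat", "mat", "slept", "bed"), each by a count of 1
print(euclidean_distance("the cat sat on the mat", "the cat slept on the bed"))  # 2.0
```

Unlike cosine similarity, this distance grows with document length, so a long document and a short one on the same topic can still end up far apart.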

5. Hamming distance

Hamming distance measures the difference between two strings of equal length. It is defined as the number of positions at which the corresponding symbols differ. In other words, it is the minimum number of single-character substitutions required to transform one string into another of equal length.

For example, the Hamming distance between “101010” and “111011” is 2, since two positions differ between the two strings: the second and sixth.

Hamming distance is used in various applications such as error-correcting codes, coding theory, and cryptography. It can also be used to compare equal-length sequences of any kind, such as binary strings or aligned DNA sequences.

In computer science, Hamming distance is often used as a metric to measure the quality of codes. For example, in error-correcting codes, the minimum Hamming distance between codewords determines the number of errors that can be corrected by the code. Codes with a larger minimum Hamming distance are more robust to errors.

The Hamming distance can be calculated using a simple algorithm that compares the symbols at each position in the two strings and counts the number of positions where they differ.
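
The position-by-position comparison can be sketched as follows. The second function illustrates the point about error-correcting codes, using the standard result that a code with minimum distance d can correct up to (d - 1) // 2 errors. Both function names are chosen for this example.

```python
from itertools import combinations


def hamming_distance(s1, s2):
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance is only defined for strings of equal length")
    return sum(char1 != char2 for char1, char2 in zip(s1, s2))


def correctable_errors(codewords):
    """Errors a code can correct, derived from its minimum pairwise Hamming distance."""
    min_distance = min(hamming_distance(a, b) for a, b in combinations(codewords, 2))
    return (min_distance - 1) // 2


print(hamming_distance("101010", "111011"))  # 2
print(correctable_errors(["000", "111"]))    # 1 (three-bit repetition code)
```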

6. Word embeddings

Word embeddings are distributed representations of words in a natural language. They represent words as vectors of real numbers, where each vector dimension represents a different feature or aspect of the word’s meaning. Word embeddings are often fundamental in many natural language processing tasks, such as machine translation, text classification, and information retrieval.

Word embeddings are typically learned from large corpora of text data using models such as Word2Vec, a shallow neural network, or GloVe, which is fitted to word co-occurrence statistics. These models map words to a vector space, often with a few hundred dimensions, where semantically similar words are mapped to nearby points. The learned embeddings capture both syntactic and semantic relationships between words, including analogies.

Word embeddings have several advantages over traditional methods for representing words in natural language processing, such as one-hot encoding or bag-of-words representations. For example:

  • They are dense, meaning they are more space-efficient than sparse representations like one-hot encoding.
  • They can capture semantic relationships between words that cannot be easily captured by traditional methods.
  • They can infer relationships between words, and subword-based models such as FastText can generate representations for words not seen in the training data.

Many pre-trained word embeddings are available, which can be used for various NLP tasks. Additionally, custom word embeddings can be trained on specific domains or datasets to improve performance on specific tasks.
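
To show how embeddings are compared, the sketch below uses a tiny hand-written lookup table. The three-dimensional vectors in TOY_EMBEDDINGS are made up purely for illustration and do not come from any trained model; real embeddings are learned from data and usually have hundreds of dimensions.

```python
import math

# Made-up vectors for illustration only (NOT taken from a trained model)
TOY_EMBEDDINGS = {
    "car": [0.90, 0.10, 0.00],
    "automobile": [0.85, 0.15, 0.05],
    "banana": [0.00, 0.20, 0.90],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


car_vs_automobile = cosine(TOY_EMBEDDINGS["car"], TOY_EMBEDDINGS["automobile"])
car_vs_banana = cosine(TOY_EMBEDDINGS["car"], TOY_EMBEDDINGS["banana"])

# With these toy values, the related pair scores higher than the unrelated pair
print(car_vs_automobile > car_vs_banana)  # True
```

With a real pre-trained model, the lookup table would be replaced by the model's own vectors, while the comparison step stays the same.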

7. Pre-trained language models

Pre-trained language models are powerful tools for text similarity tasks, as they can learn high-quality representations of text that capture both semantic and syntactic information. Here are some of the most widely used pre-trained language models for text similarity tasks:

  1. BERT (Bidirectional Encoder Representations from Transformers): BERT is a transformer-based pre-trained language model widely used for various natural language processing tasks, including text similarity. It has been shown to outperform previous state-of-the-art methods on several benchmark datasets.
  2. RoBERTa (Robustly Optimized BERT Pretraining Approach): RoBERTa is a variant of BERT that is pre-trained using additional data and training strategies. It has achieved state-of-the-art performance on several text similarity benchmarks.
  3. DistilBERT: DistilBERT is a smaller and faster version of BERT trained using knowledge distillation. It has achieved competitive performance on several text similarity benchmarks while running considerably faster than BERT.
  4. USE (Universal Sentence Encoder): USE is a pre-trained model developed by Google that can encode sentences into fixed-length vectors. It can be used for text similarity tasks by computing the cosine similarity between the sentence embeddings.
  5. ALBERT (A Lite BERT): ALBERT is a variant of BERT that reduces the number of parameters and improves training efficiency while maintaining comparable performance.

These pre-trained language models can be fine-tuned on specific text similarity tasks using transfer learning, which involves training the model on a smaller dataset of labelled examples. Fine-tuning can further improve the performance of these models on specific tasks.

How to implement text similarity in Python?

1. Text similarity with NLTK

There are several ways to find text similarity in Python. One way is to use the Python Natural Language Toolkit (NLTK), a popular library for natural language processing tasks.

Here is an example of how to use NLTK to calculate the cosine similarity between two pieces of text:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download the required NLTK resources (only needed once;
# newer NLTK releases may also need 'punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def preprocess(text):
    # Tokenize, lowercase, and lemmatize the text, then remove stopwords
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    tokens = [lemmatizer.lemmatize(token.lower()) for token in word_tokenize(text)]
    # Join the tokens back into one string, since TfidfVectorizer expects whole documents
    return ' '.join(token for token in tokens if token not in stop_words)

def text_similarity(text1, text2):
    # Create the TF-IDF vectors, fitting on both texts so they share one vocabulary
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform([preprocess(text1), preprocess(text2)])

    # Calculate the cosine similarity; the result is a 1x1 matrix
    similarity = cosine_similarity(vectors[0:1], vectors[1:2])

    return similarity[0][0]

This code first tokenizes, lowercases, and lemmatizes the texts, removes stopwords, and then creates TF-IDF vectors for the two texts. Each text is passed to the vectorizer as a single document, because passing a list of tokens would treat every token as a separate document. Finally, it calculates the cosine similarity between the vectors using the cosine_similarity function from sklearn.metrics.pairwise.

2. Text similarity with Scikit-Learn

Scikit-learn is a popular Python library for machine learning tasks, including text similarity. To find similar texts with Scikit-learn, you can first use a feature extraction method like term frequency-inverse document frequency (TF-IDF) to turn the texts into numbers. You can then use a similarity measure such as cosine similarity to compare the texts.

Here is an example of how you might do this using Scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text1 = "This is the first text."
text2 = "This is the second text."

# Convert the texts into TF-IDF vectors
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform([text1, text2])

# Calculate the pairwise cosine similarity matrix;
# entry [0][1] is the similarity between text1 and text2
similarity = cosine_similarity(vectors)
print(similarity[0][1])

This code uses the TfidfVectorizer class to convert the texts into TF-IDF vectors, and then uses the cosine_similarity function from sklearn.metrics.pairwise to calculate the cosine similarity between the vectors.

Alternatively, you can use other feature extraction methods such as bag-of-words or word embeddings and other similarity measures such as Euclidean distance or the Jaccard index.
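
As one possible sketch of these alternatives, the example below uses Scikit-learn's CountVectorizer for a bag-of-words representation, then compares the texts with euclidean_distances and jaccard_score. The example texts are made up for illustration, and using binary presence vectors for the Jaccard index is a choice made for this sketch.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import jaccard_score
from sklearn.metrics.pairwise import euclidean_distances

texts = ["the cat sat on the mat", "the cat slept on the bed"]

# Bag-of-words counts: one row per text, one column per vocabulary word
counts = CountVectorizer().fit_transform(texts)
distance = euclidean_distances(counts)[0][1]

# Binary presence/absence vectors, so the Jaccard index compares sets of words
presence = CountVectorizer(binary=True).fit_transform(texts).toarray()
jaccard = jaccard_score(presence[0], presence[1])

print("Euclidean distance:", distance)
print("Jaccard index:", jaccard)
```

Note that Euclidean distance is a dissimilarity measure, so smaller values mean more similar texts, whereas a larger Jaccard index means more similar texts.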

3. Text similarity with BERT

To find text similarity with BERT, you can fine-tune a BERT model on a text similarity task such as sentence or document similarity. Then, you can use the fine-tuned model to make embeddings for the texts you want to compare and use the cosine similarity between the embeddings to measure how similar the texts are.

Here is an example of how you might do this using the transformers library in Python:

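import torch
import transformers

# Load the BERT tokenizer and model
tokenizer = transformers.BertTokenizer.from_pretrained('bert-base-uncased')
model = transformers.BertModel.from_pretrained('bert-base-uncased')

def embed(text):
    # Tokenize the text and run it through the model without tracking gradients
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Average the token embeddings into a single vector (mean pooling)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

text1 = "This is the first text."
text2 = "This is the second text."
encoding1 = embed(text1)
encoding2 = embed(text2)

# Calculate the cosine similarity between the embeddings
similarity = torch.nn.functional.cosine_similarity(encoding1, encoding2, dim=0)
print(similarity.item())

This code loads a pre-trained BERT tokenizer and model, tokenizes each text, and runs it through the model. BertModel has no encode method, so the embedding for each text is built here by averaging the token embeddings from the last hidden layer (mean pooling, which is one common choice among several). The cosine similarity between the two embeddings is then calculated.

Alternatively, you can use a fine-tuned BERT model trained specifically for text similarity. In this case, you would generate embeddings for the texts with the fine-tuned model in the same way and then calculate the cosine similarity as before.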

4. Text similarity with RoBERTa

To find text similarity with RoBERTa, you can fine-tune a RoBERTa model on a text similarity task such as sentence or document similarity. You can then use the fine-tuned model to generate embeddings for the texts you want to compare and calculate the cosine similarity between the embeddings as a measure of text similarity.

Here is an example of how you might do this using the transformers library in Python:

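import torch
import transformers

# Load the RoBERTa tokenizer and model
tokenizer = transformers.RobertaTokenizer.from_pretrained('roberta-base')
model = transformers.RobertaModel.from_pretrained('roberta-base')

def embed(text):
    # Tokenize the text and run it through the model without tracking gradients
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Average the token embeddings into a single vector (mean pooling)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

text1 = "This is the first text."
text2 = "This is the second text."
encoding1 = embed(text1)
encoding2 = embed(text2)

# Calculate the cosine similarity between the embeddings
similarity = torch.nn.functional.cosine_similarity(encoding1, encoding2, dim=0)
print(similarity.item())

This code loads a pre-trained RoBERTa tokenizer and model, tokenizes each text, and runs it through the model. RobertaModel has no encode method, so the embedding for each text is built here by averaging the token embeddings from the last hidden layer (mean pooling, which is one common choice among several). The cosine similarity between the two embeddings is then calculated.

Alternatively, you can use a fine-tuned RoBERTa model trained specifically for text similarity. In this case, you would generate text embeddings with the fine-tuned model in the same way and then calculate the cosine similarity as before.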

5. Text similarity with FastText

FastText is another excellent library for efficiently learning word representations and sentence classification. It can be used to find out how similar two pieces of text are by representing each piece of text as a vector and comparing the vectors using a similarity metric like cosine similarity.

Here is an example of how to find the similarity between two pieces of text using FastText in Python:

import fasttext
from scipy.spatial.distance import cosine

# Load the pre-trained FastText model (cc.en.300.bin must be downloaded first)
model = fasttext.load_model('cc.en.300.bin')

# Preprocess the text: get_sentence_vector expects a single-line string,
# so lowercase the text and replace any newlines with spaces
text1 = 'This is a piece of text'
text2 = 'This is another piece of text'
text1 = text1.lower().replace('\n', ' ')
text2 = text2.lower().replace('\n', ' ')

# Generate a sentence vector for each piece of text
vector1 = model.get_sentence_vector(text1)
vector2 = model.get_sentence_vector(text2)

# Calculate the similarity between the vectors using cosine similarity
# (scipy's cosine function returns a distance, so subtract it from 1)
similarity = 1 - cosine(vector1, vector2)
print('Similarity:', similarity)

This will output the similarity between the two pieces of text. Values close to 1 indicate that the texts are very similar, and values close to 0 indicate little similarity. Because embedding vectors can contain negative values, the score can also fall below 0.

6. Text similarity with PyTorch

Here is an example of how you can calculate text similarity using PyTorch:

import torch

# Convert a text to a bag-of-words count tensor over a shared vocabulary
def text_to_tensor(text, vocabulary):
  tokens = text.lower().split()
  return torch.tensor([float(tokens.count(word)) for word in vocabulary])

# Calculate the cosine similarity between two texts
def cosine_similarity(text1, text2):
  # Build a shared vocabulary and convert the texts to tensors
  vocabulary = sorted(set(text1.lower().split()) | set(text2.lower().split()))
  tensor1 = text_to_tensor(text1, vocabulary)
  tensor2 = text_to_tensor(text2, vocabulary)

  # Calculate the dot product of the tensors
  dot_product = torch.dot(tensor1, tensor2)

  # Calculate the norms of the tensors
  norm1 = torch.norm(tensor1)
  norm2 = torch.norm(tensor2)

  # Calculate the cosine similarity and return it as a Python float
  return (dot_product / (norm1 * norm2)).item()

# Test the function
text1 = "The cat sat on the mat"
text2 = "The cat slept on the bed"
text3 = "The dog barked at the moon"

similarity1 = cosine_similarity(text1, text2)
similarity2 = cosine_similarity(text1, text3)

print(f"Similarity between text1 and text2: {similarity1:.2f}")
print(f"Similarity between text1 and text3: {similarity2:.2f}")
# Similarity between text1 and text2: 0.75
# Similarity between text1 and text3: 0.50

PyTorch tensors hold numbers, not strings, so this example first converts each text to a vector of word counts over a shared vocabulary. It then calculates the cosine similarity by taking the dot product of the two tensors and dividing by the product of their norms. The resulting value measures the texts’ similarity, with higher values indicating greater similarity. Note that text1 and text3 still score 0.50 because both contain the word “the” twice; removing stopwords or using TF-IDF weights would reduce this effect.

Closing thoughts

Text similarity is a really popular NLP technique with many significant use cases. Many libraries in Python have ready-to-use implementations. What algorithm you choose will probably depend greatly on which tools you already use for your text pre-processing and what other modelling techniques you plan to use in your application.

If you run into scaling issues with the similarity algorithms described in this article, check out our SimHash article, as this is an excellent technique for detecting similarities in large volumes of data.

What is your favourite NLP library or model for text and document similarity? Let us know in the comments.

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.
