Lemmatization is the conversion of a word to its base form, or lemma. This differs from stemming, which reduces a word to its root form by removing prefixes and suffixes. Lemmatization, on the other hand, considers the context and meaning of a word and tries to convert it to a form that is more meaningful and easier to work with.
For example, the words “was,” “is,” and “will be” can all be lemmatized to the word “be.” Similarly, the words “better” and “best” can be lemmatized to the word “good.”
Lemmatization reduces each word in a text to its base form, making it easier to match keywords.
Lemmatization is commonly used in natural language processing (NLP) and information retrieval applications, where it can improve the accuracy and performance of text analysis and search algorithms. By converting words to their base form, lemmatization can reduce the dimensionality of the text data and allow the algorithms to focus on the most critical and relevant information in the text.
There are many different tools and libraries available for performing lemmatization in Python. Some popular examples include NLTK, spaCy, and Gensim. To use these libraries for lemmatization, you will typically first need to tokenize the text into individual words and then apply the lemmatization function to each token.
Stemming reduces a word to its root form, typically by removing prefixes and suffixes. This is a common technique used in natural language processing (NLP) and information retrieval applications, where it can help reduce the complexity and noise in the text data and make it easier to work with.
For example, the words “aggressively,” “aggressiveness,” and “aggressor” can all be stemmed to “aggress,” the root form of these words. Similarly, the words “universities” and “university” can both be stemmed to “univers.”
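As a rough illustration, here is a minimal sketch using NLTK's PorterStemmer (stemmer output depends on the algorithm used, so the results may differ slightly from the examples above):
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# The Porter algorithm strips common suffixes, producing roots that are
# not necessarily valid dictionary words
for word in ["aggressively", "aggressiveness", "universities", "university"]:
    print(word, "-->", stemmer.stem(word))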
Both lemmatization and stemming reduce words to a base or root form. However, there are some important differences between the two approaches.
Stemming typically involves removing prefixes and suffixes from a word, and sometimes modifying its internal structure, to reduce it to its root form. This can be a simple and efficient way to normalize text, but it often produces words that are not valid or meaningful. For example, “aggressively” might be stemmed to “aggress,” which is not a valid English word.
Lemmatization, on the other hand, considers the context and meaning of a word and tries to convert it to a form that is more meaningful and easier to work with. This typically involves using a vocabulary and morphological analysis of the words to identify the lemma of each word. For example, the words “was,” “is,” and “will be” can all be lemmatized to the word “be,” and the words “better” and “best” can be lemmatized to the word “good.”
In general, lemmatization is more sophisticated and accurate than stemming but can also be more computationally expensive. Whether to use stemming, lemmatization, or a combination of both depends on your application’s specific requirements and goals.
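To make the difference concrete, here is a minimal sketch comparing NLTK's Porter stemmer with its WordNet lemmatizer (assuming NLTK and its WordNet data are installed; note the lemmatizer needs a part-of-speech hint to map “better” to “good”):
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# The stemmer applies suffix-stripping rules; the lemmatizer looks words
# up in the WordNet vocabulary, guided by a part-of-speech tag
print(stemmer.stem("was"))                      # wa (not a valid word)
print(lemmatizer.lemmatize("was", pos="v"))     # be ("v" = verb)
print(stemmer.stem("better"))                   # better (no rule applies)
print(lemmatizer.lemmatize("better", pos="a"))  # good ("a" = adjective)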
Lemmatization is a technique for reducing words to their base form, or lemma. This can be useful in many natural language processing (NLP) and information retrieval applications, where it can improve the accuracy and performance of text analysis and search algorithms.
Some practical applications of lemmatization include:
- Search and information retrieval, where matching a query term like “university” against “universities” improves recall.
- Text classification and other text analysis tasks, where mapping words to a common base form reduces the dimensionality of the data.
- Pre-processing pipelines more generally, where normalized tokens reduce noise for downstream models.
Overall, lemmatization can be valuable in many NLP and information retrieval applications: it helps reduce the complexity and noise in text data and improves the performance and accuracy of text analysis and search algorithms.
You can use one of the many natural language processing (NLP) libraries available to perform lemmatization in Python. Some popular examples include NLTK, spaCy, and Gensim.
Note that you will need to first install NLTK and download its WordNet data, along with the “punkt” tokenizer models used by word_tokenize, before running this example. You can do this by running the following commands in your Python interpreter:
import nltk
nltk.download('wordnet')
nltk.download('punkt')
Here is an example of how you can use NLTK for lemmatization in Python:
import nltk
from nltk.stem import WordNetLemmatizer
# Define a text string
text = "This is a sample text. It contains some words that we can use for lemmatization."
# Tokenize the text into individual words
tokens = nltk.word_tokenize(text)
# Create a WordNetLemmatizer object
lemmatizer = WordNetLemmatizer()
# Lemmatize each word and print the result
for token in tokens:
    lemma = lemmatizer.lemmatize(token)
    print(token, "-->", lemma)
In this example, the WordNetLemmatizer class from NLTK lemmatizes each word in the text and prints the result. For example, the word “words” will be lemmatized to “word.” Note, however, that lemmatize() assumes each word is a noun unless told otherwise, so “contains” is left unchanged by default.
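A short sketch of this behaviour, passing pos="v" to tell the lemmatizer that “contains” is a verb:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# By default, lemmatize() assumes the word is a noun and leaves "contains" alone
print(lemmatizer.lemmatize("contains"))           # contains
# With a part-of-speech tag ("v" for verb), it finds the expected lemma
print(lemmatizer.lemmatize("contains", pos="v"))  # contain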
Alternatively, you can use the spaCy library for lemmatization in Python.
First, install spaCy and download its small English model, en_core_web_sm, before running this example. You can do this by running the following commands on the command line:
pip install spacy
python -m spacy download en_core_web_sm
Here is the example Python code:
import spacy
# Define a text string
text = "This is a sample text. It contains some words that we can use for lemmatization."
# Load the English language model in spaCy
nlp = spacy.load('en_core_web_sm')
# Create a Doc object
doc = nlp(text)
# Lemmatize each token and print the result
for token in doc:
    lemma = token.lemma_
    print(token.text, "-->", lemma)
In this example, the lemma_ property of each token in the spaCy Doc object contains the lemma of the word. For example, the word “contains” will be lemmatized to “contain,” and the word “words” will be lemmatized to “word.”
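If you need the normalized text back as a single string, for example to feed a search index, you can join the lemmas together. A minimal sketch, assuming the en_core_web_sm model is installed:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("This is a sample text.")

# Rebuild a normalized string from each token's lemma
normalized = " ".join(token.lemma_ for token in doc)
print(normalized)  # e.g. "this be a sample text ."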
You can also use the Gensim library for lemmatization in Python, although note that its lemmatize() function depends on the pattern package and was removed in Gensim 4.0, so this example only works with Gensim 3.x or earlier. Here is an example:
import gensim
from gensim.utils import lemmatize
# Define a text string
text = "This is a sample text. It contains some words that we can use for lemmatization."
# Use the lemmatize() function to lemmatize the text
lemmas = lemmatize(text, stopwords=['is', 'it', 'we'])
# Print the result (a list of utf8-encoded byte strings in word/POS-tag format)
print(lemmas)
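If you are on Gensim 4.0 or newer, where lemmatize() no longer exists, a rough equivalent can be built with spaCy instead. The lemmatize_text helper below is purely illustrative, mimicking Gensim's “word/POS-tag” output:
import spacy

# Assumes the en_core_web_sm model is installed (see the spaCy example above)
nlp = spacy.load('en_core_web_sm')

def lemmatize_text(text, stopwords=frozenset()):
    # Keep alphabetic tokens that are not stopwords, in "lemma/POS-tag" format
    doc = nlp(text)
    return [f"{token.lemma_}/{token.tag_}"
            for token in doc
            if token.is_alpha and token.lower_ not in stopwords]

text = "This is a sample text. It contains some words that we can use for lemmatization."
print(lemmatize_text(text, stopwords={'is', 'it', 'we'}))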
Lemmatization has limitations, such as its computational complexity and the need for an extensive vocabulary and morphological analysis of the words. Sometimes, these limitations make lemmatization impractical or unsuitable for your application.
If you are looking for alternative approaches to lemmatization, some common options include:
- Stemming, which is faster and simpler but less precise, as discussed above (see the sketch after this list).
- Lighter-weight normalization, such as lowercasing and stripping punctuation, which avoids morphological analysis altogether.
- Skipping normalization entirely and letting downstream models work with the raw tokens.
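For example, when throughput matters more than linguistic accuracy, a cheap stemmer can stand in for a lemmatizer. A minimal sketch using NLTK's SnowballStemmer:
from nltk.stem import SnowballStemmer

# A fast, vocabulary-free alternative to lemmatization for large corpora
stemmer = SnowballStemmer("english")

for word in ["running", "connected", "generously"]:
    print(word, "-->", stemmer.stem(word))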
Ultimately, the approach chosen will depend on your application’s specific requirements and goals. Therefore, it may be necessary to experiment with different methods and techniques to find the best solution for your needs.
At Spot Intelligence, we use lemmatization in many of our pre-processing pipelines. However, depending on the size of the data being processed, this isn’t always a viable option, so we also use the alternatives. This shows that there isn’t a one-size-fits-all solution in NLP but rather a variety of tools used in conjunction with one another.
What does a typical NLP pipeline look like for you? Do you use lemmatization or its alternatives? Let us know in the comments below.