Top 3 Ways To Implement Stemming In Python With Advantages, Disadvantages, Different Algorithms & Alternatives

by Neri Van Otten | Dec 14, 2022 | Natural Language Processing

What is stemming?

Stemming is the process of reducing a word to its base or root form. For example, the stem of the word “running” is “run,” and the stem of the word “swimming” is “swim.” Stemming is often used in natural language processing tasks to standardise the text and improve the performance of algorithms. In addition, it can help group words with the same meaning, even if they have different forms. For example, the words “run,” “running,” and “runs” would all be reduced to the stem “run,” which would make it easier for an algorithm to identify that they are related.
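To make the idea concrete, here is a minimal sketch of a rule-based suffix stripper. It is a toy written for this post, not one of the published algorithms: the suffix list and the `toy_stem` function are hypothetical, and real stemmers use far more rules.

```python
# Toy suffix-stripping stemmer: illustrative only, not a published algorithm.
SUFFIXES = ("ing", "es", "s")  # hypothetical rule list, checked in this order


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, then undouble a trailing consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undouble a final consonant: "runn" -> "run", "swimm" -> "swim".
    # Vowels and l, s, z are left alone so "see" and "fall" stay intact.
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeioulsz":
        word = word[:-1]
    return word


for w in ["run", "running", "runs", "swimming"]:
    print(w, "->", toy_stem(w))
```

Even this small rule set maps “run,” “running,” and “runs” to the same stem, which is the grouping effect described above. It also breaks easily on irregular words, which is why the established algorithms covered later are a better choice in practice.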

Use case

Stemming is used in the search function of blogs.

One potential use case for stemming in a blog post is to improve the search functionality of the blog. This can help users find the content they’re looking for more quickly and improve the overall user experience of the blog.

To illustrate this use case, let’s say that a user searches for posts on the topic of “running.” Without stemming, the search engine would only return results for posts that contain the exact word “running.” However, with stemming, the search engine would also return results for posts that contain the words “run” and “runs” because they all have the same base form: “run.”

This can be especially useful for a blog that covers a wide range of topics, as it can help users find relevant content even if they don’t use the exact wording of the post’s title or content. It can also help users who may not be familiar with all of the different variations of a word, as they can still find the content they’re looking for by using the base form of the word.

Overall, incorporating stemming into a blog’s search functionality can help users find the content they’re looking for more quickly and improve their experience on the blog.
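The search use case can be sketched in a few lines. This is a minimal illustration rather than a production search engine: the `posts` dictionary, the `tokenize` helper, and the `search` function are hypothetical names made up for this example, and the stemmer is NLTK's PorterStemmer.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical blog posts: post id -> text
posts = {
    1: "Running tips for beginners",
    2: "She runs every morning",
    3: "Baking bread at home",
}


def tokenize(text):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


# Inverted index: stem -> set of post ids that contain a word with that stem
index = defaultdict(set)
for post_id, text in posts.items():
    for token in tokenize(text):
        index[stemmer.stem(token)].add(post_id)


def search(query):
    """Return ids of posts that contain every query term, compared by stem."""
    stems = [stemmer.stem(token) for token in tokenize(query)]
    if not stems:
        return set()
    return set.intersection(*(index.get(stem, set()) for stem in stems))


print(sorted(search("running")))
# prints: [1, 2]
```

Because both the posts and the query are stemmed with the same stemmer, a search for “running” also matches the post that only contains “runs.”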

Advantages of stemming

There are several advantages to using stemming in natural language processing tasks. One of the main advantages is that it can help improve the performance of algorithms by reducing the number of unique words that need to be processed. This can make the algorithm run faster and more efficiently.

Another advantage of stemming is that it can help group together words with the same meaning, even if they have different forms. This can be useful in tasks like document classification, where it is vital to identify the key topics or themes in a document.

Stemming also helps reduce the size of the vocabulary that needs to be processed. Reducing words to their base forms makes it possible to reduce the number of unique words in a document, making it easier to analyse and understand.
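As a rough illustration of the vocabulary reduction, the sketch below counts unique tokens before and after stemming. The sample text is made up, and the tokenization is deliberately crude.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical sample text
text = "He runs daily. She is running now. They run and jump, jumping over jumps."

# Crude tokenization: split on whitespace, strip punctuation, lowercase
tokens = [token.strip(".,").lower() for token in text.split()]

vocabulary = set(tokens)
stemmed_vocabulary = {stemmer.stem(token) for token in tokens}

print(len(vocabulary), "unique tokens before stemming")
print(len(stemmed_vocabulary), "unique stems after stemming")
```

On a real corpus the reduction depends on the language, the text, and the stemmer, so it is worth measuring on your own data.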

Finally, by standardising the text, stemming can make it easier to work with. Reducing all words to their base forms makes it easier to compare and analyse the text, which can be helpful in tasks like sentiment analysis or topic modelling.

Disadvantages of stemming

Overstemming and understemming are two problems that can arise in stemming.

Overstemming occurs when a stemmer strips too much from a word, so that words with different meanings are reduced to the same stem. For example, the Porter stemmer reduces “universe,” “university,” and “universal” to the same stem, “univers,” even though the words are not interchangeable.

Understemming occurs when a stemmer strips too little, so that words that share a meaning end up with different stems. For example, “alumnus” and “alumni” may be reduced to different stems even though they are forms of the same word.

Both overstemming and understemming can lead to poor performance of natural language processing algorithms, so it is vital to use a stemmer that can strike a balance between these two problems.
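A quick way to check a stemmer for these two problems is to stem groups of words and see which ones collapse together. The sketch below uses NLTK's PorterStemmer; the `stems` helper and the word lists are just examples chosen for this post. Here, overstemming means distinct words collapsing to one stem, and understemming means related forms keeping different stems.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stems(words):
    """Map each word to its stem so collisions are easy to spot."""
    return {word: stemmer.stem(word) for word in words}


# Overstemming check: words with different meanings that share one stem
print(stems(["universe", "university", "universal"]))

# Understemming check: forms of the same word that keep different stems
print(stems(["alumnus", "alumni"]))
```

If the first group maps to a single stem and the second group maps to two, the stemmer shows both problems on these examples.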

Another disadvantage of stemming is that it does nothing to resolve ambiguity, and it can make it worse by collapsing different senses and parts of speech into one stem. For example, the term “run” can have many different meanings, such as to move quickly on foot, to operate or control, or to be in charge of. Once “running,” “runs,” and “run” have all been reduced to “run,” it can be difficult for an algorithm to determine which meaning is intended in a particular context.

Finally, stemming can also lose important information about the structure of words. For example, “teaching” would be stemmed to “teach,” which discards the information that the original word may have been used as a noun (as in “the teaching of mathematics”) rather than a verb. This can make it difficult for algorithms to analyse the structure and meaning of sentences accurately.

Algorithms

Many different stemming algorithms have been developed for natural language processing tasks. Some common stemming algorithms include:

  • Porter stemmer: This algorithm is based on a set of rules applied to words to reduce them to their base form. It is one of the most widely used stemming algorithms and is implemented in the NLTK library for Python.
  • Snowball stemmer: Also known as Porter2, this algorithm is a revised version of the Porter stemmer written by the same author. It fixes some inconsistencies in the original and is available for several languages besides English. It is implemented in the NLTK library for Python.
  • Lancaster stemmer: Also known as the Paice/Husk stemmer, this algorithm uses its own iterative rule table rather than the Porter rules and is considerably more aggressive, often producing very short stems. It is implemented in the NLTK library for Python.

These are just a few examples of stemming algorithms. Many other algorithms have been developed for this purpose, and new ones are continually being developed and improved.
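To see how the three algorithms differ, you can run them side by side. This is a small comparison sketch using the NLTK classes; the word list is arbitrary, and the outputs are left for you to inspect rather than listed here because they can differ between NLTK versions.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),  # Snowball needs a language name
    "lancaster": LancasterStemmer(),
}

# Arbitrary example words
words = ["running", "generously", "maximum", "organization"]

for word in words:
    row = {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    print(word, row)
```

In general, expect Lancaster to produce the shortest stems, which is one reason Porter or Snowball is often the safer default.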

Alternatives

One alternative to stemming is lemmatization. This is the process of reducing a word to its dictionary form (its lemma), but unlike stemming, lemmatization takes into account the part of speech of the word, and in some implementations its context, which can produce more accurate results. For example, “teaching” used as a noun would be lemmatized to “teaching,” while “teaching” used as a verb would be lemmatized to “teach.” A stemmer reduces both to “teach” and loses this distinction.

Another alternative to stemming is a dictionary-based approach, where words are mapped to their base forms using a pre-defined dictionary. This can produce more accurate results than stemming, but it can also be more time-consuming and require more resources.
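A dictionary-based approach can be sketched with a plain lookup table. The table below is a tiny hypothetical example keyed on the word and its part of speech; a real system would load a much larger resource, and `lookup_base_form` is a name made up for this illustration.

```python
# Hypothetical lookup table: (word, part of speech) -> base form
LEMMA_TABLE = {
    ("running", "VERB"): "run",
    ("ran", "VERB"): "run",
    ("teaching", "VERB"): "teach",
    ("teaching", "NOUN"): "teaching",
}


def lookup_base_form(word, pos):
    """Return the base form from the table, or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(lookup_base_form("ran", "VERB"))        # prints: run
print(lookup_base_form("teaching", "NOUN"))   # prints: teaching
print(lookup_base_form("teaching", "VERB"))   # prints: teach
print(lookup_base_form("blogging", "VERB"))   # prints: blogging (not in the table)
```

The lookup handles irregular forms such as “ran,” which rule-based stemmers miss, but any word missing from the table is returned unchanged. That is the trade-off between accuracy and the effort of maintaining the dictionary.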

Another approach that can be used in some cases is to use synonyms or related words to group words together rather than reducing them to their base forms. This can help retain more of the original information in the words. However, it can also be more complex to implement and may only sometimes produce the desired results.

Implementations in Python

1. NLTK stemming

In the NLTK library for Python, the PorterStemmer class can be used to perform stemming. Here is an example of using the PorterStemmer to stem a list of words:

from nltk.stem import PorterStemmer 

stemmer = PorterStemmer() 

words = ["run", "running", "ran"] 

stemmed_words = [stemmer.stem(word) for word in words] 

print(stemmed_words) 
# prints: ['run', 'run', 'ran']

In this example, the PorterStemmer is first imported from the nltk.stem module. Then, an instance of the PorterStemmer class is created. Next, a list of words is defined. Finally, the stem method is called on each word in the list, and the resulting list of stemmed words is printed.

2. SpaCy stemming

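The spaCy library for Python does not ship a stemmer. Its closest equivalent is lemmatization, which is exposed through the lemma_ attribute of each token and uses the model's part-of-speech information to map a word to its dictionary form. The example below assumes the en_core_web_sm model has been installed (for example with python -m spacy download en_core_web_sm). Here is an example of using the lemma_ attribute on a list of words:

import spacy

nlp = spacy.load("en_core_web_sm")

words = ["run", "running", "ran"]

# Each word is processed as its own one-token document
lemmas = [nlp(word)[0].lemma_ for word in words]

print(lemmas)
# typical output: ['run', 'run', 'run'] (exact lemmas can vary with the model version)

In this example, the en_core_web_sm model is first loaded using the spacy.load method. Then, a list of words is defined. Finally, the lemma_ attribute is accessed for each word in the list, and the resulting list of lemmas is printed. Note that, unlike the stemmers, the lemmatizer can map the irregular form “ran” to “run.”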

3. Gensim stemming

In the Gensim library for Python, stemming is available through the PorterStemmer class in the gensim.parsing.porter module. (The gensim.parsing.preprocessing module also provides a stem_text function that lowercases and stems a whole string.) Here is an example of using the PorterStemmer class to stem a list of words:

from gensim.parsing.porter import PorterStemmer

stemmer = PorterStemmer()

words = ["run", "running", "ran"]

stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)
# prints: ['run', 'run', 'ran']

In this example, the PorterStemmer class is first imported from the gensim.parsing.porter module. Then, an instance of the class is created. Next, a list of words is defined. Finally, the stem method is called on each word in the list, and the resulting list of stemmed words is printed.

Key Takeaways

Stemming is one of the key NLP techniques to know, and for good reason: it reduces the size of the vocabulary and helps mitigate the curse of dimensionality.

It’s an easy and fast technique to implement and often sits at the beginning of an NLP pipeline. Depending on whether you need speed or accuracy, you can choose between the different stemming algorithms and lemmatization. See our blog post on lemmatization for some further reading and code examples.

NLTK and Gensim both have ready-to-use stemmers, and spaCy has a ready-to-use lemmatizer, which makes it easy to get started analysing your text straight away.

Do you frequently use stemming in your projects? Or do you prefer the alternatives? Let us know in the comments!

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.
