Top 3 Ways To Implement Stemming In Python With Advantages, Disadvantages, Different Algorithms & Alternatives

by | Dec 14, 2022 | Natural Language Processing

What is stemming?

Stemming is the process of reducing a word to its base or root form. For example, the stem of the word “running” is “run,” and the stem of the word “swimming” is “swim.” Stemming is often used in natural language processing tasks to standardise the text and improve the performance of algorithms. In addition, it can help group words with the same meaning, even if they have different forms. For example, the words “run,” “running,” and “runs” would all be reduced to the stem “run,” which would make it easier for an algorithm to identify that they are related.

Use case

Stemming is used in the search function of blogs.

One potential use case for stemming in a blog post is to improve the search functionality of the blog. This can help users find the content they’re looking for more quickly and improve the overall user experience of the blog.

To illustrate this use case, let’s say that a user searches for posts on the topic of “running.” Without stemming, the search engine would only return results for posts that contain the exact word “running.” However, with stemming, the search engine would also return results for posts that contain the words “run” and “runs” because they all have the same base form: “run.”

This can be especially useful for a blog that covers a wide range of topics, as it can help users find relevant content even if they don’t use the exact wording of the post’s title or content. It can also help users who may not be familiar with all of the different variations of a word, as they can still find the content they’re looking for by using the base form of the word.

Overall, incorporating stemming into a blog’s search functionality can help users find the content they’re looking for more quickly and improve their experience on the blog.
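The search use case above can be sketched in a few lines. This is a minimal illustration only: the toy_stem function is a hypothetical, deliberately crude suffix stripper (not the Porter algorithm), and the posts dictionary, build_index and search are made-up names for the example.

```python
from collections import defaultdict


def toy_stem(word):
    """Crude suffix stripper for illustration only (NOT a real stemmer)."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo consonant doubling, e.g. "runn" -> "run"
    if len(word) >= 4 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]
    return word


# Hypothetical blog posts, keyed by post id
posts = {
    1: "Running tips for beginners",
    2: "She runs every morning",
    3: "Why I run marathons",
    4: "Baking bread at home",
}


def build_index(posts):
    """Map each stem to the set of post ids that contain it."""
    index = defaultdict(set)
    for post_id, text in posts.items():
        for token in text.split():
            index[toy_stem(token)].add(post_id)
    return index


def search(index, query):
    """Stem the query the same way as the posts, then look it up."""
    return sorted(index.get(toy_stem(query), set()))


index = build_index(posts)
results = search(index, "running")
print(results)
# prints: [1, 2, 3]
```

The important design point is that the query and the indexed text go through the same stemmer, so “running,” “runs,” and “run” all land on the same index key. In practice you would swap toy_stem for a real stemmer such as the ones shown later in this post.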

Advantages of stemming

There are several advantages to using stemming in natural language processing tasks. One of the main advantages is that it can help improve the performance of algorithms by reducing the number of unique words that need to be processed. This can make the algorithm run faster and more efficiently.

Another advantage of stemming is that it can help group together words with the same meaning, even if they have different forms. This can be useful in tasks like document classification, where it is vital to identify the key topics or themes in a document.

Stemming also helps reduce the size of the vocabulary that needs to be processed. Reducing words to their base forms makes it possible to reduce the number of unique words in a document, making it easier to analyse and understand.
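As a small sketch of this effect, the snippet below counts the unique tokens before and after stemming with NLTK’s PorterStemmer. The token list is an assumption chosen for the example; real text will shrink far less dramatically.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Five surface forms of the same underlying word
tokens = ["connect", "connected", "connecting", "connection", "connections"]

vocabulary = set(tokens)
stemmed_vocabulary = {stemmer.stem(token) for token in tokens}

print(len(vocabulary), "->", len(stemmed_vocabulary))
# prints: 5 -> 1
```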

Finally, by standardising the text, stemming can make it easier to work with. Reducing all words to their base forms makes it easier to compare and analyse the text, which can be helpful in tasks like sentiment analysis or topic modelling.

Disadvantages of stemming

Overstemming and understemming are two problems that can arise in stemming.

Overstemming occurs when a stemmer strips too much from a word, so that words with different meanings are reduced to the same stem. For example, the Porter stemmer reduces “universe,” “university,” and “universal” to the same stem, “univers,” even though the three words mean different things.

Understemming occurs when a stemmer strips too little, so that words which should share a stem end up with different ones. For example, a rule-based stemmer may leave “alumnus” and “alumni” with different stems even though they are forms of the same word.

Both overstemming and understemming can lead to poor performance of natural language processing algorithms, so it is vital to use a stemmer that can strike a balance between these two problems.
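Both problems are easy to observe with a real stemmer. The sketch below uses NLTK’s PorterStemmer; the word lists are assumptions picked for illustration. In the first group, three words with different meanings collapse to one stem (overstemming). In the second, two forms of the same word keep different stems (understemming).

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Overstemming: distinct meanings end up with a single shared stem
overstemmed = {w: stemmer.stem(w) for w in ["universe", "university", "universal"]}

# Understemming: related forms end up with different stems
understemmed = {w: stemmer.stem(w) for w in ["alumnus", "alumni"]}

print(overstemmed)
print(understemmed)
```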

Another disadvantage of stemming is that it can sometimes produce words that have multiple meanings. For example, the term “run” can have many different meanings, such as to move quickly on foot, to operate or control, or to be in charge of. When this word is stemmed, it could be difficult for an algorithm to determine which meaning is intended in a particular context.

Finally, stemming can also lose important information about the structure of words. For example, “teaching” would be stemmed to “teach,” which discards the information about whether the original word was used as a noun (“teaching is rewarding”) or as a verb form (“she is teaching”). This can make it difficult for algorithms to analyse the structure and meaning of sentences accurately.

Algorithms

Many different stemming algorithms have been developed for natural language processing tasks. Some common stemming algorithms include:

  • Porter stemmer: This algorithm is based on a set of rules applied to words to reduce them to their base form. It is one of the most widely used stemming algorithms and is implemented in the NLTK library for Python.
  • Snowball stemmer: Also known as Porter2, this algorithm is a revised version of the Porter stemmer by the same author, with refined rules and support for several languages besides English. It is implemented in the NLTK library for Python.
  • Lancaster stemmer: Also known as the Paice/Husk stemmer, this algorithm was developed at Lancaster University independently of the Porter stemmer. It uses a more aggressive set of rules, so it tends to produce shorter stems and is more prone to overstemming. It is implemented in the NLTK library for Python.

These are just a few examples of stemming algorithms. Many other algorithms have been developed for this purpose, and new ones are continually being developed and improved.
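To see how the algorithms differ, you can run the same words through each stemmer side by side. This is a minimal sketch using the three NLTK classes; the word list is an assumption, and the exact stems you see may differ between algorithms, which is the point of the comparison.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),  # Snowball needs a language name
    "lancaster": LancasterStemmer(),
}

words = ["running", "generously", "maximum"]

# Stem every word with every algorithm
comparison = {
    name: [stemmer.stem(word) for word in words]
    for name, stemmer in stemmers.items()
}

for name, stems in comparison.items():
    print(name, stems)
```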

Alternatives

One alternative to stemming is lemmatization. This is also a process of reducing a word to its base form, but unlike stemming, lemmatization takes into account the context and part of speech of the word, which can produce more accurate results. For example, a lemmatizer can map “teaching” used as a verb to “teach” while leaving “teaching” used as a noun unchanged, whereas a stemmer reduces both to “teach” and loses this distinction.

Another alternative to stemming is a dictionary-based approach, where words are mapped to their base forms using a pre-defined dictionary. This can produce more accurate results than stemming, but it can also be more time-consuming and require more resources.
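A dictionary-based approach that also respects part of speech can be sketched as a simple lookup table. Everything here is hypothetical: LEMMA_TABLE, lookup_lemma and the part-of-speech labels are made up for the example, and a real system would use a much larger dictionary plus a part-of-speech tagger.

```python
# Hypothetical hand-written dictionary: (word, part of speech) -> base form
LEMMA_TABLE = {
    ("teaching", "VERB"): "teach",
    ("teaching", "NOUN"): "teaching",
    ("ran", "VERB"): "run",
    ("mice", "NOUN"): "mouse",
}


def lookup_lemma(word, pos):
    """Return the dictionary base form, or the lowercased word if unknown."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(lookup_lemma("teaching", "VERB"))  # prints: teach
print(lookup_lemma("teaching", "NOUN"))  # prints: teaching
print(lookup_lemma("ran", "VERB"))       # prints: run
```

Unlike a rule-based stemmer, the lookup handles irregular forms such as “ran” and “mice,” but only for words that are actually in the dictionary.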

Another approach that can be used in some cases is to use synonyms or related words to group words together rather than reducing them to their base forms. This can help retain more of the original information in the words. However, it can also be more complex to implement and may not always produce the desired results.
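As a rough sketch of the synonym-based idea, related words can be mapped to a shared group label while the original tokens are kept. The SYNONYM_GROUPS mapping and group_terms function are hypothetical names for this example; in practice the mapping might come from a thesaurus or be curated by hand.

```python
# Hypothetical hand-curated mapping: word -> group label
SYNONYM_GROUPS = {
    "run": "run",
    "jog": "run",
    "sprint": "run",
    "swim": "swim",
    "paddle": "swim",
}


def group_terms(tokens):
    """Pair each original token with its group label (or itself if unknown)."""
    return [(token, SYNONYM_GROUPS.get(token.lower(), token.lower())) for token in tokens]


grouped = group_terms(["Jog", "sprint", "paddle", "read"])
print(grouped)
# prints: [('Jog', 'run'), ('sprint', 'run'), ('paddle', 'swim'), ('read', 'read')]
```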

Implementations in Python

1. NLTK stemming

In the NLTK library for Python, the PorterStemmer class can be used to perform stemming. Here is an example of using the PorterStemmer to stem a list of words:

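from nltk.stem import PorterStemmer

# Create a reusable Porter stemmer instance
stemmer = PorterStemmer()

words = ["run", "running", "ran"]

# Stem each word; irregular forms such as "ran" are left unchanged
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)
# prints: ['run', 'run', 'ran']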

In this example, the PorterStemmer is first imported from the nltk.stem module. Then, an instance of the PorterStemmer class is created. Next, a list of words is defined. Finally, the stem method is called on each word in the list, and the resulting list of stemmed words is printed.

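2. SpaCy lemmatization

The spaCy library for Python does not include a stemmer. Instead, it provides lemmatization through the lemma_ attribute of a token, which can be used in place of stemming. Here is an example of using this attribute to reduce a list of words to their lemmas:

import spacy

# Requires the model to be installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

words = ["run", "running", "ran"]

# Each word is processed as its own one-token document
lemmas = [nlp(word)[0].lemma_ for word in words]

print(lemmas)
# typically prints: ['run', 'run', 'run'] (lemmas of isolated words can vary by model version)

In this example, the en_core_web_sm model is first loaded using the spacy.load method. Then, a list of words is defined. Finally, the lemma_ attribute is accessed for each word in the list, and the resulting list of lemmas is printed. Because spaCy uses context to pick a lemma, it generally gives better results on full sentences than on isolated words.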

3. Gensim stemming

In the Gensim library for Python, stemming is performed using the stem method of the PorterStemmer class in the gensim.parsing.porter module. Here is an example of using this method to stem a list of words:

from gensim.parsing.porter import PorterStemmer

# Gensim ships its own implementation of the Porter algorithm
stemmer = PorterStemmer()

words = ["run", "running", "ran"]

stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)
# prints: ['run', 'run', 'ran']

In this example, the PorterStemmer class is first imported from the gensim.parsing.porter module. Then, an instance of the class is created. Next, a list of words is defined. Finally, the stem method is called on each word in the list, and the resulting list of stemmed words is printed.

Key Takeaways

Stemming is one of the key NLP techniques to know, and for good reason: it reduces the size of the vocabulary and helps mitigate the curse of dimensionality.

It’s an easy and fast technique to implement and often sits at the beginning of an NLP pipeline. Depending on your need for speed versus accuracy, you can choose between the different stemming algorithms and lemmatization. See our blog post on lemmatization for some further reading and code examples.

NLTK and Gensim both have ready-to-use stemmers, and SpaCy offers lemmatization as an alternative, which makes it easy to get started analysing your text straight away.

Do you frequently use stemming in your projects? Or do you prefer the alternatives? Let us know in the comments!

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.
