How To Get Started With Stemming: Advantages, Disadvantages And Code

by | Dec 14, 2022 | Natural Language Processing

Stemming is the process of reducing a word to its base or root form. For example, the stem of the word “running” is “run,” and the stem of the word “swimming” is “swim.” Stemming is often used in natural language processing tasks to standardise the text and improve the performance of algorithms. In addition, it can help group words with the same meaning, even if they have different forms. For example, the words “run,” “running,” and “runs” would all be reduced to the stem “run,” which would make it easier for an algorithm to identify that they are related.

Use case


Stemming is used in the search function of blogs.

One potential use case for stemming in a blog post is to improve the search functionality of the blog. This can help users find the content they’re looking for more quickly and improve the overall user experience of the blog.

To illustrate this use case, let's say that a user searches for posts on the topic of "running." Without stemming, the search engine would only return results for posts that contain the exact word "running." However, with stemming, the search engine would also return results for posts that contain the words "run" or "runs," because all three share the same base form: "run."
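To make this concrete, here is a minimal sketch of a stemmed search using NLTK's Porter stemmer. The mini "blog index" (post ids and texts) is invented for illustration; both the query and the documents are stemmed, so "running" also matches "runs."

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical mini blog index: post ids mapped to their text.
posts = {
    "post-1": "My first marathon: how I started running",
    "post-2": "The runner runs at dawn",
    "post-3": "Baking sourdough bread at home",
}

def stem_tokens(text):
    """Lowercase, split on whitespace, and stem each token."""
    return {stemmer.stem(token) for token in text.lower().split()}

def search(query):
    """Return ids of posts sharing at least one stem with the query."""
    query_stems = stem_tokens(query)
    return [post_id for post_id, text in posts.items()
            if query_stems & stem_tokens(text)]

print(search("running"))  # matches "running" and "runs" via the stem "run"
```

A real search engine would also strip punctuation and stop words, but even this toy version shows how stemming widens recall without any extra index entries.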

This can be especially useful for a blog that covers a wide range of topics, as it can help users find relevant content even if they don't use the exact wording of the post's title or content. It can also help users who may not be familiar with all of the different variations of a word, as they can still find the content they're looking for by using the base form of the word.

Overall, incorporating stemming into a blog’s search functionality can help users find the content they’re looking for more quickly and improve their experience on the blog.

Advantages of stemming

There are several advantages to using stemming in natural language processing tasks. One of the main advantages is that it can help improve the performance of algorithms by reducing the number of unique words that need to be processed. This can make the algorithm run faster and more efficiently.

Another advantage of stemming is that it can help group together words with the same meaning, even if they have different forms. This can be useful in tasks like document classification, where it is vital to identify the key topics or themes in a document.

Stemming also helps reduce the size of the vocabulary that needs to be processed. Reducing words to their base forms makes it possible to reduce the number of unique words in a document, making it easier to analyse and understand.

Finally, by standardising the text, stemming can make it easier to work with. Reducing all words to their base forms makes it easier to compare and analyse the text, which can be helpful in tasks like sentiment analysis or topic modelling.
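The vocabulary-reduction effect is easy to demonstrate with NLTK's Porter stemmer on a small invented token list:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A toy corpus with several inflected forms of the same words.
tokens = ["run", "running", "runs", "runner",
          "connect", "connected", "connection", "connections"]

vocab_before = set(tokens)                     # 8 unique surface forms
vocab_after = {stemmer.stem(t) for t in tokens}

print(len(vocab_before), len(vocab_after))
```

Eight surface forms collapse to three stems. Note that the reduction is not perfect: the Porter rules leave "runner" untouched, so it does not merge with "run."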

Disadvantages of stemming

Overstemming and understemming are two problems that can arise in stemming.

Overstemming occurs when a stemmer removes too much of a word, so that unrelated words are reduced to the same stem. For example, "universe" and "university" are both reduced to the stem "univers," even though they have quite different meanings.

Understemming occurs when a stemmer removes too little of a word, so that related words fail to reduce to a common stem. For example, "alumnus" and "alumni" are not reduced to the same stem, even though they are forms of the same word.

Both overstemming and understemming can lead to poor performance of natural language processing algorithms, so it is vital to use a stemmer that can strike a balance between these two problems.
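Both failure modes are easy to reproduce with NLTK's Porter stemmer, using a commonly cited pair of examples:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Overstemming: unrelated words collapse onto the same stem.
print(stemmer.stem("universe"), stemmer.stem("university"))  # univers univers

# Understemming: related word forms fail to reach a common stem.
print(stemmer.stem("alumnus"), stemmer.stem("alumni"))
```

The first pair ends up indistinguishable to a downstream algorithm; the second pair stays apart even though conflating them is usually what you want.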

Another disadvantage of stemming is that it can sometimes produce words that have multiple meanings. For example, the term “run” can have many different meanings, such as to move quickly on foot, to operate or control, or to be in charge of. When this word is stemmed, it could be difficult for an algorithm to determine which meaning is intended in a particular context.

Finally, stemming can also lose important information about the structure of words. For example, stemming "teaching" to "teach" discards the information that the original word was a noun (a gerund) rather than a verb. This can make it difficult for algorithms to analyse the structure and meaning of sentences accurately.

Algorithms

Many different stemming algorithms have been developed for natural language processing tasks. Some common stemming algorithms include:

  • Porter stemmer: This algorithm applies a fixed set of suffix-stripping rules to reduce words to their base form. It is one of the most widely used stemming algorithms and is implemented in the NLTK library for Python.
  • Snowball stemmer: Also known as the Porter2 stemmer, this algorithm is an improved version of the Porter stemmer that fixes several of its known quirks and supports languages other than English. It is implemented in the NLTK library for Python.
  • Lancaster stemmer: This algorithm uses an iterative set of rules and is considerably more aggressive than the Porter stemmer, which can produce very short stems. It is implemented in the NLTK library for Python.

These are just a few examples of stemming algorithms. Many other algorithms have been developed for this purpose, and new ones are continually being developed and improved.
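The differences between these algorithms are easiest to see side by side. The sketch below runs all three NLTK implementations on the same words:

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# Compare the three stemmers word by word.
for word in ["running", "fairly", "generously"]:
    print(word, "->", porter.stem(word), "/",
          snowball.stem(word), "/", lancaster.stem(word))
```

For example, Snowball stems "fairly" to "fair," while the original Porter algorithm leaves it as "fairli"; this is one of the quirks Porter2 was designed to fix.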

Alternatives

One alternative to stemming is lemmatization. This is also the process of reducing a word to its base form, but unlike stemming, lemmatization takes into account the context and part of speech of the word, which can produce more accurate results. For example, the noun "teaching" would be lemmatized to "teaching" while the verb form would be lemmatized to "teach," whereas a stemmer reduces both to "teach" and loses the distinction.

Another alternative to stemming is a dictionary-based approach, where words are mapped to their base forms using a pre-defined dictionary. This can produce more accurate results than stemming, but it can also be more time-consuming and require more resources.
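Here is a minimal pure-Python sketch of the dictionary-based approach. The lookup table is hand-built and hypothetical; a real system would need a far larger dictionary:

```python
# Hypothetical lookup table mapping inflected forms to base forms;
# anything not in the dictionary passes through unchanged.
base_forms = {
    "running": "run",
    "ran": "run",
    "runs": "run",
    "better": "good",   # irregular forms that suffix-stripping rules miss
    "geese": "goose",
}

def normalise(word):
    """Map a word to its base form via the dictionary, if known."""
    return base_forms.get(word.lower(), word)

words = ["Running", "ran", "geese", "table"]
print([normalise(w) for w in words])  # ['run', 'run', 'goose', 'table']
```

Because the mappings are explicit, this approach handles irregular forms like "better" and "geese" that rule-based stemmers cannot, at the cost of building and maintaining the dictionary.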

Another approach that can be used in some cases is to use synonyms or related words to group words together rather than reducing them to their base forms. This can help retain more of the original information in the words. However, it can also be more complex to implement and may not always produce the desired results.
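A tiny sketch of the synonym-grouping idea, with hand-picked (hypothetical) synonym sets; words keep their surface form unless a group matches:

```python
# Hypothetical synonym sets, each with one canonical label.
synonym_sets = {
    "fast": {"fast", "quick", "rapid", "speedy"},
    "big": {"big", "large", "huge"},
}

def group_label(word):
    """Return the canonical label of the word's synonym group, if any."""
    for label, synonyms in synonym_sets.items():
        if word.lower() in synonyms:
            return label
    return word  # no known group: keep the word as-is

print([group_label(w) for w in ["quick", "huge", "running"]])
# ['fast', 'big', 'running']
```

Unlike stemming, this groups words by meaning rather than by spelling, so "quick" and "rapid" can land in the same bucket even though they share no stem.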

Implementations in Python

NLTK stemming

In the NLTK library for Python, the PorterStemmer class can be used to perform stemming. Here is an example of using the PorterStemmer to stem a list of words:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["run", "running", "ran"]
stemmed_words = [stemmer.stem(word) for word in words]

print(stemmed_words)
# prints: ['run', 'run', 'ran']

In this example, the PorterStemmer is first imported from the nltk.stem module. Then, an instance of the PorterStemmer class is created. Next, a list of words is defined. Finally, the stem method is called on each word in the list, and the resulting list of stemmed words is printed.

SpaCy stemming

The spaCy library for Python does not include a stemmer; its closest equivalent is lemmatization, exposed through the lemma_ attribute of a token. Here is an example of using this attribute to reduce a list of words to their base forms:

import spacy

nlp = spacy.load("en_core_web_sm")
words = ["run", "running", "ran"]
stemmed_words = [nlp(word)[0].lemma_ for word in words]

print(stemmed_words)
# prints: ['run', 'run', 'run']

In this example, the en_core_web_sm model is first loaded using the spacy.load method. Then, a list of words is defined. Finally, the lemma_ attribute is accessed on the first token of each processed word, and the resulting list of lemmas is printed. Note that, unlike the Porter stemmer, the lemmatizer also maps the irregular form "ran" to "run."

Gensim stemming

In the Gensim library for Python, stemming is performed with the stem_text function from the gensim.parsing.preprocessing module, which lowercases its input and applies a Porter stemmer. Here is an example of using this function to stem a list of words:

from gensim.parsing.preprocessing import stem_text

words = ["run", "running", "ran"]
stemmed_words = [stem_text(word) for word in words]

print(stemmed_words)
# prints: ['run', 'run', 'ran']

In this example, the stem_text function is first imported from the gensim.parsing.preprocessing module. Then, a list of words is defined. Finally, stem_text is called on each word in the list, and the resulting list of stemmed words is printed.

Key Takeaways

Stemming is one of the key NLP techniques to know, and for good reason: it reduces the size of the text representation and helps mitigate the curse of dimensionality.

It's an easy and fast technique to implement and often sits at the beginning of an NLP pipeline. Depending on the trade-off between speed and accuracy, you can choose between the different stemming algorithms and lemmatization. See our blog post on lemmatization for further reading and code examples.

NLTK, SpaCy and Gensim all have ready-to-use implementations, which makes it easy to get started analysing your text straight away.

Do you frequently use stemming in your projects? Or do you prefer the alternatives? Let us know in the comments!

