What is stemming?
Stemming is the process of reducing a word to its base or root form. For example, the stem of the word “running” is “run,” and the stem of the word “swimming” is “swim.” Stemming is often used in natural language processing tasks to standardise the text and improve the performance of algorithms. In addition, it can help group words with the same meaning, even if they have different forms. For example, the words “run,” “running,” and “runs” would all be reduced to the stem “run,” which would make it easier for an algorithm to identify that they are related.
Table of Contents
Use case
Stemming is used in the search function of blogs.
One potential use case for stemming in a blog post is to improve the search functionality of the blog. This can help users find the content they’re looking for more quickly and improve the overall user experience of the blog.
To illustrate this use case, let’s say that a user searches for posts on the topic of “running.” Without stemming, the search engine would only return results for posts that contain the exact word “running.” However, with stemming, the search engine would also return results for posts that contain the words “running” and “runs” because they all have the same base form: “run.”
This can be especially useful for a blog that covers a wide range of topics, as it can help users find relevant content even if they don’t use the exact wording of the post’s title or content. It can also help users who may need to become more familiar with all of the different variations of a word, as they can still find the content they’re looking for by using the base form of the word.
Overall, incorporating stemming into a blog’s search functionality can help users find the content they’re looking for more quickly and improve their experience on the blog.
Advantages of stemming
There are several advantages to using stemming in natural language processing tasks. One of the main advantages is that it can help improve the performance of algorithms by reducing the number of unique words that need to be processed. This can make the algorithm run faster and more efficiently.
Another advantage of stemming is that it can help group together words with the same meaning, even if they have different forms. This can be useful in tasks like document classification, where it is vital to identify the key topics or themes in a document.
Stemming also helps reduce the size of the vocabulary that needs to be processed. Reducing words to their base forms makes it possible to reduce the number of unique words in a document, making it easier to analyse and understand.
Finally, by standardising the text, stemming can make it easier to work with. Reducing all words to their base forms makes it easier to compare and analyse the text, which can be helpful in tasks like sentiment analysis or topic modelling.
Disadvantages of stemming
Overstemming and understemming are two problems that can arise in stemming.
Overstemming occurs when a stemmer reduces a word to its base form too aggressively, resulting in a stem that is not a valid word. For example, the word “fishing” might be overstemmed to “fishin,” which is not correct.
Understemming occurs when a stemmer reduces a word to its base form, resulting in a stem that is still an inflected form of the original word. For example, the word “fishing” might be understemmed to “fish,” which is a valid word but still not the base form of the original word.
Both overstemming and understemming can lead to poor performance of natural language processing algorithms, so it is vital to use a stemmer that can strike a balance between these two problems.
Another disadvantage of stemming is that it can sometimes produce words that have multiple meanings. For example, the term “run” can have many different meanings, such as to move quickly on foot, to operate or control, or to be in charge of. When this word is stemmed, it could be difficult for an algorithm to determine which meaning is intended in a particular context.
Finally, stemming can also lose important information about the structure of words. For example, “teaching” could stem from “teach,” which yields the critical information that the original word is a noun rather than a verb. This can make it difficult for algorithms to analyse the structure and meaning of sentences accurately.
Algorithms
Many different stemming algorithms have been developed for natural language processing tasks. Some common stemming algorithms include:
- Porter stemmer: This algorithm is based on a set of rules applied to words to reduce them to their base form. It is one of the most widely used stemming algorithms and is implemented in the NLTK library for Python.
- Snowball stemmer: This algorithm is based on the Porter stemmer but has a more aggressive set of rules for reducing words to their base form. It is implemented in the NLTK library for Python.
- Lancaster stemmer: This algorithm is based on the Porter stemmer but uses a more aggressive set of rules for reducing words to their base form. It is implemented in the NLTK library for Python.
These are just a few examples of stemming algorithms. Many other algorithms have been developed for this purpose, and new ones are continually being developed and improved.
Alternatives
One alternative to stemming is lemmatization. This is the process of reducing a word to its base form, but unlike stemming, lemmatization takes into account the context and part of speech of the word, which can produce more accurate results. For example, the term “teaching” would be lemmatized to “teach,” which retains the necessary information that it is a noun, rather than being stemmed to “teach,” which loses this information.
Another alternative to stemming is a dictionary-based approach, where words are mapped to their base forms using a pre-defined dictionary. This can produce more accurate results than stemming, but it can also be more time-consuming and require more resources.
Another approach that can be used in some cases is to use synonyms or related words to group words together rather than reducing them to their base forms. This can help retain more of the original information in the words. However, it can also be more complex to implement and may only sometimes produce the desired results.
Implementations in python
1. NLTK stemming
In the NLTK library for Python, the
PorterStemmer
class can be used to perform stemming. Here is an example of using the
PorterStemmer
to stem a list of words:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["run", "running", "ran"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
# prints: ["run", "run", "ran"]
In this example, the
PorterStemmer
is first imported from the
nltk.stem
module. Then, an instance of the
PorterStemmer
class is created. Next, a list of words is defined. Finally, the
stem
method is called on each word in the list, and the resulting list of stemmed words is printed.
2. SpaCy stemming
In the spaCy library for Python, stemming is performed using the
lemma_
attribute of a token. Here is an example of using this attribute to stem a list of words:
import spacy
nlp = spacy.load("en_core_web_sm")
words = ["run", "running", "ran"]
stemmed_words = [nlp(word)[0].lemma_ for word in words]
print(stemmed_logs)
# prints: ["run", "run", "run"]
In this example, the
en_core_web_sm
model is first loaded using the
spacy.load
method. Then, a list of words is defined. Finally, the
lemma_
attribute is accessed for each word in the list, and the resulting list of stemmed words is printed.
3. Gensim stemming
In the Gensim library for Python, stemming is performed using the
stem_text
method of the
utils.Stemmer
class. Here is an example of using this method to stem a list of words:
from gensim.utils import Stemmer
stemmer = Stemmer("english")
words = ["run", "running", "ran"]
stemmed_words = [stemmer.stem_text(word) for word in words]
print(stemmed_words)
# prints: ["run", "run", "ran"]
In this example, the
Stemmer
class is first imported from the
gensim.utils
module. Then, an instance of the
Stemmer
class is created, specifying the language as “English.” Next, a list of words is defined. Finally, the
stem_text
method is called on each word in the list, and the resulting list of stemmed words is printed.
Key Takeaways
Stemming is one of the key NLP techniques to know. And for good reason, it helps reduce text size and stops us from dealing with the curse of dimensionality.
It’s an easy and fast technique to implement and is often at the beginning of an NLP pipeline. Depending on your need for speed vs the required accuracy, we often choose between the different stemming algorithms and lemmatization. See our blog post on lemmatization for some further reading and code examples.
NLTK, SpaCy and Gensim all have ready-to-use implementations, which makes it easy to get started analysing your text straight away.
Do you frequently use stemming in your projects? Or do you prefer the alternatives? Let us know in the comments!
0 Comments