Stop words are commonly used words that carry little meaning on their own, such as “a,” “an,” “the,” or “in.” They are typically excluded from natural language processing (NLP) and information retrieval applications because they contribute little to the meaning or context of the text.
Stop word removal strips these common words out of the text before further analysis.
In many NLP and information retrieval applications, stopwords are filtered out of the text data before further processing. This reduces the dimensionality of the data and can make the algorithms more efficient and effective. For example, removing stopwords from a document can help a text classification algorithm focus on the most important and relevant words when assigning the document to a category or label.
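As a quick illustration of that dimensionality reduction, here is a minimal sketch using scikit-learn's CountVectorizer (an assumption on our part; scikit-learn is not otherwise used in this article):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "This is an example sentence with stopwords.",
    "The quick brown fox jumps over the lazy dog.",
]

# Vocabulary built with every word kept
full = CountVectorizer().fit(docs)

# Vocabulary built with scikit-learn's built-in English stopword list applied
reduced = CountVectorizer(stop_words='english').fit(docs)

# The second vocabulary is noticeably smaller
print(len(full.vocabulary_), len(reduced.vocabulary_))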
Many stopword lists are available, and the right list depends on the language and domain of the text data. Common English stopwords include words like “the,” “a,” “an,” “in,” “is,” and “with.”
Keep in mind that these words are not always meaningless or irrelevant. In some cases, including or excluding stopwords can change the meaning or context of the text and may affect the performance of NLP and information retrieval algorithms: for example, “not” appears on many stopword lists, yet removing it from “not good” flips the sentiment of the phrase. It is therefore essential to consider carefully which stopwords to use and how to use them in your application.
There are both advantages and disadvantages to removing stopwords from text data. The main benefits of stopword removal are reduced dimensionality of the text data, faster and more efficient processing, and letting algorithms focus on the words that carry the most meaning.
However, there are also disadvantages: removing stopwords discards information and can change the meaning or context of a sentence (as in the “not good” example above), which may hurt tasks such as sentiment analysis that depend on those words.
Whether or not to remove stopwords depends on the specific task and the desired outcome. In some cases, stopword removal can be beneficial, but in other cases, it may be better to keep the stopwords in the text data.
To remove stopwords with Python, you can use a pre-built list in a library such as NLTK or create your own list of stopwords.
Here is an example of how to remove stopwords using NLTK:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
# Create a set of stop words
stop_words = set(stopwords.words('english'))
# Define a function to remove stop words from a sentence
def remove_stop_words(sentence):
    # Split the sentence into individual words
    words = sentence.split()
    # Use a list comprehension to remove stop words
    # (lowercasing each word so that "This" matches "this")
    filtered_words = [word for word in words if word.lower() not in stop_words]
    # Join the filtered words back into a sentence
    return ' '.join(filtered_words)
In this example, the NLTK library is imported, and the stopwords.words function is used to create a set of English stopwords. A function called remove_stop_words is then defined, which takes a sentence as input and splits it into individual words. A list comprehension removes every word whose lowercase form is in the stopword set, and the filtered words are joined back into a sentence and returned.
To use this function, you can simply call it on a sentence, and it will return the sentence with the stopwords removed. For example:
sentence = "This is an example sentence with stopwords."
filtered_sentence = remove_stop_words(sentence)
print(filtered_sentence)
# Output: "example sentence stopwords."
In this case, the stopwords “This”, “is”, “an”, and “with” are removed from the input sentence. Because the sentence was split on whitespace, “stopwords.” keeps its trailing full stop.
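If you want punctuation handled separately, NLTK's word_tokenize splits it into its own tokens. A minimal sketch, assuming the tokenizer data has been downloaded (newer NLTK versions may name this resource punkt_tab rather than punkt); the function name remove_stop_words_tokenized is just an illustrative choice:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stop_words_tokenized(sentence):
    # word_tokenize separates punctuation into its own tokens
    tokens = word_tokenize(sentence)
    # Keep tokens whose lowercase form is not a stopword
    return ' '.join(token for token in tokens if token.lower() not in stop_words)

print(remove_stop_words_tokenized("This is an example sentence with stopwords."))
# Expected output: "example sentence stopwords ."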
To remove stopwords with spaCy, you can use the spacy.lang.en.stop_words.STOP_WORDS attribute to get a set of English stopwords, and the token.is_stop attribute to check whether a token is a stopword. Here is an example of how to remove stopwords using spaCy:
import spacy

# Load the small English pipeline (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')
# Create a set of stop words (shown here for reference; the function below relies on token.is_stop instead)
stop_words = spacy.lang.en.stop_words.STOP_WORDS
# Define a function to remove stop words from a sentence
def remove_stop_words(sentence):
    # Parse the sentence using spaCy
    doc = nlp(sentence)
    # Use a list comprehension to remove stop words
    filtered_tokens = [token for token in doc if not token.is_stop]
    # Join the filtered tokens back into a sentence
    return ' '.join([token.text for token in filtered_tokens])
In this example, spaCy is used to parse the sentence and identify the individual tokens. A list comprehension removes any tokens that are stopwords, and the filtered tokens are joined back into a sentence and returned.
To use this function, you can simply call it on a sentence, and it will return the sentence with the stopwords removed. For example:
sentence = "This is an example sentence with stop words."
filtered_sentence = remove_stop_words(sentence)
print(filtered_sentence)
# Output: "example sentence stop words ."
In this case, the words “This”, “is”, “an”, and “with” are removed from the input sentence. Note that spaCy treats the final full stop as its own token, which is why it appears with a space before it in the output.
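If you also want to drop punctuation, spaCy exposes a token.is_punct attribute alongside token.is_stop. A minimal sketch (the function name remove_stop_words_and_punct is just an illustrative choice):

import spacy

nlp = spacy.load('en_core_web_sm')

def remove_stop_words_and_punct(sentence):
    doc = nlp(sentence)
    # Drop both stopword and punctuation tokens
    return ' '.join(token.text for token in doc if not token.is_stop and not token.is_punct)

print(remove_stop_words_and_punct("This is an example sentence with stop words."))
# Expected output: "example sentence stop words"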
To remove stopwords using the Gensim library, you can use the gensim.parsing.preprocessing.remove_stopwords function. It takes a string as input and returns the string with the stopwords removed. Here is an example of how to use it on a sentence:
from gensim.parsing.preprocessing import remove_stopwords
# Define a sentence
sentence = "the quick brown fox jumps over the lazy dog"
# Remove the stop words
filtered_sentence = remove_stopwords(sentence)
# Print the filtered sentence
print(filtered_sentence)
# Output: "quick brown fox jumps lazy dog"
As you can see, “the” and “over” have been removed from the sentence. Gensim ships with a default stopword list, but you can also filter against a custom list if needed.
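A minimal sketch of one way to do that is to extend Gensim's STOPWORDS frozenset and filter against it manually (the added word “fox” is purely for illustration):

from gensim.parsing.preprocessing import STOPWORDS

# Extend Gensim's default stopword set with a custom entry
custom_stopwords = STOPWORDS.union({"fox"})

sentence = "the quick brown fox jumps over the lazy dog"

# Filter the sentence manually against the extended set
filtered = ' '.join(word for word in sentence.split() if word not in custom_stopwords)

print(filtered)
# Expected output: "quick brown jumps lazy dog"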
It can be incredibly useful to create your own domain-specific list of stopwords. For example, when analysing social media content, you will encounter many irrelevant character sequences that you may not wish to analyse. Think of “RT”, which marks a re-tweet on Twitter: you could add “RT” to a custom stopword list and remove it automatically.
To build a domain-specific list, you need to identify the most common words in your domain or subject area that carry little meaning for your task. This usually means looking at a large sample of text data from your domain and finding the words used most often, as sketched below. Once you have identified these words, you can add them to your stopword list.
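A minimal sketch of that frequency analysis, using Python's collections.Counter (the sample tweets are made up for illustration):

from collections import Counter

# A tiny sample corpus; in practice you would use a large sample from your domain
documents = [
    "RT This is a retweet about machine learning",
    "RT Another retweet about NLP",
    "Original tweet about machine learning",
]

# Count word frequencies across the whole corpus
counts = Counter(word.lower() for doc in documents for word in doc.split())

# The most frequent words are candidates for a domain-specific stopword list
print(counts.most_common(5))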
To create your own stopword list in Python, you can simply define a list of strings containing the words you want to filter out. For example:
stop_words = ["a", "an", "the", "and", "but", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "can", "will", "just"]
You can then use this list in your code to remove those words from text data. For example, if you have a string containing some text, you can use the .split() method to split it into a list of words, and then use a for loop to iterate over the words and drop any that are in the stop word list:
# Define a string containing some text
text = "The quick brown fox jumps over the lazy dog."
# Split the string into a list of words
words = text.split()
# Create a new list to hold the filtered words
filtered_words = []
# Iterate over the list of words
for word in words:
    # If the word is not in the stop word list, add it to the filtered list
    # (lowercasing so that "The" matches "the")
    if word.lower() not in stop_words:
        filtered_words.append(word)
# Print the filtered list of words
print(filtered_words)
# Output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog.']
As you can see, the words from the stop word list have been removed from the text (note that “dog.” keeps its trailing full stop, since .split() does not strip punctuation). You can use this technique with any list of stopwords, whether it is the default list from one of the libraries or your own.
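One small design note: the word in stop_words check scans the whole list on every lookup, so for longer texts it is worth converting the list to a set, which has constant-time average lookups. A sketch, reusing the stop_words list defined above:

# Convert the stop word list to a set for faster membership checks
stop_words_set = set(stop_words)

text = "The quick brown fox jumps over the lazy dog."
filtered_words = [word for word in text.split() if word.lower() not in stop_words_set]

print(filtered_words)
# Output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog.']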
At Spot Intelligence, this is also one of our favourite techniques. It allows us to quickly analyse a new data set and run it through machine learning models while keeping the interpretability that more advanced techniques, like word embeddings and sentence embeddings, tend to sacrifice.
What are your favourite pre-processing techniques to use in NLP? Let us know in the comments.