Stop Words — Advantages, Disadvantages And How To Get Started

Dec 10, 2022 | Natural Language Processing

Stop words are commonly used words that carry very little meaning on their own, such as “a”, “an”, “the”, or “in”. They are typically excluded from natural language processing (NLP) and information retrieval applications because they contribute little to the meaning or context of the text.

Stop word removal filters common words out of the text.

In many NLP and information retrieval applications, stop words are filtered out of the text data before further processing. This reduces the dimensionality of the data and makes the algorithms more efficient and effective. For example, removing stop words from a document can help a text classification algorithm focus on the most important and relevant words when assigning the document to a category or label.
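
As a concrete illustration, here is a minimal sketch (assuming scikit-learn is installed; the two toy documents are purely illustrative) of how the vocabulary of a vectoriser used for text classification shrinks once stop words are filtered out:

from sklearn.feature_extraction.text import CountVectorizer 

docs = ["the cat sat on the mat", "a dog chased the cat"] 

# Without stop word filtering, every word becomes a feature 
print(sorted(CountVectorizer().fit(docs).vocabulary_)) 

# With the built-in English stop word list, words like "the" and "on" are dropped 
print(sorted(CountVectorizer(stop_words='english').fit(docs).vocabulary_)) 
# e.g. ['cat', 'chased', 'dog', 'mat', 'sat']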

There are many stop word lists available, and which one to use will depend on the language and domain of your text data. Common English stop words include:

  • articles (a, an, the)
  • conjunctions (and, but, or)
  • prepositions (in, on, at)
  • pronouns (he, she, it, they)
  • auxiliary verbs (is, are, was, were)

Keep in mind that these words are not always meaningless or irrelevant. In some cases, including or excluding stop words can change the meaning or context of the text and may impact the performance of NLP and information retrieval algorithms. It is therefore essential to consider carefully which stop words to remove and how to handle them in your application.

Advantages and disadvantages of removing stop words

Advantages

There are both advantages and disadvantages to removing stopwords from text data. Some of the benefits of stopword removal include the following:

  • Reducing the text data size can make it more manageable and faster to process.
  • Improving the performance of natural language processing algorithms by reducing the number of irrelevant words that the algorithm needs to process.
  • Improving the interpretability of the results by removing words that do not carry much meaning.

Disadvantages

However, there are also some disadvantages to stopword removal, including:

  • The possibility of losing important information by removing words that are significant in a specific context; removing a negation such as “not”, for example, can invert the sentiment of a sentence (see the sketch after this list).
  • The subjectivity of choosing which words to include in the stopword list can affect the results of any downstream tasks.
  • The need to maintain and update the stopword list as the language and domain evolve.
  • Good stop word lists can be hard to find for some languages, so the technique may not scale as more languages need to be processed.
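
To see the first point concretely: NLTK’s English list includes negations such as “not”, so naive removal can silently flip the sentiment of a sentence. A minimal sketch (assuming NLTK and its stopwords corpus are installed):

import nltk 
from nltk.corpus import stopwords 

nltk.download('stopwords') 

stop_words = set(stopwords.words('english')) 

review = "the film was not good" 
print(' '.join(w for w in review.split() if w not in stop_words)) 
# Output: "film good" (the negation disappears and the sentiment flips)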

Overall, whether or not to remove stopwords depends on the specific task and the desired outcome. In some cases, stopword removal can be beneficial, but in other cases, it may be better to keep the stopwords in the text data.

Remove stop words with Python

NLTK stop words

To remove stop words with Python, you can use a pre-built list of stop words from a library such as NLTK, or create your own list.

Here is an example of how to remove stopwords using NLTK:

import nltk 
from nltk.corpus import stopwords 

nltk.download('stopwords') 

# Create a set of stop words 
stop_words = set(stopwords.words('english')) 

# Define a function to remove stop words from a sentence 
def remove_stop_words(sentence): 
  # Split the sentence into individual words 
  words = sentence.split() 
  
  # Use a list comprehension to remove stop words 
  filtered_words = [word for word in words if word not in stop_words] 
  
  # Join the filtered words back into a sentence 
  return ' '.join(filtered_words)

In this example, the NLTK library is imported, and the stopwords.words function is used to create a set of stop words in English. Then, a function called remove_stop_words is defined, which takes a sentence as input and splits it into individual words. A list comprehension is used to remove any words that are in the stopword set, and the filtered words are joined back into a sentence and returned.

To use this function, you can simply call it on a sentence, and it will return the sentence with the stopwords removed. For example:

sentence = "This is an example sentence with stopwords." 

filtered_sentence = remove_stop_words(sentence) 
print(filtered_sentence) 
# Output: "example sentence stopwords."

In this case, the stop words “is”, “an”, and “with” are removed from the input sentence. Note that “This” survives because the membership check is case-sensitive; lowercasing each word first (word.lower()) would remove it as well.

spaCy stop words

To remove stopwords with spaCy, you can use the spacy.lang.en.stop_words.STOP_WORDS attribute to get a set of stopwords in English, and then use the token.is_stop attribute to check if a token is a stop word. Here is an example of how to remove stopwords using spaCy:

import spacy 
nlp = spacy.load('en_core_web_sm') 

# Create a set of stop words 
stop_words = spacy.lang.en.stop_words.STOP_WORDS 

# Define a function to remove stop words from a sentence 
def remove_stop_words(sentence): 
  # Parse the sentence using spaCy 
  doc = nlp(sentence) 
  
  # Use a list comprehension to remove stop words 
  filtered_tokens = [token for token in doc if not token.is_stop] 
  
  # Join the filtered tokens back into a sentence 
  return ' '.join([token.text for token in filtered_tokens])

In this example, spaCy is used to parse the sentence and identify the individual tokens. A list comprehension removes any tokens flagged as stop words, and the remaining tokens are joined back into a sentence and returned.

To use this function, you can simply call it on a sentence and it will return the sentence with the stopwords removed. For example:

sentence = "This is an example sentence with stop words." 

filtered_sentence = remove_stop_words(sentence) 
print(filtered_sentence) 
# Output: "example sentence stop words."

In this case, the tokens “This”, “is”, “an”, and “with” are removed from the input sentence. Unlike the split()-based NLTK example above, spaCy flags stop words regardless of capitalisation, so “This” is removed too. The full stop survives as its own token, which is why a space appears before it in the joined output.

Gensim stop words

To remove stop words using the gensim library, you can use the gensim.parsing.preprocessing.remove_stopwords function. This function takes a string as input and returns the string with gensim’s built-in stop words removed.

Here is an example of how to use this function to remove stop words from a string:

from gensim.parsing.preprocessing import remove_stopwords 

# Define a sentence as a plain string 
sentence = "the quick brown fox jumps over the lazy dog" 

# Remove the stop words 
filtered_sentence = remove_stopwords(sentence) 

# Print the filtered sentence 
print(filtered_sentence)
# Output: "quick brown fox jumps lazy dog"

As you can see, “the” and “over” have been removed. Note that gensim applies its own built-in stop word list (the frozen set gensim.parsing.preprocessing.STOPWORDS); if you need a custom list, you can filter against your own set instead, as shown in the next section.

Create a domain-specific stop words list

It can be incredibly useful to create your own domain-specific list of irrelevant words. For example, when analysing social media content, you will come across many irrelevant character sequences that you may not wish to analyse. Think of “RT”, which marks a re-tweet on Twitter. You could add “RT” to a domain-specific stop word list and remove such tokens automatically, as in the sketch below.
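
Here is a minimal sketch of that idea, building on NLTK’s English list (“rt” and “via” are example Twitter-specific additions, not part of any standard list):

from nltk.corpus import stopwords 

# Start from the standard English list and add domain-specific tokens 
domain_stop_words = set(stopwords.words('english')) 
domain_stop_words.update({"rt", "via"}) 

tweet = "RT via @user this is the best product ever" 
filtered = [w for w in tweet.split() if w.lower() not in domain_stop_words] 
print(' '.join(filtered)) 
# Output: "@user best product ever"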

To build a domain-specific list, you need to identify the most common words in your domain or subject area that carry little meaning for your task. This usually means looking at a large sample of text from your domain and finding the words used most often. Once you have identified these words, you can add them to your stop word list.
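
A simple way to surface candidate words is to count frequencies across a sample of your documents. A minimal sketch (the three snippets in corpus are made-up illustrations):

from collections import Counter 

corpus = [ 
    "patient reports mild headache", 
    "patient denies chest pain", 
    "patient reports no pain", 
] 

# Count how often each word appears across the corpus 
counts = Counter(word.lower() for doc in corpus for word in doc.split()) 

# The most frequent words are candidates for a domain-specific stop word list 
print(counts.most_common(3)) 
# e.g. [('patient', 3), ('reports', 2), ('pain', 2)]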

To create your own stopword list in Python, you can simply define a list of strings containing the stopwords that you want to use. For example:

stop_words = ["a", "an", "the", "and", "but", "or", "because", "as", "until", "while", "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before", "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again", "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than", "too", "very", "can", "will", "just"]

You can then use this list in your code to remove them from text data. For example, if you have a string containing some text, you can use the .split() method to split the string into a list of words, and then use a for loop to iterate over the list of words and remove any words that are in the stop word list:

# Define a string containing some text 
text = "The quick brown fox jumps over the lazy dog." 

# Split the string into a list of words 
words = text.split() 

# Create a new list to hold the filtered words 
filtered_words = [] 

# Iterate over the list of words 
for word in words: 
  # Lowercase the word so that "The" matches "the" in the stop word list 
  if word.lower() not in stop_words: 
    filtered_words.append(word) 
    
# Print the filtered list of words 
print(filtered_words)
# Output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog.']

As you can see, the words from the list have been removed from the text. Notice that “dog.” kept its trailing full stop: str.split() does not separate punctuation from words. You can use this technique with any list of stop words, whether it is the default list from one of the libraries above or your own.
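
One way to avoid that stray punctuation is to tokenise with a simple regular expression before filtering. A hedged sketch (it reuses the stop_words list defined above):

import re 

text = "The quick brown fox jumps over the lazy dog." 

# Extract word characters only, so punctuation is not glued to the words 
words = re.findall(r"[A-Za-z']+", text) 

filtered_words = [w for w in words if w.lower() not in stop_words] 
print(filtered_words) 
# Output: ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']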

Key Takeaways

  • Stop word removal is one of the top 10 most helpful NLP techniques. It removes insignificant words from the data set you are working with, and it can improve both the performance of your machine learning model and the interpretability of your results.
  • The disadvantages are that you could lose vital information, that it is not straightforward to decide which words to remove, and that the list needs to be maintained over time.
  • The technique can be hard to scale across multiple languages, as you must maintain a separate list for each language used.

At Spot Intelligence, this is also one of our favourite techniques, as it lets us analyse a new data set quickly and run it through machine learning models while keeping the interpretability that more advanced techniques like word and sentence embeddings tend to sacrifice.

What are your favourite pre-processing techniques to use in NLP? Let us know in the comments.
