How To Build The Right NLTK Preprocessing Pipeline

Dec 21, 2022 | Data Science, Natural Language Processing

This is a complete guide to using NLTK to build a full preprocessing pipeline. Take the time to read through the different components so that you know how to start building your own pipeline.

What is an NLTK preprocessing pipeline?

Preprocessing in Natural Language Processing (NLP) is a means to get text data ready for further processing or analysis. Most of the time, preprocessing is a mix of cleaning and normalising techniques that make the text easier to use for the task at hand.

A useful library for processing text in Python is the Natural Language Toolkit (NLTK). This guide covers 14 of the most commonly used preprocessing steps and provides code examples so you can start applying the techniques immediately.

NLTK preprocessing pipelines aren't intimidating when you understand them.

Building an NLP pipeline might seem intimidating at first but it doesn’t have to be.

After that, we will build an NLP preprocessing pipeline entirely in NLTK so that you can see how these techniques can be combined into a complete system.

Note that every application is different and will require a different preprocessing pipeline. The key is to understand the different building blocks so that you can put them together to build your own pipeline.

Why use NLTK for preprocessing?

The Natural Language Toolkit (NLTK) is a popular Python library for working with human language data (text). There are plenty of good reasons to use NLTK.

  1. Ease of use: NLTK provides a simple and intuitive interface for performing common NLP tasks such as tokenization, stemming, and part-of-speech tagging.
  2. Large collection of data and resources: NLTK includes a wide range of corpora (large collections of text data) and resources for working with them, such as lexicons, grammars, and corpora annotated with linguistic information.
  3. Support for various languages: NLTK supports several languages and provides tools for working with them, including tokenizers, stemmers, and other language-specific resources (e.g. for Arabic).
  4. Active development and community: NLTK is an actively developed library with a large and supportive community of users and contributors.
  5. Compatibility with other libraries: NLTK is compatible with other popular Python libraries for data manipulation and machine learning, such as NumPy and scikit-learn, making it easy to incorporate into larger projects.

Top 14 NLTK preprocessing steps

It’s useful to dig into the different components that can be used for preprocessing first. Once we understand these, we can build an entire pipeline. Some common NLP preprocessing steps include:

1. Tokenization

Splitting the text into individual words or subwords (tokens).

Here is how to implement tokenization in NLTK:

import nltk

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

print("Tokens:", tokens)

This will output the following list of tokens:

['Natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'deals', 'with', 'the', 'interaction', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'language', '.']

The nltk.word_tokenize() function first splits the text into sentences with the Punkt sentence tokenizer and then applies NLTK's recommended Treebank-style word tokenizer, a combination that handles punctuation and contractions well across many languages. You can also tokenize by simply splitting the text on whitespace or punctuation, but these approaches are less reliable for complex text structures and languages.
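Note that many of the snippets in this guide depend on NLTK resources that are downloaded separately. A one-off setup along the following lines is usually enough; the exact resource names can differ slightly between NLTK versions (newer releases, for example, use punkt_tab for the tokenizer models):

import nltk

# one-off downloads used by the examples in this guide
nltk.download("punkt")                       # tokenizer models
nltk.download("stopwords")                   # stop word lists
nltk.download("wordnet")                     # WordNet, used by the lemmatizer
nltk.download("averaged_perceptron_tagger")  # part-of-speech tagger model
nltk.download("maxent_ne_chunker")           # named entity chunker
nltk.download("words")                       # English word list (NER and spelling correction)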

2. Lowercasing

Converting all text to lowercase to make it case-insensitive. To lowercase the tokens in a list using NLTK, you can simply use the built-in lower() method for strings:

import nltk

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# lowercase the tokens
lowercased_tokens = [token.lower() for token in tokens]

print("Lowercased tokens:", lowercased_tokens)

This will output the following list of lowercased tokens:

Lowercased tokens: ['natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'deals', 'with', 'the', 'interaction', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'language', '.']

You can also wrap the tokens in an nltk.Text object, which provides additional methods for exploratory text analysis, such as concordancing and collocation analysis.
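For example, a quick, illustrative use of nltk.Text for concordancing, reusing the tokens from the snippet above, looks like this:

# wrap the tokens in a Text object for exploratory analysis
text_obj = nltk.Text(tokens)

# show every occurrence of "language" together with its surrounding context
text_obj.concordance("language")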

3. Remove punctuation

Removing punctuation marks simplifies the text and makes it easier to process.

To remove punctuation from a list of tokens using NLTK, you can use the string module to check if each token is a punctuation character. Here is an example of how to do this:

import nltk
import string

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# remove punctuation
filtered_tokens = [token for token in tokens if token not in string.punctuation]

print("Tokens without punctuation:", filtered_tokens)

This will output the following list of tokens without punctuation:

Tokens without punctuation: ['Natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'deals', 'with', 'the', 'interaction', 'between', 'computers', 'and', 'human', 'natural', 'language']
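Note that string.punctuation only covers single ASCII punctuation characters, so tokens such as '...' or typographic quotes would survive the filter above. A slightly more aggressive, illustrative alternative, reusing the tokens from the snippet above, is to keep only tokens that contain at least one letter or digit:

# keep only tokens that contain at least one alphanumeric character
filtered_tokens = [token for token in tokens if any(ch.isalnum() for ch in token)]

print("Tokens without punctuation:", filtered_tokens)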

4. Remove stop words

Removing common words that do not add significant meaning to the text, such as “a,” “an,” and “the.”

To remove common stop words from a list of tokens using NLTK, you can use the nltk.corpus.stopwords.words() function to get a list of stopwords in a specific language and filter the tokens using this list. Here is an example of how to do this:

import nltk

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# get list of stopwords in English
stopwords = nltk.corpus.stopwords.words("english")

# remove stopwords
filtered_tokens = [token for token in tokens if token.lower() not in stopwords]

print("Tokens without stopwords:", filtered_tokens)

This will output the following list of tokens without stopwords:

Tokens without stopwords: ['Natural', 'language', 'processing', 'field', 'artificial', 'intelligence', 'deals', 'interaction', 'computers', 'human', '(', 'natural', ')', 'language', '.']
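Because the stop word list is just a Python list, you can also extend it with domain-specific terms of your own. The extra words below are purely illustrative, reusing the tokens and stopwords from the snippet above:

# add hypothetical domain-specific stop words to the standard list
custom_stopwords = set(stopwords) | {"computers", "interaction"}

# remove both standard and custom stopwords
filtered_tokens = [token for token in tokens if token.lower() not in custom_stopwords]

print("Tokens without custom stopwords:", filtered_tokens)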

5. Remove extra whitespace

Extra whitespace, such as repeated spaces, tabs and newlines, adds nothing to further analysis.

To remove it, you can use the str.strip() method to drop leading and trailing whitespace, and str.split() together with " ".join() to collapse runs of whitespace into a single space. Here is an example of how to do this:

# input text with extra white space
text = "  Natural   language processing   is   a field   of artificial intelligence   that deals with the interaction between computers and human   (natural)   language.   "

# remove leading and trailing white space
text = text.strip()

# replace multiple consecutive white space characters with a single space
text = " ".join(text.split())

print("Cleaned text:", text)

This will output the following cleaned text:

Cleaned text: Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language.

6. Remove URLs

To remove URLs from a string of text, you can use a regular expression pattern (via Python's re module) to identify URLs and replace them with an empty string, since NLTK has no dedicated URL remover. Here is an example of how to do this:

import re

# input text with URLs
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. Check out this article for more information: https://en.wikipedia.org/wiki/Natural_language_processing"

# define a regular expression pattern to match URLs
pattern = r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"

# replace URLs with an empty string
cleaned_text = re.sub(pattern, "", text)

print("Text without URLs:", cleaned_text)

This will output the following text without URLs:

Text without URLs: Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. Check out this article for more information:

7. Remove HTML code

To remove HTML tags from a string of text, you can use a regular expression pattern to identify the tags and replace them with an empty string; for messy real-world HTML, a dedicated parser such as BeautifulSoup is the more robust option. Here is an example of how to do this:

import re

# input text with HTML code
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. <b>This is an example of bold text.</b>"

# define a regular expression pattern to match HTML tags
pattern = r"<[^>]+>"

# replace HTML tags with an empty string
cleaned_text = re.sub(pattern, "", text)

print("Text without HTML code:", cleaned_text)

This will output the following text without HTML code:

Text without HTML code: Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. This is an example of bold text.

8. Remove frequent words

To remove frequent words (also known as “high-frequency words”) from a list of tokens using NLTK, you can use the nltk.FreqDist() function to calculate the frequency of each word and filter out the most common ones. Here is an example of how to do this:

import nltk

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# calculate the frequency of each word
fdist = nltk.FreqDist(tokens)

# remove the most frequent words (here just the single most common token; on a real corpus you might drop the top N)
most_common = {word for word, _ in fdist.most_common(1)}
filtered_tokens = [token for token in tokens if token not in most_common]

print("Tokens without frequent words:", filtered_tokens)

On this short example only the word "language" occurs more than once, so it is the only token removed:

Tokens without frequent words: ['Natural', 'processing', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'deals', 'with', 'the', 'interaction', 'between', 'computers', 'and', 'human', '(', 'natural', ')', '.']

In practice you would build the frequency distribution over your whole corpus rather than a single sentence, so that genuinely high-frequency words stand out.

9. Spelling correction

Correcting misspelt words is sometimes important so that the meaning of a sentence can be interpreted later in the processing.

To perform spelling correction on a list of tokens using NLTK, you can use the nltk.corpus.words.words() function to get a list of English words and the nltk.edit_distance() function to find the dictionary word closest to each token. This brute-force search is only practical for short examples, but it illustrates the idea. Here is an example of how to do this:

import nltk

# input text
text = "Natural langage processing is a field of artificial intelligece that deals with the interaction between computers and human (naturl) langage."

# tokenize the text
tokens = nltk.word_tokenize(text)

# get a list of English words, plus a lowercased set for fast membership tests
word_list = nltk.corpus.words.words()
word_set = set(word.lower() for word in word_list)

# correct the spelling of each word
corrected_tokens = []
for token in tokens:
    # leave punctuation and correctly spelled words untouched
    if not token.isalpha() or token.lower() in word_set:
        corrected_tokens.append(token)
        continue
    # otherwise pick the dictionary word with the smallest edit distance (slow for long texts)
    corrected_tokens.append(min(word_list, key=lambda word: nltk.edit_distance(word, token)))

print("Corrected tokens:", corrected_tokens)

Misspelled words such as "langage", "intelligece" and "naturl" are replaced by the nearest dictionary entries (ideally "language", "intelligence" and "natural"), while punctuation and correctly spelled tokens pass through unchanged. Bear in mind that ties in edit distance are broken by whichever word happens to come first in the list, so this naive approach can occasionally pick an unexpected neighbour; for real projects, a dedicated spell-checking library is usually the better choice.

10. Stemming

Reducing words to their base form, such as converting “jumping” to “jump.”

To perform stemming on a list of tokens using NLTK, you can use the nltk.stem.PorterStemmer() function to create a stemmer object and the stem() method to stem each token. Here is an example of how to do this:

import nltk

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# create stemmer object
stemmer = nltk.stem.PorterStemmer()

# stem each token
stemmed_tokens = [stemmer.stem(token) for token in tokens]

print("Stemmed tokens:", stemmed_tokens)

This will output the following list of stemmed tokens:

Stemmed tokens: ['natur', 'languag', 'process', 'is', 'a', 'field', 'of', 'artifici', 'intellig', 'that', 'deal', 'with', 'the', 'interact', 'between', 'comput', 'and', 'human', '(', 'natur', ')', 'languag', '.']

The Porter stemmer is a widely used algorithm that removes common morphological affixes from words in order to obtain their base form or root. Other stemmers are also available in the nltk library, such as the Snowball stemmer, which supports multiple languages.
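As an illustration, swapping in the Snowball stemmer only changes the object you create; here it is used with its English rules:

import nltk

# input text
text = "Natural language processing deals with computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# create a Snowball stemmer for English (other languages such as "german" or "spanish" are also supported)
snowball = nltk.stem.SnowballStemmer("english")

# stem each token
print("Snowball stems:", [snowball.stem(token) for token in tokens])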

11. Lemmatization

Lemmatization is a more sophisticated, and usually more accurate, way of reducing words to their base form (lemma) than stemming, because it relies on a vocabulary rather than simply chopping off affixes.

To perform lemmatization on a list of tokens using NLTK, you can use the nltk.stem.WordNetLemmatizer() function to create a lemmatizer object and the lemmatize() method to lemmatize each token. Here is an example of how to do this:

import nltk

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# create lemmatizer object
lemmatizer = nltk.stem.WordNetLemmatizer()

# lemmatize each token
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

print("Lemmatized tokens:", lemmatized_tokens)

This will output a list of lemmatized tokens similar to the following; with the default noun part of speech, plural nouns such as 'computers' are reduced to their singular form while most other tokens are left unchanged:

Lemmatized tokens: ['Natural', 'language', 'processing', 'is', 'a', 'field', 'of', 'artificial', 'intelligence', 'that', 'deals', 'with', 'the', 'interaction', 'between', 'computer', 'and', 'human', '(', 'natural', ')', 'language', '.']

The WordNet lemmatizer looks each token up in the WordNet database to find its lemma. By default every token is treated as a noun; it does not infer the part of speech from context, so for better results you can pass it explicitly via the pos argument of the lemmatize() method (e.g., "n" for nouns, "v" for verbs).
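For example, supplying the part of speech changes the result for verb forms, which the default noun setting leaves untouched:

import nltk

# create lemmatizer object
lemmatizer = nltk.stem.WordNetLemmatizer()

# the default part of speech is "n" (noun), so verb forms are left as they are
print(lemmatizer.lemmatize("running"))           # running
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("deals", pos="v"))    # deal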

12. Part-of-speech tagging

Identifying the part of speech of each word in the text, such as noun, verb, or adjective.

To perform part of speech (POS) tagging on a list of tokens using NLTK, you can use the nltk.pos_tag() function to tag the tokens with their corresponding POS tags. Here is an example of how to do this:

import nltk

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# tag the tokens with their POS tags
tagged_tokens = nltk.pos_tag(tokens)

print("Tagged tokens:", tagged_tokens)

This will output a list of (token, tag) tuples along the following lines (the exact tags can vary with the tagger model, and punctuation tokens receive their own tags):

Tagged tokens: [('Natural', 'NNP'), ('language', 'NN'), ('processing', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('field', 'NN'), ('of', 'IN'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('that', 'WDT'), ('deals', 'NNS'), ('with', 'IN'), ('the', 'DT'), ('interaction', 'NN'), ('between', 'IN'), ('computers', 'NNS'), ('and', 'CC'), ('human', 'NNS'), ('natural', 'NNP'), ('language', 'NN')]

13. Named Entity Recognition

Extracting named entities from a text, like a person’s name.

To perform named entity recognition (NER) on a list of tokens using NLTK, you can use the nltk.ne_chunk() function to identify and label named entities in the tokens. Here is an example of how to do this:

import nltk

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. John Smith works at Google in New York."

# tokenize the text
tokens = nltk.word_tokenize(text)

# tag the tokens with their part of speech
tagged_tokens = nltk.pos_tag(tokens)

# identify named entities
named_entities = nltk.ne_chunk(tagged_tokens)

print("Named entities:", named_entities)

Printing named_entities shows an nltk.Tree rather than a flat list. Recognised entities are grouped into labelled subtrees, so for this text tokens such as 'John Smith', 'Google' and 'New York' are chunked together with labels like PERSON and GPE, while the remaining tokens stay as plain (token, tag) tuples. If a GUI is available, named_entities.draw() will display the tree graphically.

14. Normalization

Standardising words or phrases that have multiple possible forms or spellings (e.g. "American" and "US" could both be normalised to "United States"). This is most easily done with a lookup table of synonyms or industry-specific terms, as in the sketch below.
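NLTK does not provide a general-purpose normaliser for this, but a simple lookup table gets you a long way. The mapping below is purely illustrative:

import nltk

# hypothetical mapping of variant forms to a canonical form
normalization_map = {
    "us": "united states",
    "american": "united states",
    "nlp": "natural language processing",
}

# input text
text = "NLP is widely used in the US."

# tokenize the text
tokens = nltk.word_tokenize(text)

# replace each token with its canonical form if one is defined
normalized_tokens = [normalization_map.get(token.lower(), token) for token in tokens]

print("Normalized tokens:", normalized_tokens)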

NLTK preprocessing pipeline example

Preprocessing techniques can be applied independently or in combination, depending on the specific requirements of the task at hand.

Here is an example of a typical NLP pipeline using the NLTK:

  1. Tokenization: First, we need to split the input text into individual words (tokens). This can be done using the nltk.word_tokenize() function.
  2. Part-of-speech tagging: Next, we can use the nltk.pos_tag() function to assign a part-of-speech (POS) tag to each token, which indicates its role in a sentence (e.g., noun, verb, adjective).
  3. Named entity recognition: Using the nltk.ne_chunk() function, we can identify named entities (e.g., person, organization, location) in the text.
  4. Lemmatization: We can use the nltk.WordNetLemmatizer() function to convert each token to its base form (lemma), which helps with the analysis of the text.
  5. Stopword removal: We can use the nltk.corpus.stopwords.words() function to remove common words (stopwords) that do not add significant meaning to the text, such as “the,” “a,” and “an.”
  6. Text classification: Finally, we can use the processed text to train a classifier using machine learning algorithms to perform tasks such as sentiment analysis or spam detection.

NLTK preprocessing example code

import nltk

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenization
tokens = nltk.word_tokenize(text)
print("Tokens:", tokens)

# part-of-speech tagging
pos_tags = nltk.pos_tag(tokens)
print("POS tags:", pos_tags)

# named entity recognition
named_entities = nltk.ne_chunk(pos_tags)
print("Named entities:", named_entities)

# lemmatization
lemmatizer = nltk.WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in tokens]
print("Lemmas:", lemmas)

# stopword removal
stopwords = nltk.corpus.stopwords.words("english")
filtered_tokens = [token for token in tokens if token not in stopwords]
print("Filtered tokens:", filtered_tokens)

# text classification (example using a simple Naive Bayes classifier)
from nltk.classify import NaiveBayesClassifier

# training data (using a toy dataset for illustration purposes)
training_data = [("It was a great movie.", "pos"), ("I hated the book.", "neg"), ("The book was okay.", "pos")]

# extract features from the training data
def extract_features(text):
    features = {}
    for word in nltk.word_tokenize(text):
        features[word] = True
    return features

# create a list of feature sets and labels
feature_sets = [(extract_features(text), label) for (text, label) in training_data]

# train the classifier
classifier = NaiveBayesClassifier.train(feature_sets)

# test the classifier on a new example
test_text = "I really enjoyed the movie."
print("Sentiment:", classifier.classify(extract_features(test_text)))

Alternatives to NLTK for preprocessing in Python

There are several alternatives to NLTK in Python that can be used for natural language processing (NLP) preprocessing tasks, such as tokenization, part-of-speech tagging, and lemmatization. Some options include:

  1. spaCy: This open-source library is designed for efficient NLP preprocessing and has a wide range of features, including tokenization, part-of-speech tagging, dependency parsing, and named entity recognition.
  2. Gensim: This library provides tools for preprocessing text data, including tokenization, stopword removal, and lemmatization. It also has a wide range of algorithms for topic modelling and document similarity analysis.
  3. Pattern: This library is a web mining module for Python that provides tools for NLP tasks, including tokenization, part-of-speech tagging, and spelling correction.
  4. TextBlob: This library provides a simple interface for common NLP tasks, such as tokenization, part-of-speech tagging, and sentiment analysis. It is built on top of the NLTK library.
  5. Stanford CoreNLP: This suite of NLP tools from Stanford University includes a wide range of capabilities, including tokenization, part-of-speech tagging, named entity recognition, and parsing. It is available as a standalone Java application or as a Python wrapper.

Each of these libraries has its own strengths and limitations, and the best choice will depend on your specific needs and requirements. It may be worth trying out a few different options to see which works best for your use case.
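As a point of comparison, here is a minimal sketch of the same first few steps in spaCy, assuming the small English model has been installed with python -m spacy download en_core_web_sm:

import spacy

# load the small English pipeline (tokenizer, tagger, lemmatizer and named entity recognizer)
nlp = spacy.load("en_core_web_sm")

doc = nlp("John Smith works at Google in New York.")

print("Tokens:", [token.text for token in doc])
print("Lemmas:", [token.lemma_ for token in doc])
print("POS tags:", [(token.text, token.pos_) for token in doc])
print("Entities:", [(ent.text, ent.label_) for ent in doc.ents])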

Closing thoughts on NLTK preprocessing

Using NLTK to build a preprocessing pipeline is a solid choice. Not every step in this guide needs to be used for every application, but you will probably find yourself using quite a few of these techniques with every NLP project. Getting the basics down will therefore serve you well.

What technique from this guide did you use, and what does your preprocessing pipeline look like? Let us know in the comments.

