This is a complete guide on utilising NLTK to build a whole preprocessing pipeline. Take the time to read through the different components so you know how to start building your pipeline.
A useful library for processing text in Python is the Natural Language Toolkit (NLTK). This guide will go into 14 of the most commonly used pre-processing steps and provide code examples so you can start using the techniques immediately.
Building an NLP pipeline might seem intimidating at first, but it doesn't have to be.
After covering the individual components, we will build an NLP preprocessing pipeline entirely in NLTK so that you can see how these techniques can be used together to create a whole system.
Note that every application is different and will require a different pre-processing pipeline. The key is to understand the different building blocks so that you can put them together to build your own pipeline.
Why use NLTK for preprocessing?
The Natural Language Toolkit (NLTK) is a popular Python library for working with human language data (text). There are plenty of good reasons to use NLTK.
Ease of use: NLTK provides a simple and intuitive interface for performing common NLP tasks such as tokenization, stemming, and part-of-speech tagging.
Large collection of data and resources: NLTK includes a wide range of corpora (large collections of text data) and resources for working with them, such as lexicons, grammars, and corpora annotated with linguistic information.
Support for various languages: NLTK supports several languages and provides language-specific tools for working with them, including tokenizers, stemmers, and other resources for languages such as Arabic.
Active development and community: NLTK is an actively developed library with a large and supportive community of users and contributors.
Compatibility with other libraries: NLTK is compatible with other popular Python libraries for data manipulation and machine learning, such as NumPy and scikit-learn, making it easy to incorporate into larger projects.
Top 14 NLTK preprocessing steps
It’s useful to dig into the different components that can be used for preprocessing first. Once we understand these, we can build an entire pipeline. Some common NLP preprocessing steps include:
1. Tokenization
Splitting the text into individual words or subwords (tokens).
Here is how to implement tokenization in NLTK:
import nltk

# first run only: nltk.download("punkt") to fetch the Punkt tokenizer models

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)
print("Tokens:", tokens)
The nltk.word_tokenize() function uses the Punkt tokenization algorithm, which is a widely used method for tokenizing text in multiple languages. You can also use other tokenization methods, such as splitting the text on whitespace or punctuation, but these may not be as reliable for handling complex text structures and languages.
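As a rough illustration of that difference, here is a minimal sketch comparing a naive whitespace split with nltk.word_tokenize(); the sample sentence is only an illustrative assumption.

import nltk

# first run only: nltk.download("punkt")

sentence = "Dr. Smith didn't arrive (on time)."

# a naive whitespace split leaves punctuation and contractions attached to words
print(sentence.split())

# the Punkt/Treebank-based tokenizer splits off punctuation and contractions
print(nltk.word_tokenize(sentence))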
2. Lowercasing
Converting all text to lowercase to make it case-insensitive. To lowercase the tokens in a list using NLTK, you can simply use the built-in lower() method for strings:
import nltk

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# lowercase the tokens
lowercased_tokens = [token.lower() for token in tokens]
print("Lowercased tokens:", lowercased_tokens)
This prints the list of lowercased tokens.
You can also use the nltk.Text() function to create a Text object from the tokens, which provides additional methods for text processing, such as concordancing and collocation analysis.
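For example, here is a minimal sketch of wrapping the tokens in a Text object; the search word "language" is just an illustrative choice.

import nltk

text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."
tokens = nltk.word_tokenize(text)

# wrap the tokens in a Text object
nltk_text = nltk.Text(tokens)

# show every occurrence of "language" together with its surrounding context
nltk_text.concordance("language")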
3. Remove punctuation
Removing punctuation marks simplifies the text and makes it easier to process.
To remove punctuation from a list of tokens using NLTK, you can use the string module to check whether each token is a punctuation character. Here is an example of how to do this:
import nltk
import string

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# remove punctuation
filtered_tokens = [token for token in tokens if token not in string.punctuation]
print("Tokens without punctuation:", filtered_tokens)
This prints the list of tokens without punctuation.
4. Remove stop words
Removing common words (stopwords) that do not add significant meaning to the text, such as "the", "a", and "an".
To remove common stop words from a list of tokens using NLTK, you can use the nltk.corpus.stopwords.words() function to get a list of stopwords in a specific language and filter the tokens using this list. Here is an example of how to do this:
import nltk

# first run only: nltk.download("stopwords")

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# get list of stopwords in English
stopwords = nltk.corpus.stopwords.words("english")

# remove stopwords
filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
print("Tokens without stopwords:", filtered_tokens)
This prints the list of tokens without stopwords.
5. Remove extra white space
Extra white space can include spaces, tabs, and newlines that don't add value to further analysis.
To remove extra white space from a string of text, you can use the built-in strip() method to remove leading and trailing white space, and split() combined with join() to collapse multiple consecutive white space characters into a single space. Here is an example of how to do this:
# input text with extra white space
text = "   Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language.   "

# remove leading and trailing white space
text = text.strip()

# replace multiple consecutive white space characters with a single space
text = " ".join(text.split())

print("Cleaned text:", text)
This will output the following cleaned text:
Cleaned text: Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language.
6. Remove URLs
To remove URLs from a string of text using NLTK, you can use a regular expression pattern to identify URLs and replace them with an empty string. Here is an example of how to do this:
import re

# input text with URLs
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. Check out this article for more information: https://en.wikipedia.org/wiki/Natural_language_processing"

# define a regular expression pattern to match URLs
pattern = r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"

# replace URLs with an empty string
cleaned_text = re.sub(pattern, "", text)
print("Text without URLs:", cleaned_text)
This will output the following text without URLs:
Text without URLs: Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. Check out this article for more information:
7. Remove HTML code
To remove HTML code from a string of text using NLTK, you can use a regular expression pattern to identify HTML tags and replace them with an empty string. Here is an example of how to do this:
import re

# input text with HTML code
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. <b>This is an example of bold text.</b>"

# define a regular expression pattern to match HTML tags
pattern = r"<[^>]+>"

# replace HTML tags with an empty string
cleaned_text = re.sub(pattern, "", text)
print("Text without HTML code:", cleaned_text)
This will output the following text without HTML code:
Text without HTML code: Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. This is an example of bold text.
8. Remove frequent words
To remove frequent words (also known as "high-frequency words") from a list of tokens using NLTK, you can use the nltk.FreqDist() function to calculate the frequency of each word and filter out the most common ones. Here is an example of how to do this:
print("Tokens without frequent words:", filtered_tokens)" style="color:#000000;display:none" aria-label="Copy" class="code-block-pro-copy-button">
import nltk

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# calculate the frequency of each word
fdist = nltk.FreqDist(tokens)

# keep only tokens whose count is below 10% of the total number of tokens
filtered_tokens = [token for token in tokens if fdist[token] < fdist.N() * 0.1]
print("Tokens without frequent words:", filtered_tokens)
This prints the list of tokens with the most frequent words removed.
9. Spelling correction
Correcting misspelt words is sometimes important so that the meaning of a sentence can be interpreted correctly later in the processing.
To perform spelling correction on a list of tokens using NLTK, you can use the nltk.corpus.words.words() function to get a list of English words and the nltk.edit_distance() function to calculate the edit distance between a word and the words in the list. Here is an example of how to do this:
import nltk# input texttext = "Natural langage processing is a field of artificial intelligece that deals with the interaction between computers and human (naturl) langage."# tokenize the texttokens = nltk.word_tokenize(text)# get list of English wordswords = nltk.corpus.words.words()# correct spelling of each wordcorrected_tokens = []for token in tokens:# find the word with the lowest edit distance corrected_token = min(words, key=lambdax: nltk.edit_distance(x, token)) corrected_tokens.append(corrected_token)print("Corrected tokens:", corrected_tokens)
This prints the list of corrected tokens.
10. Stemming
Reducing words to their base form, such as converting "jumping" to "jump".
To perform stemming on a list of tokens using NLTK, you can use nltk.stem.PorterStemmer() to create a stemmer object and its stem() method to stem each token. Here is an example of how to do this:
import nltk

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# create stemmer object
stemmer = nltk.stem.PorterStemmer()

# stem each token
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print("Stemmed tokens:", stemmed_tokens)
This prints the list of stemmed tokens.
The Porter stemmer is a widely used algorithm that removes common morphological affixes from words in order to obtain their base form or root. Other stemmers are also available in the nltk library, such as the Snowball stemmer, which supports multiple languages.
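As a quick sketch of the Snowball stemmer mentioned above (the words chosen here are just illustrative examples):

import nltk

# list the languages supported by the Snowball stemmer
print(nltk.stem.SnowballStemmer.languages)

# stem an English word
english_stemmer = nltk.stem.SnowballStemmer("english")
print(english_stemmer.stem("running"))  # -> "run"

# stem a Spanish word with the Spanish stemmer
spanish_stemmer = nltk.stem.SnowballStemmer("spanish")
print(spanish_stemmer.stem("corriendo"))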
11. Lemmatization
A more complicated and accurate method of reducing words to their base form than stemming.
To perform lemmatization on a list of tokens using NLTK, you can use nltk.stem.WordNetLemmatizer() to create a lemmatizer object and its lemmatize() method to lemmatize each token. Here is an example of how to do this:
import nltk

# first run only: nltk.download("wordnet")

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# create lemmatizer object
lemmatizer = nltk.stem.WordNetLemmatizer()

# lemmatize each token
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
print("Lemmatized tokens:", lemmatized_tokens)
This prints the list of lemmatized tokens.
The WordNet lemmatizer uses the WordNet database of English words to lemmatize the tokens, taking into account the part of speech of each word. You can specify the part of speech of a token using the pos argument of the lemmatize() method (e.g., "n" for nouns, "v" for verbs, etc.).
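For example, a minimal sketch showing how the pos argument changes the result:

import nltk

# first run only: nltk.download("wordnet")
lemmatizer = nltk.stem.WordNetLemmatizer()

# without a pos argument, the default part of speech is noun ("n")
print(lemmatizer.lemmatize("running"))           # -> "running"

# treating the word as a verb gives the expected base form
print(lemmatizer.lemmatize("running", pos="v"))  # -> "run"

# adjectives use "a"
print(lemmatizer.lemmatize("better", pos="a"))   # -> "good"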
12. Part of speech (POS) tagging
Identifying the part of speech of each word in the text, such as noun, verb, or adjective.
To perform part of speech (POS) tagging on a list of tokens using NLTK, you can use the nltk.pos_tag() function to tag the tokens with their corresponding POS tags. Here is an example of how to do this:
import nltk

# first run only: nltk.download("averaged_perceptron_tagger")

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# tag the tokens with their POS tags
tagged_tokens = nltk.pos_tag(tokens)
print("Tagged tokens:", tagged_tokens)
This prints a list of (token, POS tag) tuples.
13. Named entity recognition (NER)
Identifying named entities in the text, such as people, organisations, and locations.
To perform named entity recognition (NER) on a list of tokens using NLTK, you can use the nltk.ne_chunk() function to identify and label named entities in the tokens. Here is an example of how to do this:
import nltk

# first run only: nltk.download("maxent_ne_chunker") and nltk.download("words")

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. John Smith works at Google in New York."

# tokenize the text
tokens = nltk.word_tokenize(text)

# tag the tokens with their part of speech
tagged_tokens = nltk.pos_tag(tokens)

# identify named entities
named_entities = nltk.ne_chunk(tagged_tokens)
print("Named entities:", named_entities)
This prints a tree in which named entities such as people, organisations, and locations are labelled.
14. Normalisation
Standardising words or phrases that have multiple possible forms or spellings (e.g. "American" and "US" could both be normalised to "United States"). This can easily be done with a list of synonyms or industry-specific terms, as the short sketch below shows.
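NLTK does not ship a dedicated normalisation function, so here is a minimal sketch using a plain Python dictionary of synonyms; the mapping and sample sentence are only illustrative assumptions.

import nltk

# illustrative synonym map; in practice this would be domain-specific
normalisation_map = {
    "american": "united states",
    "us": "united states",
    "u.s.": "united states",
}

text = "The American and US markets reacted differently."
tokens = nltk.word_tokenize(text)

# replace each token with its normalised form if one is defined
normalised_tokens = [normalisation_map.get(token.lower(), token) for token in tokens]
print("Normalised tokens:", normalised_tokens)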
NLTK preprocessing pipeline example
Preprocessing techniques can be applied independently or in combination, depending on the specific requirements of the task at hand.
Here is an example of a typical NLP pipeline using the NLTK:
Tokenization: First, we need to split the input text into individual words (tokens). This can be done using the nltk.word_tokenize() function.
Part-of-speech tagging: Next, we can use the nltk.pos_tag() function to assign a part-of-speech (POS) tag to each token, which indicates its role in a sentence (e.g., noun, verb, adjective).
Named entity recognition: Using the nltk.ne_chunk() function, we can identify named entities (e.g., person, organization, location) in the text.
Lemmatization: We can use the nltk.WordNetLemmatizer() function to convert each token to its base form (lemma), which helps with the analysis of the text.
Stopword removal: We can use the nltk.corpus.stopwords.words() function to remove common words (stopwords) that do not add significant meaning to the text, such as "the", "a", and "an".
Text classification: Finally, we can use the processed text to train a classifier using machine learning algorithms to perform tasks such as sentiment analysis or spam detection.
NLTK preprocessing example code
import nltk

# resources needed for this pipeline (first run only):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")
# nltk.download("wordnet"); nltk.download("stopwords")

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenization
tokens = nltk.word_tokenize(text)
print("Tokens:", tokens)

# part-of-speech tagging
pos_tags = nltk.pos_tag(tokens)
print("POS tags:", pos_tags)

# named entity recognition
named_entities = nltk.ne_chunk(pos_tags)
print("Named entities:", named_entities)

# lemmatization
lemmatizer = nltk.WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in tokens]
print("Lemmas:", lemmas)

# stopword removal
stopwords = nltk.corpus.stopwords.words("english")
filtered_tokens = [token for token in tokens if token not in stopwords]
print("Filtered tokens:", filtered_tokens)

# text classification (example using a simple Naive Bayes classifier)
from nltk.classify import NaiveBayesClassifier

# training data (using a toy dataset for illustration purposes)
training_data = [("It was a great movie.", "pos"),
                 ("I hated the book.", "neg"),
                 ("The book was okay.", "pos")]

# extract features from the training data
def extract_features(text):
    features = {}
    for word in nltk.word_tokenize(text):
        features[word] = True
    return features

# create a list of feature sets and labels
feature_sets = [(extract_features(text), label) for (text, label) in training_data]

# train the classifier
classifier = NaiveBayesClassifier.train(feature_sets)

# test the classifier on a new example
test_text = "I really enjoyed the movie."
print("Sentiment:", classifier.classify(extract_features(test_text)))
Alternatives to NLTK for preprocessing in Python
There are several alternatives to NLTK in Python that can be used for natural language processing (NLP) preprocessing tasks, such as tokenization, part-of-speech tagging, and lemmatization. Some options include:
SpaCy: This open-source library is designed for efficient NLP preprocessing and has a wide range of features, including tokenization, part-of-speech tagging, dependency parsing, and named entity recognition.
Gensim: This library provides tools for preprocessing text data, including tokenization, stopword removal, and lemmatization. It also has a wide range of algorithms for topic modelling and document similarity analysis.
Pattern: This library is a web mining module for Python that provides tools for NLP tasks, including tokenization, part-of-speech tagging, and spelling correction.
TextBlob: This library provides a simple interface for common NLP tasks, such as tokenization, part-of-speech tagging, and sentiment analysis. It is built on top of the NLTK library.
Stanford CoreNLP: This suite of NLP tools from Stanford University includes a wide range of capabilities, including tokenization, part-of-speech tagging, named entity recognition, and parsing. It is available as a standalone Java application or as a Python wrapper.
Each of these libraries has its own strengths and limitations, and the best choice will depend on your specific needs and requirements. It may be worth trying out a few different options to see which works best for your use case.
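For comparison, here is a minimal sketch of the same basic steps in spaCy; it assumes the small English model has been installed with python -m spacy download en_core_web_sm.

import spacy

# load the small English pipeline (tokenizer, tagger, lemmatizer, NER)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Natural language processing is a field of artificial intelligence. John Smith works at Google in New York.")

# tokens, POS tags, lemmas, and stopword flags in a single pass
for token in doc:
    print(token.text, token.pos_, token.lemma_, token.is_stop)

# named entities
for ent in doc.ents:
    print(ent.text, ent.label_)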
Closing thoughts on NLTK preprocessing
Using NLTK to build a preprocessing pipeline is a solid choice. Not every step in this guide needs to be used for every application, but you will probably find yourself using quite a few of these techniques with every NLP project. Getting the basics down will therefore serve you well.
What technique from this guide did you use, and what does your preprocessing pipeline look like? Let us know in the comments.
Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.