This is a complete guide on utilising NLTK to build a whole preprocessing pipeline. Take the time to read through the different components so you know how to start building your pipeline.
A useful library for processing text in Python is the Natural Language Toolkit (NLTK). This guide will go into 14 of the most commonly used pre-processing steps and provide code examples so you can start using the techniques immediately.
Building an NLP pipeline might seem intimidating at first, but it doesn't have to be.
After covering the individual components, we will build an NLP preprocessing pipeline entirely in NLTK so that you can see how these techniques can be used together to create a whole system.
Note that every application is different and will require a different pre-processing pipeline. The key is to understand the different building blocks so that you can put them together to build your own pipeline.
Why use NLTK for preprocessing?
The Natural Language Toolkit (NLTK) is a popular Python library for working with human language data (text). There are plenty of good reasons to use NLTK.
Ease of use: NLTK provides a simple and intuitive interface for performing common NLP tasks such as tokenization, stemming, and part-of-speech tagging.
Large collection of data and resources: NLTK includes a wide range of corpora (large collections of text data) and resources for working with them, such as lexicons, grammars, and corpora annotated with linguistic information.
Support for various languages: NLTK supports several languages and provides language-specific tools for working with them, including tokenizers, stemmers, and other resources for languages such as Arabic.
Active development and community: NLTK is an actively developed library with a large and supportive community of users and contributors.
Compatibility with other libraries: NLTK is compatible with other popular Python libraries for data manipulation and machine learning, such as NumPy and scikit-learn, making it easy to incorporate into larger projects.
Top 14 NLTK preprocessing steps
It’s useful to dig into the different components that can be used for preprocessing first. Once we understand these, we can build an entire pipeline. Some common NLP preprocessing steps include:
1. Tokenization
Splitting the text into individual words or subwords (tokens).
Here is how to implement tokenization in NLTK:
import nltk

# first run only: nltk.download("punkt") to fetch the Punkt tokenizer models

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)
print("Tokens:", tokens)
The nltk.word_tokenize() function uses the Punkt tokenization algorithm, which is a widely used method for tokenizing text in multiple languages. You can also use other tokenization methods, such as splitting the text on whitespace or punctuation, but these may not be as reliable for handling complex text structures and languages.
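As a rough illustration of that difference, here is a minimal sketch comparing a naive whitespace split with nltk.word_tokenize(); the sample sentence is only an illustrative assumption.

import nltk

# first run only: nltk.download("punkt")

sentence = "Dr. Smith didn't arrive (on time)."

# a naive whitespace split leaves punctuation and contractions attached to words
print(sentence.split())

# the Punkt/Treebank-based tokenizer splits off punctuation and contractions
print(nltk.word_tokenize(sentence))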
2. Lowercasing
Converting all text to lowercase to make it case-insensitive. To lowercase the tokens in a list using NLTK, you can simply use the built-in lower() method for strings:
import nltk

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# lowercase the tokens
lowercased_tokens = [token.lower() for token in tokens]
print("Lowercased tokens:", lowercased_tokens)
This prints the list of lowercased tokens.
You can also use the nltk.Text() function to create a Text object from the tokens, which provides additional methods for text processing, such as concordancing and collocation analysis.
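For example, here is a minimal sketch of wrapping the tokens in a Text object; the search word "language" is just an illustrative choice.

import nltk

text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."
tokens = nltk.word_tokenize(text)

# wrap the tokens in a Text object
nltk_text = nltk.Text(tokens)

# show every occurrence of "language" together with its surrounding context
nltk_text.concordance("language")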
3. Remove punctuation
Removing punctuation marks simplifies the text and makes it easier to process.
To remove punctuation from a list of tokens using NLTK, you can use the string module to check whether each token is a punctuation character. Here is an example of how to do this:
import nltk
import string

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# remove punctuation
filtered_tokens = [token for token in tokens if token not in string.punctuation]
print("Tokens without punctuation:", filtered_tokens)
This prints the list of tokens without punctuation.
4. Remove stop words
Removing common words (stopwords) that do not add significant meaning to the text, such as "the", "a", and "an".
To remove common stop words from a list of tokens using NLTK, you can use the nltk.corpus.stopwords.words() function to get a list of stopwords in a specific language and filter the tokens using this list. Here is an example of how to do this:
import nltk

# first run only: nltk.download("stopwords")

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# get list of stopwords in English
stopwords = nltk.corpus.stopwords.words("english")

# remove stopwords
filtered_tokens = [token for token in tokens if token.lower() not in stopwords]
print("Tokens without stopwords:", filtered_tokens)
This prints the list of tokens without stopwords.
5. Remove extra white space
Extra white space can include spaces, tabs, and newlines that don't add value to further analysis.
To remove extra white space from a string of text, you can use the built-in strip() method to remove leading and trailing white space, and split() combined with join() to collapse multiple consecutive white space characters into a single space. Here is an example of how to do this:
# input text with extra white space
text = "   Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language.   "

# remove leading and trailing white space
text = text.strip()

# replace multiple consecutive white space characters with a single space
text = " ".join(text.split())

print("Cleaned text:", text)
This will output the following cleaned text:
Cleaned text: Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language.
6. Remove URLs
To remove URLs from a string of text using NLTK, you can use a regular expression pattern to identify URLs and replace them with an empty string. Here is an example of how to do this:
import re

# input text with URLs
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. Check out this article for more information: https://en.wikipedia.org/wiki/Natural_language_processing"

# define a regular expression pattern to match URLs
pattern = r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"

# replace URLs with an empty string
cleaned_text = re.sub(pattern, "", text)
print("Text without URLs:", cleaned_text)
This will output the following text without URLs:
Text without URLs: Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. Check out this article for more information:
7. Remove HTML code
To remove HTML code from a string of text using NLTK, you can use a regular expression pattern to identify HTML tags and replace them with an empty string. Here is an example of how to do this:
import re

# input text with HTML code
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. <b>This is an example of bold text.</b>"

# define a regular expression pattern to match HTML tags
pattern = r"<[^>]+>"

# replace HTML tags with an empty string
cleaned_text = re.sub(pattern, "", text)
print("Text without HTML code:", cleaned_text)
This will output the following text without HTML code:
Text without HTML code: Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. This is an example of bold text.
8. Remove frequent words
To remove frequent words (also known as "high-frequency words") from a list of tokens using NLTK, you can use the nltk.FreqDist() function to calculate the frequency of each word and filter out the most common ones. Here is an example of how to do this:
print("Tokens without frequent words:", filtered_tokens)" style="color:#000000;display:none" aria-label="Copy" class="code-block-pro-copy-button">
import nltk

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# calculate the frequency of each word
fdist = nltk.FreqDist(tokens)

# keep only tokens whose count is below 10% of the total number of tokens
filtered_tokens = [token for token in tokens if fdist[token] < fdist.N() * 0.1]
print("Tokens without frequent words:", filtered_tokens)
This prints the list of tokens with the most frequent words removed.
9. Spelling correction
Correcting misspelt words is sometimes important so that the meaning of a sentence can be interpreted correctly later in the processing.
To perform spelling correction on a list of tokens using NLTK, you can use the nltk.corpus.words.words() function to get a list of English words and the nltk.edit_distance() function to calculate the edit distance between a word and the words in the list. Here is an example of how to do this:
import nltk# input texttext = "Natural langage processing is a field of artificial intelligece that deals with the interaction between computers and human (naturl) langage."# tokenize the texttokens = nltk.word_tokenize(text)# get list of English wordswords = nltk.corpus.words.words()# correct spelling of each wordcorrected_tokens = []for token in tokens:# find the word with the lowest edit distance corrected_token = min(words, key=lambdax: nltk.edit_distance(x, token)) corrected_tokens.append(corrected_token)print("Corrected tokens:", corrected_tokens)
This prints the list of corrected tokens.
10. Stemming
Reducing words to their base form, such as converting "jumping" to "jump".
To perform stemming on a list of tokens using NLTK, you can use nltk.stem.PorterStemmer() to create a stemmer object and its stem() method to stem each token. Here is an example of how to do this:
import nltk

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# create stemmer object
stemmer = nltk.stem.PorterStemmer()

# stem each token
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print("Stemmed tokens:", stemmed_tokens)
This prints the list of stemmed tokens.
The Porter stemmer is a widely used algorithm that removes common morphological affixes from words in order to obtain their base form or root. Other stemmers are also available in the nltk library, such as the Snowball stemmer, which supports multiple languages.
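As a quick sketch of the Snowball stemmer mentioned above (the words chosen here are just illustrative examples):

import nltk

# list the languages supported by the Snowball stemmer
print(nltk.stem.SnowballStemmer.languages)

# stem an English word
english_stemmer = nltk.stem.SnowballStemmer("english")
print(english_stemmer.stem("running"))  # -> "run"

# stem a Spanish word with the Spanish stemmer
spanish_stemmer = nltk.stem.SnowballStemmer("spanish")
print(spanish_stemmer.stem("corriendo"))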
11. Lemmatization
A more complicated and accurate method of reducing words to their base form than stemming.
To perform lemmatization on a list of tokens using NLTK, you can use nltk.stem.WordNetLemmatizer() to create a lemmatizer object and its lemmatize() method to lemmatize each token. Here is an example of how to do this:
import nltk

# first run only: nltk.download("wordnet")

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# create lemmatizer object
lemmatizer = nltk.stem.WordNetLemmatizer()

# lemmatize each token
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
print("Lemmatized tokens:", lemmatized_tokens)
This prints the list of lemmatized tokens.
The WordNet lemmatizer uses the WordNet database of English words to lemmatize the tokens, taking into account the part of speech of each word. You can specify the part of speech of a token using the pos argument of the lemmatize() method (e.g., "n" for nouns, "v" for verbs, etc.).
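For example, a minimal sketch showing how the pos argument changes the result:

import nltk

# first run only: nltk.download("wordnet")
lemmatizer = nltk.stem.WordNetLemmatizer()

# without a pos argument, the default part of speech is noun ("n")
print(lemmatizer.lemmatize("running"))           # -> "running"

# treating the word as a verb gives the expected base form
print(lemmatizer.lemmatize("running", pos="v"))  # -> "run"

# adjectives use "a"
print(lemmatizer.lemmatize("better", pos="a"))   # -> "good"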
12. Part of speech (POS) tagging
Identifying the part of speech of each word in the text, such as noun, verb, or adjective.
To perform part of speech (POS) tagging on a list of tokens using NLTK, you can use the nltk.pos_tag() function to tag the tokens with their corresponding POS tags. Here is an example of how to do this:
import nltk

# first run only: nltk.download("averaged_perceptron_tagger")

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenize the text
tokens = nltk.word_tokenize(text)

# tag the tokens with their POS tags
tagged_tokens = nltk.pos_tag(tokens)
print("Tagged tokens:", tagged_tokens)
This prints a list of (token, POS tag) tuples.
13. Named entity recognition (NER)
Identifying named entities in the text, such as people, organisations, and locations.
To perform named entity recognition (NER) on a list of tokens using NLTK, you can use the nltk.ne_chunk() function to identify and label named entities in the tokens. Here is an example of how to do this:
import nltk

# first run only: nltk.download("maxent_ne_chunker") and nltk.download("words")

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language. John Smith works at Google in New York."

# tokenize the text
tokens = nltk.word_tokenize(text)

# tag the tokens with their part of speech
tagged_tokens = nltk.pos_tag(tokens)

# identify named entities
named_entities = nltk.ne_chunk(tagged_tokens)
print("Named entities:", named_entities)
This prints a tree in which named entities such as people, organisations, and locations are labelled.
14. Normalisation
Standardising words or phrases that have multiple possible forms or spellings (e.g. "American" and "US" could both be normalised to "United States"). This can easily be done with a list of synonyms or industry-specific terms, as the short sketch below shows.
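NLTK does not ship a dedicated normalisation function, so here is a minimal sketch using a plain Python dictionary of synonyms; the mapping and sample sentence are only illustrative assumptions.

import nltk

# illustrative synonym map; in practice this would be domain-specific
normalisation_map = {
    "american": "united states",
    "us": "united states",
    "u.s.": "united states",
}

text = "The American and US markets reacted differently."
tokens = nltk.word_tokenize(text)

# replace each token with its normalised form if one is defined
normalised_tokens = [normalisation_map.get(token.lower(), token) for token in tokens]
print("Normalised tokens:", normalised_tokens)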
NLTK preprocessing pipeline example
Preprocessing techniques can be applied independently or in combination, depending on the specific requirements of the task at hand.
Here is an example of a typical NLP pipeline using the NLTK:
Tokenization: First, we need to split the input text into individual words (tokens). This can be done using the nltk.word_tokenize() function.
Part-of-speech tagging: Next, we can use the nltk.pos_tag() function to assign a part-of-speech (POS) tag to each token, which indicates its role in a sentence (e.g., noun, verb, adjective).
Named entity recognition: Using the nltk.ne_chunk() function, we can identify named entities (e.g., person, organization, location) in the text.
Lemmatization: We can use the nltk.WordNetLemmatizer() function to convert each token to its base form (lemma), which helps with the analysis of the text.
Stopword removal: We can use the nltk.corpus.stopwords.words() function to remove common words (stopwords) that do not add significant meaning to the text, such as "the", "a", and "an".
Text classification: Finally, we can use the processed text to train a classifier using machine learning algorithms to perform tasks such as sentiment analysis or spam detection.
NLTK preprocessing example code
import nltk

# resources needed for this pipeline (first run only):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")
# nltk.download("wordnet"); nltk.download("stopwords")

# input text
text = "Natural language processing is a field of artificial intelligence that deals with the interaction between computers and human (natural) language."

# tokenization
tokens = nltk.word_tokenize(text)
print("Tokens:", tokens)

# part-of-speech tagging
pos_tags = nltk.pos_tag(tokens)
print("POS tags:", pos_tags)

# named entity recognition
named_entities = nltk.ne_chunk(pos_tags)
print("Named entities:", named_entities)

# lemmatization
lemmatizer = nltk.WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in tokens]
print("Lemmas:", lemmas)

# stopword removal
stopwords = nltk.corpus.stopwords.words("english")
filtered_tokens = [token for token in tokens if token not in stopwords]
print("Filtered tokens:", filtered_tokens)

# text classification (example using a simple Naive Bayes classifier)
from nltk.classify import NaiveBayesClassifier

# training data (using a toy dataset for illustration purposes)
training_data = [("It was a great movie.", "pos"),
                 ("I hated the book.", "neg"),
                 ("The book was okay.", "pos")]

# extract features from the training data
def extract_features(text):
    features = {}
    for word in nltk.word_tokenize(text):
        features[word] = True
    return features

# create a list of feature sets and labels
feature_sets = [(extract_features(text), label) for (text, label) in training_data]

# train the classifier
classifier = NaiveBayesClassifier.train(feature_sets)

# test the classifier on a new example
test_text = "I really enjoyed the movie."
print("Sentiment:", classifier.classify(extract_features(test_text)))
Alternatives to NLTK for preprocessing in Python
There are several alternatives to NLTK in Python that can be used for natural language processing (NLP) preprocessing tasks, such as tokenization, part-of-speech tagging, and lemmatization. Some options include:
SpaCy: This open-source library is designed for efficient NLP preprocessing and has a wide range of features, including tokenization, part-of-speech tagging, dependency parsing, and named entity recognition.
Gensim: This library provides tools for preprocessing text data, including tokenization, stopword removal, and lemmatization. It also has a wide range of algorithms for topic modelling and document similarity analysis.
Pattern: This library is a web mining module for Python that provides tools for NLP tasks, including tokenization, part-of-speech tagging, and spelling correction.
TextBlob: This library provides a simple interface for common NLP tasks, such as tokenization, part-of-speech tagging, and sentiment analysis. It is built on top of the NLTK library.
Stanford CoreNLP: This suite of NLP tools from Stanford University includes a wide range of capabilities, including tokenization, part-of-speech tagging, named entity recognition, and parsing. It is available as a standalone Java application or as a Python wrapper.
Each of these libraries has its own strengths and limitations, and the best choice will depend on your specific needs and requirements. It may be worth trying out a few different options to see which works best for your use case.
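For comparison, here is a minimal sketch of the same basic steps in spaCy; it assumes the small English model has been installed with python -m spacy download en_core_web_sm.

import spacy

# load the small English pipeline (tokenizer, tagger, lemmatizer, NER)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Natural language processing is a field of artificial intelligence. John Smith works at Google in New York.")

# tokens, POS tags, lemmas, and stopword flags in a single pass
for token in doc:
    print(token.text, token.pos_, token.lemma_, token.is_stop)

# named entities
for ent in doc.ents:
    print(ent.text, ent.label_)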
Closing thoughts on NLTK preprocessing
Using NLTK to build a preprocessing pipeline is a solid choice. Not every step in this guide needs to be used for every application, but you will probably find yourself using quite a few of these techniques with every NLP project. Getting the basics down will therefore serve you well.
What technique from this guide did you use, and what does your preprocessing pipeline look like? Let us know in the comments.
Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.