Text normalization is a key step in natural language processing (NLP). It involves cleaning and preprocessing text data to make it consistent and usable for downstream NLP tasks. The process covers a range of techniques, such as case normalization, punctuation removal, stop word removal, stemming, and lemmatization. In this article, we discuss each text normalization technique with examples, advantages, disadvantages, and sample code in Python.
Case normalization converts all text to lowercase or uppercase to standardize it. This technique is useful when working with text data that contains a mix of uppercase and lowercase letters.
Input: “The quick BROWN Fox Jumps OVER the lazy dog.”
Output: “the quick brown fox jumps over the lazy dog.”
text = "The quick BROWN Fox Jumps OVER the lazy dog."
text = text.lower()
print(text)
Punctuation removal is the process of removing special characters and punctuation marks from the text. This technique is useful when working with text data containing many punctuation marks, which can make the text harder to process.
Input: “The quick BROWN Fox Jumps OVER the lazy dog!!!”
Output: “The quick BROWN Fox Jumps OVER the lazy dog”
import string
text = "The quick BROWN Fox Jumps OVER the lazy dog!!!"
text = text.translate(str.maketrans("", "", string.punctuation))
print(text)
Stop word removal is the process of removing common words with little meaning, such as “the” and “a”. This technique is useful when working with text data containing many stop words, which can make the text harder to process.
Input: “The quick BROWN Fox Jumps OVER the lazy dog.”
Output: “quick BROWN Fox Jumps lazy dog.” (note that “over” is also in NLTK’s English stop word list)
import nltk
from nltk.corpus import stopwords
nltk.download("stopwords")  # required once
text = "The quick BROWN Fox Jumps OVER the lazy dog."
stop_words = set(stopwords.words("english"))
words = text.split()
# Compare in lowercase, since the stop word list is lowercase
filtered_words = [word for word in words if word.lower() not in stop_words]
text = " ".join(filtered_words)
print(text)
Stemming reduces words to their root form by stripping affixes, such as “running” becoming “run”. This method is helpful when working with text data that contains many different forms of the same word, which can make the text harder to process. Note that stemmers apply crude, rule-based suffix stripping, so irregular forms such as “ran” are left unchanged.
Input: “running,runs,run”
Output: “run,run,run”
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
text = "running,runs,run"
words = text.split(",")
stemmed_words = [stemmer.stem(word) for word in words]
text = ",".join(stemmed_words)
print(text)
Lemmatization reduces words to their dictionary base form (lemma) by considering the part of speech and context in which they are used, such as “running” becoming “run”. This technique is similar to stemming, but it is more accurate because it uses a vocabulary (here, WordNet) rather than crude suffix stripping.
Input: “running,runner,ran”
Output: “run,runner,run”
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")  # required once
lemmatizer = WordNetLemmatizer()
text = "running,runner,ran"
words = text.split(",")
# pos="v" treats each word as a verb; without it, the default is noun
# and "running" and "ran" would be returned unchanged
lemmatized_words = [lemmatizer.lemmatize(word, pos="v") for word in words]
text = ",".join(lemmatized_words)
print(text)
Tokenization is the process of breaking text into individual words or phrases, also known as “tokens”. This technique is useful when working with text data that needs to be analyzed at the word or phrase level, such as in text classification or language translation tasks.
Input: “The quick BROWN Fox Jumps OVER the lazy dog.”
Output: [“The”, “quick”, “BROWN”, “Fox”, “Jumps”, “OVER”, “the”, “lazy”, “dog”, “.”]
import nltk
from nltk.tokenize import word_tokenize
nltk.download("punkt")  # required once
text = "The quick BROWN Fox Jumps OVER the lazy dog."
tokens = word_tokenize(text)
print(tokens)
Replacing contractions and abbreviations expands shortened forms to their full words. This technique is useful when working with text data that contains synonyms or abbreviations that need to be replaced by their full form.
Input: “I’ll be there at 2pm”
Output: “I will be there at 2 pm”
text = "I'll be there at 2pm"
synonyms = {"I'll": "I will", "2pm": "2 pm"}
for key, value in synonyms.items():
text = text.replace(key, value)
print(text)
Removing numbers and symbols strips out digits and special characters. This technique is useful when working with text data that contains numbers and symbols that are not important for the NLP task.
Input: “I have 2 apples and 1 orange #fruits”
Output: “I have apples and orange fruits”
import re
text = "I have 2 apples and 1 orange #fruits"
text = re.sub(r"[\d#]", "", text)         # strip digits and the # symbol
text = re.sub(r"\s+", " ", text).strip()  # collapse the leftover double spaces
print(text)
Removing any remaining non-textual elements such as HTML tags, URLs, and email addresses is the final cleanup step. This technique is useful when working with text data scraped from the web, where such elements are not important for the NLP task.
Input: “Please visit <a href=’www.example.com‘>example.com</a> for more information or contact me at info@example.com”
Output: “Please visit for more information or contact me at”
import re
text = "Please visit <a href='www.example.com'>example.com</a> for more information or contact me at info@example.com"
text = re.sub(r"<[^>]+>", "", text)                                              # HTML tags
text = re.sub(r"\S+@\S+", "", text)                                              # email addresses
text = re.sub(r"(https?://\S+|www\.\S+|\b[\w-]+\.(?:com|net|org)\b)", "", text)  # URLs and bare domains
text = re.sub(r"\s+", " ", text).strip()                                         # collapse leftover whitespace
print(text)
It’s important to note that these steps should be applied selectively, depending on the specific requirements of the NLP task and the type of text data being processed. Text normalization is also an iterative process: the steps may be repeated or reordered multiple times.
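Putting several of the steps above together, here is a minimal, dependency-free sketch of a normalization pipeline. The normalize function and its small hand-picked stop word list are illustrative assumptions, not a standard API:

```python
import string

# A tiny illustrative stop word list; a real pipeline would use a fuller
# list such as NLTK's stopwords corpus.
STOP_WORDS = {"the", "a", "an", "over"}

def normalize(text):
    """Apply case normalization, punctuation removal, and stop word
    removal in sequence."""
    text = text.lower()                                               # case normalization
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    words = [w for w in text.split() if w not in STOP_WORDS]          # stop word removal
    return " ".join(words)

print(normalize("The quick BROWN Fox Jumps OVER the lazy dog!!!"))
# quick brown fox jumps lazy dog
```

The order matters here: lowercasing first makes the stop word comparison case-insensitive, and stripping punctuation before splitting keeps tokens like "dog!!!" from slipping past the filter.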
Keyword normalization techniques in NLP are used to standardize and clean keywords or phrases in text data, making them more usable for further analysis. All of the normalization steps above can also be applied to a list of keywords or phrases. They can be used to make keywords and phrases more consistent, more easily searchable, and more usable for natural language processing tasks such as text classification, information retrieval, and natural language understanding.
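As a sketch of this idea, the hypothetical helper below (with a hand-picked stop word list, assumed for illustration) normalizes a keyword list so that punctuation and case variants collapse to a single canonical form:

```python
import string

def normalize_keyword(keyword, stop_words=("the", "of", "a")):
    # Lowercase, replace punctuation with spaces, and drop stop words.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = keyword.lower().translate(table).split()
    return " ".join(w for w in words if w not in stop_words)

keywords = ["Machine-Learning!", "The History of NLP", "machine learning"]
unique = sorted({normalize_keyword(k) for k in keywords})
print(unique)
# ['history nlp', 'machine learning']
```

Mapping punctuation to spaces (rather than deleting it) keeps hyphenated keywords like "Machine-Learning" from fusing into a single token, so they deduplicate against their spaced variants.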
Text normalization techniques are essential for preparing text data for natural language processing (NLP) tasks. Each technique has its advantages and disadvantages, and the appropriate choice depends on the specific requirements of the NLP task and the type of text data being processed.