Natural language processing (NLP) is a subfield of artificial intelligence and linguistics that focuses on processing textual data. There are many different natural language processing techniques, tools, and models. Here, we go over the ten most useful techniques in greater detail so you can get started with, or better understand, any NLP project.
Tokenization is one of the most fundamental and straightforward natural language processing techniques. It splits text into smaller, bite-sized pieces; the most straightforward approach is splitting a sentence into its individual words. These individual words are called “tokens.”
Splitting a sentence is known as tokenization.
Tokenization is a necessary first step for almost any text-based NLP application. Long text strings are broken up into tokens: shorter segments of words, symbols, and numbers. These tokens are the building blocks of an NLP model and help it comprehend the context of the text. Tokenizers typically create tokens by using whitespace as a separator, but various other tokenization techniques are commonly used depending on the language and the modelling objective.
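As a quick illustration, here is a minimal sketch comparing a naive whitespace split with NLTK’s word_tokenize; it assumes the nltk package is installed and its “punkt” model has been downloaded (newer NLTK releases may ask for “punkt_tab” instead).

```python
# A minimal tokenization sketch: naive whitespace split vs NLTK's word_tokenize.
# Assumes nltk is installed and the "punkt" model is downloaded
# (newer NLTK releases may require "punkt_tab" instead).
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import word_tokenize

text = "Tokenization splits text into smaller, bite-sized pieces."
print(text.split())         # splits on whitespace only: "smaller," keeps its comma
print(word_tokenize(text))  # also separates punctuation into its own tokens
```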
Stemming and lemmatization are the next natural language processing techniques found in many pre-processing pipelines. Let’s start with an example. When searching for products in an online store, we want to see products for the exact word we entered as well as any other possible variations. For example, if we type “shirts” into the search box, we’ll likely want to see product results that include that word or words derived from it, like “shirt” or “t-shirt.”
Similar words in English often appear in different forms depending on the tense they are used in and their position in a sentence. For instance, “go,” “going,” and “went” are all forms of the same verb; which form is used depends on the context of the sentence. Stemming and lemmatization aim to recover the root word from these variants.
The difference between Stemming and Lemmatization.
Stemming is a rather crude heuristic technique. It attempts to accomplish the objective above by chopping off the ends of words, which may or may not leave a meaningful word behind. For example, “going” would be changed to “go”, while “boring” would become “bor.” As you can see, stemming results are not always useful for further analysis, as some words lose their meaning.
Lemmatization, on the other hand, is a more advanced technique that aims to carry out this task correctly. It uses a vocabulary and the morphological analysis of words to remove inflectional endings and return the base or dictionary form of a word, known as the lemma.
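The difference is easy to see in code. Below is a minimal sketch comparing NLTK’s Porter stemmer with its WordNet lemmatizer; it assumes nltk is installed and the “wordnet” corpus has been downloaded.

```python
# A minimal sketch comparing a Porter stemmer with a WordNet lemmatizer.
# Assumes nltk is installed and the "wordnet" corpus has been downloaded.
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["going", "boring", "went"]:
    print(word,
          "| stem:", stemmer.stem(word),                    # going -> go, boring -> bor
          "| lemma:", lemmatizer.lemmatize(word, pos="v"))  # went -> go (pos="v" = verb)
```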
Stop word removal is the pre-processing step that often immediately follows stemming or lemmatization. Many words in every language serve only as filler and carry little inherent meaning. Most of these are conjunctions (such as “because,” “and,” or “since”) or prepositions (such as “under,” “above,” “in,” or “at”) used to join sentences together. Unfortunately, a large share of human language is composed of these words, which are rarely helpful when creating an NLP model.
Stop word removal is not a technique to apply blindly to every model. For instance, removing stop words can help a model focus on the words that define the meaning of the text when classifying it into different categories (for example, genre classification, spam filtering, or automatic tag generation). For tasks like text summarization and machine translation, however, stop words may be necessary and are usually kept.
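When stop word removal is appropriate, it is typically a one-line filter. Here is a minimal sketch using NLTK’s English stop word list, assuming nltk is installed and the “stopwords” corpus has been downloaded.

```python
# A minimal stop word removal sketch using NLTK's English stop word list.
# Assumes nltk is installed and the "stopwords" corpus has been downloaded.
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["the", "cat", "sat", "under", "the", "table", "because", "it", "rained"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # ['cat', 'sat', 'table', 'rained']
```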
TF-IDF (term frequency–inverse document frequency) is a statistical method for determining a word’s importance to a document within a collection of documents. The TF-IDF score is the product of two distinct values: the term frequency (TF) and the inverse document frequency (IDF).
The term frequency measures how often a word occurs in a document. Frequently occurring words like “the”, “a”, and “and” will have a high term frequency, while words that appear only rarely in the document will have a low term frequency.
Before discussing inverse document frequency, it’s easier to grasp document frequency first. In a corpus of several documents, the document frequency measures how many of those documents a word appears in. As with the term frequency, commonly used words will have a high document frequency.
The exact opposite of document frequency is inverse document frequency. The inverse document frequency, therefore, assigns a small weight and little importance to frequently used words. Words that rarely occur then get a high rating and become more important. The inverse document frequency essentially assesses a term’s uniqueness within a corpus. Terms with a high IDF are very specific to a given document.
TF-IDF is, therefore, an important measure for identifying keywords in a document: the words that occur frequently in that document but not elsewhere in the corpus.
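Putting the two parts together: tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = log(N / df(t)), N is the number of documents in the corpus, and df(t) is the number of documents containing the term t. Here is a minimal sketch using scikit-learn’s TfidfVectorizer (note that scikit-learn uses a smoothed variant of this formula); it assumes scikit-learn is installed, and the documents are toy examples.

```python
# A minimal TF-IDF sketch with scikit-learn's TfidfVectorizer.
# Assumes scikit-learn is installed; the documents are toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the telescope captured a distant galaxy",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Print each term's TF-IDF weight in the first document. "the" occurs in every
# document, so it gets the minimum possible IDF; "mat" is unique to this
# document, so its IDF (and hence its weight per occurrence) is higher.
for term, idx in sorted(vectorizer.vocabulary_.items()):
    weight = tfidf[0, idx]
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```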
Want to continue reading? Here is an in-depth article on the advantages, disadvantages, use cases and code snippets to get you started with TF-IDF.
Whenever you read, whether it’s a book, a newspaper, or simply a piece of text on your phone, you unconsciously skim through it: you largely ignore the filler words, focus on the text’s essential phrases, and everything else falls into place. Finding the significant keywords in a document is exactly what keyword extraction does. With this text analysis technique, you can quickly gain insight into a subject and condense a text to its most relevant keywords without going through the entire document. For example, keyword extraction is a very useful technique when a company wants to identify the issues customers raise in social media messages, or when you want to pick out the topics of interest from a recent news item.
The simplest way to implement this is with a count vectorizer, as in the sketch below. All this does is count the occurrences of every word, after which you can return, say, the ten most frequent terms. Remember that this technique is best combined with the stop word removal technique described above, or your top words will mostly be common filler words.
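A minimal sketch of this count-based approach with scikit-learn’s CountVectorizer might look like this (assuming scikit-learn is installed; the document is a toy example):

```python
# A minimal count-based keyword extraction sketch with scikit-learn.
# Assumes scikit-learn is installed; stop words are removed, as advised above.
from sklearn.feature_extraction.text import CountVectorizer

document = ("Customers mention delivery delays, delivery costs and damaged "
            "packaging far more often than product quality in these reviews.")

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform([document])

# Rank terms by raw frequency and keep the ten most frequent.
totals = counts.toarray().sum(axis=0)
top_terms = sorted(zip(vectorizer.get_feature_names_out(), totals),
                   key=lambda pair: pair[1], reverse=True)[:10]
print(top_terms)  # e.g. [('delivery', 2), ('costs', 1), ...]
```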
TF-IDF is another popular way of implementing keyword extraction, as this technique also looks at the uniqueness of a word. See the section above for a more detailed explanation.
Countless libraries offer keyword extraction implementations, and different ones will work better for different use cases.
Most machine learning models only take numerical input, so before we can do any machine learning, we need to turn our text into numbers. But how can we translate a block of text into numbers that can be fed to these models? The solution is to represent the text with word embeddings. The added benefit of word embeddings is that words with similar meanings get similar representations: similar words end up close together numerically, while words with nothing in common end up far apart.
The numerical representations of words in a language are called word embeddings, also referred to as vectors. These representations must be learned for words with similar meanings to have vectors that are close to one another. Words are represented as real-valued vectors or coordinates in an n-dimensional, predefined vector space.
For a custom dataset, one can use pre-trained word embeddings (learned on a massive corpus like Wikipedia) or learn word embeddings from scratch. Common choices include Word2Vec, GloVe, and FastText, as well as contextual models such as BERT and ELMo; simpler count-based representations like TF-IDF and CountVectorizer outputs are often used as baselines.
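As a concrete example, here is a minimal sketch of learning Word2Vec embeddings from scratch with gensim; it assumes gensim is installed, and the three-sentence corpus is a toy stand-in for real data.

```python
# A minimal sketch of training Word2Vec embeddings from scratch with gensim.
# Assumes gensim is installed; the corpus is a toy stand-in for real data.
from gensim.models import Word2Vec

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=100)

print(model.wv["cat"][:5])                   # first 5 dimensions of "cat"'s vector
print(model.wv.most_similar("cat", topn=3))  # nearest neighbours in vector space
```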
Sentiment analysis is the task of detecting the emotions associated with a piece of text. It’s a form of text classification in which text fragments are classified as positive, negative, or neutral. Sentiment analysis is extremely useful for automatically detecting the tone of tweets, newspaper articles, reviews, or customer emails. In addition, it is often used to report on customer success or brand sentiment and to help brands detect unhappy customers on social media platforms.
Sentiment analysis categorizes a text into different sentiment categories automatically.
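A quick way to try this is with a rule-based analyser such as NLTK’s VADER. Here is a minimal sketch, assuming nltk is installed and the “vader_lexicon” has been downloaded; the ±0.05 thresholds on the compound score follow VADER’s commonly used convention.

```python
# A minimal rule-based sentiment analysis sketch with NLTK's VADER.
# Assumes nltk is installed and the "vader_lexicon" has been downloaded.
import nltk
nltk.download("vader_lexicon", quiet=True)
from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
for text in ["I love this product!",
             "The delivery was late and the box was damaged."]:
    scores = analyzer.polarity_scores(text)  # neg/neu/pos plus a compound score
    if scores["compound"] >= 0.05:
        label = "positive"
    elif scores["compound"] <= -0.05:
        label = "negative"
    else:
        label = "neutral"
    print(f"{text} -> {label} {scores}")
```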
Topic modelling is a statistical natural language processing technique that examines a corpus of text documents to identify common themes. It is an unsupervised machine learning approach, so it does not require labelled data: documents can be used as they are, without any prior manual annotation. With this method, we can organize and compile electronic archives on a larger scale than would be possible with human annotation. Many different algorithms can carry out topic modelling; one of the most effective is latent Dirichlet allocation (LDA).
With topic modelling, we can find what an article is about without reading it or even searching in a large corpus of documents for a specific article on a particular topic.
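Here is a minimal LDA sketch using scikit-learn; it assumes scikit-learn is installed, and the four toy documents and the choice of two topics are illustrative only.

```python
# A minimal topic modelling sketch with scikit-learn's LDA implementation.
# Assumes scikit-learn is installed; the corpus and topic count are toy choices.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the match ended with a late goal and a penalty",
    "the striker scored twice before the final whistle",
    "the central bank raised interest rates again",
    "inflation and interest rates dominate the markets",
]
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top_words = [terms[j] for j in topic.argsort()[-4:][::-1]]
    print(f"Topic {i}: {top_words}")  # e.g. a "football" topic and a "finance" topic
```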
Text summarization is an NLP technique used to summarize a text clearly, succinctly, and coherently. Summarization lets you pull the essential information out of documents without reading them word for word. Done manually, this process would take a long time; automatic text summarization drastically cuts it down. There are two different approaches to text summarization: extractive methods, which select the most important sentences from the original text, and abstractive methods, which generate new sentences that capture its meaning.
For the most popular ML and deep learning summarization algorithms see this article.
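To make the extractive approach concrete, here is a minimal frequency-based extractive summarizer; it assumes nltk is installed with the “punkt” and “stopwords” resources, and real systems use far more sophisticated models than this.

```python
# A minimal frequency-based extractive summarization sketch.
# Assumes nltk is installed with "punkt" and "stopwords" downloaded;
# real summarizers use far more sophisticated extractive or abstractive models.
from collections import Counter
import nltk
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

def summarize(text, n_sentences=2):
    stop = set(stopwords.words("english"))
    # Count how often each content word occurs in the whole text.
    words = [w.lower() for w in word_tokenize(text)
             if w.isalpha() and w.lower() not in stop]
    freq = Counter(words)
    # Score each sentence by the summed frequency of its words;
    # stop words score zero because Counter returns 0 for missing keys.
    sentences = sent_tokenize(text)
    ranked = sorted(sentences, reverse=True,
                    key=lambda s: sum(freq[w.lower()] for w in word_tokenize(s)))
    return " ".join(ranked[:n_sentences])

article = ("The council approved the new budget on Tuesday. "
           "The budget increases funding for public transport. "
           "Local residents welcomed the transport funding. "
           "A separate vote on parking fees was postponed.")
print(summarize(article))
```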
Named entity recognition (NER) is a subfield of information extraction that deals with finding and categorizing named entities, turning mentions in an unstructured document into predefined categories. People’s names, organizations, locations, events, dates, and monetary values are common categories. NER is somewhat comparable to keyword extraction, except that the extracted entities are assigned to these predefined categories. Many pre-trained NER implementations exist that you can use without needing any labelled data to train on.
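For example, here is a minimal sketch using spaCy’s pre-trained English pipeline; it assumes spaCy is installed and the en_core_web_sm model has been downloaded (python -m spacy download en_core_web_sm).

```python
# A minimal NER sketch with spaCy's pre-trained English pipeline.
# Assumes spaCy is installed and "en_core_web_sm" has been downloaded
# (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple acquired a London startup for $50 million in January 2024.")

for ent in doc.ents:
    print(ent.text, "->", ent.label_)  # e.g. Apple -> ORG, London -> GPE
```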
At Spot Intelligence, we regularly use all these natural language processing techniques in our modelling. What techniques are your favourites, or what should have been on this list and isn’t? Let us know in the comments.