Complete Guide To N-Grams And How To Implement Them In Python With NLTK

by Neri Van Otten | Apr 5, 2023 | Natural Language Processing

In natural language processing, an n-gram is a contiguous sequence of n items from a given sample of text or speech. These items can be characters, words, or other units of text, and they are used to analyze the frequency and distribution of linguistic patterns in a given sample.

A 2-gram, also called a “bigram,” is two adjacent words in a sentence, like “natural language” or “language processing.” A 3-gram, or “trigram,” is a sequence of three adjacent words. For example, “natural language processing” or “language processing models” are 3-grams.

N-grams are useful for many natural language processing tasks, like language modelling, machine translation, and sentiment analysis, because they show how words and phrases in text fit into their local context. They are also used in information retrieval systems, where they can be used to match search queries with relevant documents based on shared n-grams.

A simple example of n-grams

Consider the sentence: “The quick brown fox jumps over the lazy dog”

A trigram of this sentence would be a sequence of three words. We can generate all possible trigrams from this sentence by sliding a window of three words over the sentence:

  • “The quick brown”
  • “quick brown fox”
  • “brown fox jumps”
  • “fox jumps over”
  • “jumps over the”
  • “over the lazy”
  • “the lazy dog”

Then, we can use these trigrams for natural language processing tasks like language modelling, text classification, or sentiment analysis. For example, if we wanted to use them for text classification, we could count the frequency of each trigram in a set of documents and use these counts as features in a machine learning model.
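As a rough sketch of that idea, the snippet below counts trigram frequencies with Python’s collections.Counter. The helper name trigram_counts() is just an illustrative choice, not a standard API:

from collections import Counter

# Count how often each trigram occurs in a text (illustrative helper).
def trigram_counts(text):
    words = text.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    return Counter(trigrams)

counts = trigram_counts("The quick brown fox jumps over the lazy dog")
for trigram, count in counts.items():
    print(trigram, count)

Each document’s trigram counts could then be assembled into a feature vector for a classifier.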

Advantages of n-grams

Some advantages of using n-grams in natural language processing include:

  1. Local context: N-grams make it possible to get local context from text, which can help you understand how words and phrases are used and what they mean. 
  2. Flexibility: N-grams can be built from single characters as well as larger units of text, making it possible to analyse language at different levels (see the character-level sketch after this list). 
  3. Efficiency: It is easy to compute and store N-grams, which makes them good for large-scale tasks in natural language processing. 
  4. Feature extraction: N-grams can be used as features in machine learning models, capturing linguistic patterns that improve performance on tasks like text classification, sentiment analysis, and information retrieval. 
  5. Language modelling: N-grams are used in language modelling, an important task in natural language processing. Language models based on n-grams can be used for speech recognition, machine translation, and text prediction tasks.
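
To illustrate the flexibility point above, the same ngrams() function used later in this article works on characters too, since it accepts any sequence. A minimal sketch:

from nltk.util import ngrams

# Character-level trigrams: iterating over a string yields characters.
word = "language"
char_trigrams = ["".join(gram) for gram in ngrams(word, 3)]
print(char_trigrams)
# ['lan', 'ang', 'ngu', 'gua', 'uag', 'age']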

Overall, using n-grams can help improve the accuracy and efficiency of various natural language processing tasks and provide insights into the usage and meaning of language in the text.

Disadvantages of n-grams

Despite their strengths, n-grams also have some notable drawbacks in natural language processing:

  1. Limited context: N-grams capture local context but miss broader, document-level context. This can make them less useful for tasks like discourse analysis or determining a text’s overall meaning. 
  2. Data sparsity: As the size of an n-gram grows, the number of possible combinations explodes, and most of them appear rarely or never in a corpus. This makes it hard to estimate reliable statistics for all possible n-grams (see the sketch after this list). 
  3. Lack of semantic understanding: N-grams do not have a semantic understanding of language, which means that they may not be able to capture the meaning or intent behind words and phrases and may be prone to errors in some tasks.
  4. Overfitting: Using n-grams as features in machine learning models can lead to overfitting, which occurs when a model is too complex and performs well on training data but poorly on new or unseen data.
  5. Computational cost: Extracting, storing, and matching very large numbers of n-grams can become expensive on big datasets, which can affect the performance of some natural language processing pipelines. 
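
A quick sketch of the data sparsity point (item 2): on a toy corpus, the number of distinct n-grams climbs towards the total number of windows as n grows, so most n-grams are seen only once. The toy corpus here is an assumption for illustration:

from nltk.util import ngrams

# As n grows, almost every n-gram in the corpus becomes unique.
corpus = ("the cat sat on the mat and the dog sat on the rug "
          "because the cat and the dog like to sit").split()

for n in range(1, 5):
    grams = list(ngrams(corpus, n))
    print(f"n={n}: {len(grams)} windows, {len(set(grams))} distinct")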

Overall, n-grams have a lot of uses in natural language processing, but it’s important to know their limits and use them in the right way for the job. 

Alternatives to n-grams

There are several alternatives to using n-grams in natural language processing:

  1. Word embeddings: Word embeddings represent words as dense vectors in a continuous vector space, where words with similar meanings are closer together. They capture semantic relationships between words and are often used as features in machine learning models for natural language processing tasks. 
  2. Syntax-based models: These models look at how a sentence or text is put together syntactically to determine what it means. Examples of syntax-based models include dependency parsers and constituency parsers.
  3. Topic modelling: A statistical method for finding the main topics in a large text collection. It can be used for text classification, clustering, and information retrieval tasks.
  4. Deep learning models: Deep learning models, such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs), can be used for a wide range of natural language processing tasks, such as language modelling, sentiment analysis, and machine translation. 
  5. Rule-based models: Rule-based models use a set of handcrafted rules to analyze and understand the text. While they may not be as flexible as other approaches, they can be useful in certain domains with well-defined rules, such as medical or legal text.

Overall, there are many ways to do natural language processing without using n-grams. The best way will depend on the task and domain being studied. 

Applications of n-grams 

N-grams can be used in various ways for different natural language processing tasks. Here are a few examples:

  1. Language modelling: N-grams can be used to model how likely words are to follow one another. Such a model can then generate new text or estimate the likelihood of a given sentence. For example, a trigram language model calculates the probability of a word based on the previous two words (see the sketch after this list).
  2. Text classification: N-grams can be used as features in machine learning models for text classification tasks. For example, we can create a bag-of-n-grams representation of a document by counting the frequency of each n-gram in the text and then use this representation as input to a machine learning classifier.
  3. Sentiment analysis: N-grams can be used as features to classify the sentiment expressed in a text. For example, bigrams can capture common phrases that signal positive or negative sentiment. 
  4. Named entity recognition: N-grams can be used to identify named entities in text, such as names of people, organizations, or locations. For example, bigrams and trigrams can surface phrases that are likely to be organization names, like “United Nations” (a bigram) or “New York Times” (a trigram). 
  5. Machine translation: N-grams can be used as a feature in machine translation models to improve the accuracy of the translation. For example, we can use bigrams or trigrams to capture common phrases or collocations in the source and target languages.
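
As a minimal sketch of the language modelling idea in point 1, the snippet below estimates trigram probabilities with simple maximum-likelihood counts: P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2). The tiny training text is an assumption for illustration only:

from collections import Counter
from nltk.util import ngrams

text = "the cat sat on the mat the cat sat on the sofa".split()

trigram_counts = Counter(ngrams(text, 3))
bigram_counts = Counter(ngrams(text, 2))

# Maximum-likelihood estimate of P(w3 | w1, w2).
def trigram_prob(w1, w2, w3):
    if bigram_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

print(trigram_prob("cat", "sat", "on"))  # 1.0: "cat sat" is always followed by "on"
print(trigram_prob("on", "the", "mat"))  # 0.5: "on the" is followed by "mat" or "sofa"

A real language model would add smoothing so that unseen trigrams do not get zero probability.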

“New York” is an example of a word that loses meaning when split into two words and where the use of n-grams shines.

The choice of n-gram size and how to use them will depend on the specific natural language processing task and the data being analyzed. It is important to experiment with different n-gram sizes and approaches to find what works best for each task. 

Choosing the right n: the n-gram bias-versus-variance trade-off

The bias-variance trade-off is a fundamental machine learning concept that also applies to n-grams. Here it refers to the tension between a model that is too simple to capture the patterns in the data (high bias) and one that is so sensitive to the training data that it fits noise (high variance).

A model with a lot of bias will be too simple and might not pick up on all the important patterns in the training data. This is called “underfitting.” On the other hand, a model with high variance will be too complex and may fit the training data too well, leading to overfitting.

In the case of n-grams, the size of the n-gram can be seen as a hyperparameter that controls the bias-variance trade-off. Smaller n-grams (e.g., unigrams or bigrams) may have higher bias but lower variance, while larger n-grams (e.g., trigrams or four-grams) may have lower bias but higher variance. The choice of n-gram size will depend on the specific natural language processing task and the size and complexity of the dataset.

It’s important to find the right balance between bias and variance when using n-grams, as both overfitting and underfitting lead to poor performance. When n-grams are used as features in a model, overfitting can be controlled with regularisation techniques such as L1 or L2 regularisation (smoothing plays a similar role in n-gram language models), while underfitting can be addressed by increasing the n-gram size or adding richer features. Cross-validation can also be used to evaluate the bias-variance trade-off and choose the best n-gram size for the given task and data.
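
As a hedged sketch of that last idea, scikit-learn’s CountVectorizer accepts an ngram_range parameter, so cross-validation can compare different maximum n-gram sizes. The toy texts and labels below are assumptions; a real comparison needs a real corpus:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = ["great movie", "loved this film", "what a great cast", "really fun watch",
         "terrible movie", "hated this film", "what a boring cast", "really dull watch"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Compare models built from unigrams only, up to bigrams, and up to trigrams.
for n in (1, 2, 3):
    model = make_pipeline(CountVectorizer(ngram_range=(1, n)), LogisticRegression())
    scores = cross_val_score(model, texts, labels, cv=2)
    print(f"up to {n}-grams: mean accuracy {scores.mean():.2f}")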

How to implement n-grams in Python with NLTK

You can use the NLTK (Natural Language Toolkit) library in Python to create n-grams from text data. The following code snippet shows how to create bigrams (2-grams) from a list of words using NLTK:

from nltk.util import ngrams

# Example sentence
sentence = "Natural language processing is a field of study focused on the interactions between human language and computers."

# Tokenize the sentence into words
words = sentence.split()

# Create bigrams from the list of words
bigrams = ngrams(words, 2)

# Print the bigrams
for bigram in bigrams:
    print(bigram)

Output:

('Natural', 'language')
('language', 'processing')
('processing', 'is')
('is', 'a')
('a', 'field')
('field', 'of')
('of', 'study')
('study', 'focused')
('focused', 'on')
('on', 'the')
('the', 'interactions')
('interactions', 'between')
('between', 'human')
('human', 'language')
('language', 'and')
('and', 'computers.')

In this example, we first tokenize the sentence into a list of words using the split() method. We then use the ngrams() function from NLTK to create bigrams from the list of words. Finally, we iterate over the bigrams and print them.

Note that you can change the size of the n-grams by passing a different value as the second argument to the ngrams() function. For example, to create trigrams, you would pass ngrams(words, 3).
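
Note also that split() leaves punctuation attached to words, which is why the last bigram above is ('and', 'computers.'). NLTK’s word_tokenize() separates punctuation properly, and collections.Counter turns the n-grams into frequency counts. A short sketch, assuming the “punkt” tokenizer models have been downloaded with nltk.download("punkt"):

from collections import Counter

from nltk import word_tokenize
from nltk.util import ngrams

sentence = ("Natural language processing is a field of study focused on "
            "the interactions between human language and computers.")

# word_tokenize splits "computers." into "computers" and "."
tokens = word_tokenize(sentence)
trigram_counts = Counter(ngrams(tokens, 3))

# Print the three most frequent trigrams (each appears once in one sentence).
for trigram, count in trigram_counts.most_common(3):
    print(trigram, count)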

What is a skip-gram?

Skip-gram is a word embedding model widely used in natural language processing. Unlike n-grams, which represent fixed-length sequences of adjacent words, skip-gram represents each word as a dense vector in a continuous vector space, where the distance between vectors reflects the similarity between words.

A skip-gram model attempts to learn each word’s vector so that words with similar semantic properties are close to one another in the vector space. It trains a neural network to predict the context words that appear near a given target word. For example, given the sentence “The cat sat on the mat,” a skip-gram model might try to predict the context words “cat,” “on,” and “the” given the target word “sat.”

Skip-gram models are typically trained on large text corpora, such as Wikipedia or web crawls, and can capture complex relationships between words based on how often they appear together. They work well for various natural language processing tasks, such as sentiment analysis, named entity recognition, and machine translation. Popular libraries for training skip-gram models include Gensim and TensorFlow.
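
A minimal sketch of training a skip-gram model with Gensim follows; sg=1 selects skip-gram rather than CBOW, and the parameter names assume Gensim 4.x. The two-sentence corpus is an assumption for illustration; useful embeddings need a large corpus:

from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# sg=1 trains a skip-gram model; sg=0 would train CBOW instead.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["sat"].shape)                 # (50,): dense vector for "sat"
print(model.wv.most_similar("cat", topn=2))  # nearest neighbours in the toy space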

Conclusion

N-grams are a powerful tool in natural language processing that can help us model the probability distribution of words in a language and capture complex relationships between words based on their co-occurrence patterns. N-grams can be used for various tasks, such as language modelling, text classification, and sentiment analysis.

However, it’s important to consider the bias-variance trade-off when using n-grams and choose the appropriate n-gram size based on the specific task and data. In addition, there are alternative models, such as skip-grams, that can be used to generate word embeddings that capture more nuanced relationships between words.

Overall, n-grams are a powerful and versatile tool that can help us extract insights from text data and make sense of the vast amounts of natural language data generated daily.

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence and a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation, dedicated to making your projects succeed.
