N-grams Made Simple & How To Implement In Python (NLTK)

by Neri Van Otten | Apr 5, 2023 | Natural Language Processing

In natural language processing, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be characters, words, or other units of text, and n-grams are used to analyze the frequency and distribution of linguistic patterns in that sample.

A 2-gram, also called a “bigram,” is two adjacent words in a sentence, like “natural language” or “language processing.” A 3-gram, or “trigram,” is three adjacent words, for example “natural language processing” or “language processing models.”

N-grams are useful for many natural language processing tasks, like language modelling, machine translation, and sentiment analysis, because they show how words and phrases in text fit into their local context. They are also used in information retrieval systems, where they can be used to match search queries with relevant documents based on shared n-grams.

A simple example of n-grams

Consider the sentence: “The quick brown fox jumps over the lazy dog”

A trigram of this sentence would be a sequence of three words. We can generate all possible trigrams from this sentence by sliding a window of three words over the sentence:

  • “The quick brown”
  • “quick brown fox”
  • “brown fox jumps”
  • “fox jumps over”
  • “jumps over the”
  • “over the lazy”
  • “the lazy dog”
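The sliding window described above can be sketched in a few lines of plain Python. The helper name `sliding_ngrams` is made up for this illustration and is not a library function.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n adjacent tokens, in order (hypothetical helper)."""
    # A sequence of length L has L - n + 1 windows of size n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The quick brown fox jumps over the lazy dog"
trigrams = sliding_ngrams(sentence.split(), 3)
for trigram in trigrams:
    print(" ".join(trigram))
```

A nine-word sentence has 9 - 3 + 1 = 7 trigrams, so the loop prints seven lines.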

We can then use these trigrams for natural language processing tasks like language modelling, text classification, or sentiment analysis. For example, to use them for text classification, we could count the frequency of each trigram in a set of documents and use these counts as features in a machine learning model.
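As a rough sketch of the counting step, the snippet below builds one trigram count table per document with the standard library's `collections.Counter`. The two documents and the helper name `trigram_counts` are invented for illustration.

```python
from collections import Counter

# Toy documents, invented purely for illustration.
documents = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox sleeps",
]


def trigram_counts(text):
    """Count each word trigram in one document (hypothetical helper)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


# One Counter per document; each can serve as a sparse feature vector.
features = [trigram_counts(doc) for doc in documents]

# Corpus-wide totals, e.g. for deciding which trigrams to keep as features.
totals = sum(features, Counter())
print(totals[("the", "quick", "brown")])
```

In practice the per-document counts would be mapped onto a fixed feature vocabulary before being passed to a classifier.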

Advantages of n-grams

Some advantages of using n-grams in natural language processing include:

  1. Local context: N-grams make it possible to get local context from text, which can help you understand how words and phrases are used and what they mean. 
  2. Flexibility: N-grams can be used with single characters and larger chunks of text, making it possible to analyse language in different ways. 
  3. Efficiency: N-grams are simple to compute, and for small values of n they are cheap to store, which makes them practical for large-scale tasks in natural language processing. 
  4. Feature extraction: N-grams can be used as features in machine learning models. This is a way to capture linguistic patterns and make models better at tasks like classifying text, analysing sentiment, and finding information. 
  5. Language modelling: N-grams are used in language modelling, an important task in natural language processing. Language models based on n-grams can be used for speech recognition, machine translation, and text prediction tasks.

Overall, using n-grams can help improve the accuracy and efficiency of various natural language processing tasks and provide insights into the usage and meaning of language in the text.
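The flexibility point can be illustrated with character n-grams, which use the same sliding window over characters instead of words. This is a minimal sketch; `char_ngrams` is a hypothetical helper name.

```python
def char_ngrams(text, n):
    """Return the character n-grams of text, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("n-gram", 3))
```

Character n-grams are often handy for noisy or misspelled text, because two spellings of the same word still share most of their n-grams.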

Disadvantages of n-grams

N-grams also have several limitations in natural language processing:

  1. Limited context: N-grams capture local context but miss dependencies that span more than n items. This makes them less useful for tasks like discourse analysis or determining the overall meaning of a text. 
  2. Data sparsity: The number of possible n-grams grows exponentially with n: a vocabulary of V words allows V^n different n-grams. Most of them never occur in any given corpus, so it becomes hard to estimate reliable statistics for longer n-grams. 
  3. Lack of semantic understanding: N-grams do not have a semantic understanding of language, which means that they may not be able to capture the meaning or intent behind words and phrases and may be prone to errors in some tasks.
  4. Overfitting: Using n-grams as features in machine learning models can lead to overfitting, which occurs when a model is too complex and performs well on training data but poorly on new or unseen data.
  5. Computational cost: Counting n-grams is cheap for small n, but the number of distinct n-grams, and with it the memory needed to store them, grows quickly with n and with the size of the dataset. This can slow down or limit some natural language processing tasks. 

Overall, n-grams have a lot of uses in natural language processing, but it’s important to know their limits and use them in the right way for the job. 
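The data sparsity problem is easy to see even on a tiny example. The sketch below compares the number of distinct n-grams actually observed with the number that the vocabulary would allow; the corpus is invented for illustration.

```python
from collections import Counter

# Toy corpus, invented for illustration; on real corpora the gap between
# possible and observed n-grams is far larger.
text = "the cat sat on the mat and the dog sat on the log"
tokens = text.split()
vocab_size = len(set(tokens))

stats = {}
for n in (1, 2, 3):
    # zip over n shifted copies of the token list yields all n-grams.
    observed = Counter(zip(*(tokens[i:] for i in range(n))))
    possible = vocab_size ** n
    stats[n] = (len(observed), possible)
    print(f"n={n}: {len(observed)} distinct n-grams observed "
          f"out of {possible} possible")
```

With only 8 distinct words, the number of possible trigrams is already 512, while the text contains just 10 of them.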

Alternatives to n-grams

There are several alternatives to using n-grams in natural language processing:

  1. Word embeddings: Word embeddings represent words as dense vectors in a high-dimensional space, where words with similar meanings are closer together. They capture how words are semantically related to each other and are often used as features in machine learning models for natural language processing tasks. 
  2. Syntax-based models: These models analyse the syntactic structure of a sentence or text to help determine what it means. Examples of syntax-based models include dependency parsers and constituency parsers.
  3. Topic modelling: Topic modelling is a statistical method for finding the main topics in a large text collection. It can be used for text classification, clustering, and information retrieval tasks.
  4. Deep learning models: Deep learning models, such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs), can be used for a wide range of natural language processing tasks, such as language modelling, sentiment analysis, and machine translation. 
  5. Rule-based models: Rule-based models use a set of handcrafted rules to analyze and understand the text. While they may not be as flexible as other approaches, they can be useful in certain domains with well-defined rules, such as medical or legal text.

Overall, there are many ways to do natural language processing without using n-grams. The best way will depend on the task and domain being studied. 

Applications of n-grams 

N-grams can be used in various ways for different natural language processing tasks. Here are a few examples:

  1. Language modelling: N-grams can be used to model how likely words are to follow one another in a language. Such a model can then generate new text or estimate the likelihood of a given sentence. For example, a trigram language model calculates the probability of a word based on the previous two words.
  2. Text classification: N-grams can be used as features in machine learning models for text classification tasks. For example, we can create a bag-of-n-grams representation of a document by counting the frequency of each n-gram in the text and then use this representation as input to a machine learning classifier.
  3. Sentiment analysis: N-grams can help classify the sentiment a text expresses. For example, we can use bigrams to find common phrases that signal positive or negative sentiment, such as “not good.” 
  4. Named entity recognition: N-grams can be used to identify named entities in text, such as names of people, organizations, or locations. For example, we can use bigrams and trigrams to find common phrases likely to be the names of organizations, like “United Nations” or “New York Times.” 
  5. Machine translation: N-grams can be used as a feature in machine translation models to improve the accuracy of the translation. For example, we can use bigrams or trigrams to capture common phrases or collocations in the source and target languages.

“New York” is an example of a phrase that loses its meaning when split into two separate words, and this is where n-grams shine.

The choice of n-gram size and how to use them will depend on the specific natural language processing task and the data being analyzed. It is important to try out different n-gram sizes and methods to find the best method for each task. 
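To make the language modelling application concrete, here is a minimal sketch of a trigram model that estimates the probability of a word given the previous two words from raw counts. The corpus and the function name `trigram_probability` are assumptions made for illustration, and the estimate is unsmoothed, which a practical model would not be.

```python
from collections import Counter

# Toy corpus, invented for illustration.
tokens = "i like green tea . i like green apples . i like black tea .".split()

trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
# Count only the two-word contexts that are followed by a third word.
context_counts = Counter(zip(tokens, tokens[1:-1]))


def trigram_probability(w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2), without smoothing."""
    context_count = context_counts[(w1, w2)]
    if context_count == 0:
        return 0.0  # unseen context; a real model would smooth or back off
    return trigram_counts[(w1, w2, w3)] / context_count


print(trigram_probability("like", "green", "tea"))
```

Here “like green” occurs twice and is followed by “tea” once, so the estimate is 0.5.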

Choosing the right n: the n-gram bias-versus-variance trade-off

The bias-variance trade-off is a fundamental machine learning concept that also applies to n-grams. Bias is the error a model makes because it is too simple to capture the patterns in the data. Variance is the error it makes because it is too sensitive to the particular training data it saw. Reducing one usually increases the other.

A model with a lot of bias will be too simple and might not pick up on all the important patterns in the training data. This is called “underfitting.” On the other hand, a model with high variance will be too complex and may fit the training data too well, leading to overfitting.

In the case of n-grams, the size of the n-gram can be seen as a hyperparameter that controls the bias-variance trade-off. Smaller n-grams (e.g., unigrams or bigrams) may have higher bias but lower variance, while larger n-grams (e.g., trigrams or four-grams) may have lower bias but higher variance. The choice of n-gram size will depend on the specific natural language processing task and the size and complexity of the dataset.

It’s important to find the right balance between bias and variance when using n-grams, as both overfitting and underfitting lead to poor performance. Overfitting can be reduced by adding more training data, by smoothing the counts in an n-gram language model, or by applying L1 or L2 regularisation when n-grams are used as features in a classifier. Underfitting can be reduced by using longer n-grams or richer features. Cross-validation can be used to compare n-gram sizes and choose the one that works best for the given task and data.
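One simple way to see the variance side of the trade-off is to check how many n-grams in held-out text never appeared in the training text. The sketch below is a simplified stand-in for a proper cross-validation loop; both texts and the helper name `ngram_list` are invented for illustration.

```python
def ngram_list(tokens, n):
    """Return all n-grams of a token list as tuples (hypothetical helper)."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Toy training and held-out texts, invented for illustration.
train = "the cat sat on the mat . a dog slept on the rug .".split()
held_out = "the dog sat on the rug .".split()

unseen_rate = {}
for n in (1, 2, 3, 4):
    seen = set(ngram_list(train, n))
    test_ngrams = ngram_list(held_out, n)
    unseen = sum(1 for gram in test_ngrams if gram not in seen)
    unseen_rate[n] = unseen / len(test_ngrams)
    print(f"n={n}: {unseen_rate[n]:.0%} of held-out n-grams unseen in training")
```

The share of unseen n-grams rises with n, which is the sparsity that makes larger n-gram models fit the training data closely but generalise less reliably.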

How to implement n-grams in Python with NLTK

You can use the NLTK (Natural Language Toolkit) library in Python to create n-grams from text data. The following code snippet shows how to create bigrams (2-grams) from a list of words using NLTK:

from nltk.util import ngrams

# Example sentence
sentence = "Natural language processing is a field of study focused on the interactions between human language and computers."

# Tokenize the sentence into words
words = sentence.split()

# Create bigrams from the list of words
bigrams = ngrams(words, 2)

# Print the bigrams
for bigram in bigrams:
    print(bigram)

Output:

('Natural', 'language')
('language', 'processing')
('processing', 'is')
('is', 'a')
('a', 'field')
('field', 'of')
('of', 'study')
('study', 'focused')
('focused', 'on')
('on', 'the')
('the', 'interactions')
('interactions', 'between')
('between', 'human')
('human', 'language')
('language', 'and')
('and', 'computers.')

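In this example, we first tokenize the sentence into a list of words using the split() method. We then use the ngrams() function from NLTK to create bigrams from the list of words. Finally, we iterate over the bigrams and print them. Because split() only splits on whitespace, punctuation stays attached to the neighbouring word, which is why the last bigram contains 'computers.' with its full stop. A tokenizer such as NLTK's word_tokenize() separates punctuation into its own tokens.

Note that you can change the size of the n-grams by passing a different value as the second argument to the ngrams() function. For example, to create trigrams, you would call ngrams(words, 3).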
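The ngrams() function also accepts padding options, which can be useful when the words at the start and end of a sentence should appear in full-length n-grams too. The sketch below assumes the boundary symbols "<s>" and "</s>"; these are a common convention rather than anything NLTK requires.

```python
from nltk.util import ngrams

words = "the cat sat".split()

# pad_left / pad_right add boundary symbols so that words at the
# edges of the sentence also appear in full-length n-grams.
# ngrams() returns a one-pass iterator, so list() is used to keep the result.
trigrams = list(
    ngrams(
        words,
        3,
        pad_left=True,
        pad_right=True,
        left_pad_symbol="<s>",
        right_pad_symbol="</s>",
    )
)
for trigram in trigrams:
    print(trigram)
```

With padding on both sides, a three-word sentence yields five trigrams instead of one.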

What is a skip-gram?

Skip-gram is a word embedding model often used in natural language processing tasks, especially for those with bigger vocabularies. Unlike n-grams, which represent fixed-length sequences of adjacent words, skip-gram represents each word as a dense vector in a high-dimensional space. The distance between vectors reflects the similarity between words.

A skip-gram model attempts to learn each word’s vector so that words with similar semantic properties are close to one another in the vector space. To do this, it trains a neural network to predict the context words that appear near a given target word. For example, given the sentence “The cat sat on the mat” and a context window of two words on each side, the skip-gram model would be trained to predict the context words “the,” “cat,” “on,” and “the” from the target word “sat.”

Skip-gram models are usually trained on large text corpora, such as Wikipedia or web crawls, and learn relationships between words from how often they appear together. The resulting embeddings are useful for various natural language processing tasks, such as sentiment analysis, named entity recognition, and machine translation. Popular libraries for training skip-gram models include Gensim and TensorFlow.
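The training data for a skip-gram model is a set of (target, context) word pairs. The sketch below only generates those pairs for a window of two words on each side; it does not train a network, and `skipgram_pairs` is a hypothetical helper name.

```python
def skipgram_pairs(tokens, window):
    """Yield (target, context) pairs within `window` words of each target."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield target, tokens[j]


tokens = "the cat sat on the mat".split()
pairs = list(skipgram_pairs(tokens, window=2))

# Context words paired with the target "sat".
print([context for target, context in pairs if target == "sat"])
```

Each pair becomes one training example: the network receives the target word as input and is asked to predict the context word.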


Conclusion

N-grams are a powerful tool in natural language processing that can help us model the probability distribution of words in a language and capture relationships between words based on their co-occurrence patterns. N-grams can be used for various tasks, such as language modelling, text classification, and sentiment analysis.

However, it’s important to consider the bias-variance trade-off when using n-grams and choose the appropriate n-gram size based on the specific task and data. In addition, there are alternative models, such as skip-grams, that can be used to generate word embeddings that capture more nuanced relationships between words.

Overall, n-grams are a powerful and versatile tool that can help us extract insights from text data and make sense of the vast amounts of natural language data generated daily.

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.
