CountVectorizer Tutorial: How To Easily Turn Text Into Features For Any NLP Task

by | May 17, 2023 | Data Science, Natural Language Processing

What is CountVectorizer in NLP?

CountVectorizer is a text preprocessing technique commonly used in natural language processing (NLP) tasks for converting a collection of text documents into a numerical representation. It is part of the scikit-learn library, a popular machine learning library in Python.

CountVectorizer operates by tokenizing the text data and counting the occurrences of each token. It then creates a matrix where the rows represent the documents, and the columns represent the tokens. The cell values indicate the frequency of each token in each document. This matrix is known as the “document-term matrix.”

CountVectorizer Python example with scikit-learn

from sklearn.feature_extraction.text import CountVectorizer

# Sample documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()

# Fit the vectorizer to the documents and transform the documents into a document-term matrix
X = vectorizer.fit_transform(documents)

# Get the feature names (tokens)
feature_names = vectorizer.get_feature_names()

# Print the feature names

# Print the document-term matrix


['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

[[0 1 1 1 0 0 1 0 1]
[0 2 0 1 0 1 1 0 1]
[1 0 0 1 1 0 1 1 1]
[0 1 1 1 0 0 1 0 1]]

In the example, the fit_transform method of CountVectorizer both fits the vectorizer to the documents (learns the vocabulary) and transforms the documents into a document-term matrix. The resulting matrix represents the frequency of each token in each document.

CountVectorizer offers various parameters and options to control its behaviour, such as specifying the minimum document frequency for a token to be included, removing stop words, and using n-grams instead of single tokens. These options can be explored in the scikit-learn documentation for further customization based on specific needs.

Advantages and disadvantages 

Advantages of CountVectorizer

  1. Simplicity: CountVectorizer is easy to use and understand. It has specific parameters and requires minimal configuration to get started with text preprocessing.
  2. Speed and Efficiency: CountVectorizer is computationally efficient and can handle large text datasets with many documents. It utilizes sparse matrix representations to save memory and processing time, especially when dealing with high-dimensional data.
  3. Versatility: CountVectorizer allows for flexible tokenization options, including handling n-grams (consecutive sequences of words) and custom token patterns. It also provides opportunities for filtering stop words and controlling the vocabulary size.
  4. Interpretable Results: The resulting document-term matrix from CountVectorizer provides interpretable results. Each cell in the matrix represents the count or frequency of a token in a specific document, allowing for straightforward analysis and exploration.

Disadvantages of CountVectorizer

  1. Ignores Semantic Information: It treats each token as a separate entity and does not capture semantic relationships between words. It does not consider the context or meaning of words, which might limit its effectiveness in tasks that require an understanding of word semantics.
  2. Bias towards Frequent Words: It assigns higher importance to words that frequently appear in documents. Consequently, common words like “the,” “and,” or “is” may dominate the feature space while potentially ignoring rarer but more meaningful words.
  3. Lack of Normalization: It does not consider document length, meaning longer documents may have higher token counts than shorter documents, even if they discuss the same topics. This lack of normalization might affect specific analyses and algorithms that rely on document length.
  4. Limited Information: It only captures the frequency of tokens within documents. It does not consider the order or sequence of words, which may be relevant in specific text analysis tasks like sentiment analysis or language modelling.

So how can we solve these issues? What other popular vectorizers are there?


What is the TfidfVectorizer?

TfidfVectorizer stands for “Term Frequency-Inverse Document Frequency Vectorizer.” It builds upon the concept of CountVectorizer but incorporates the TF-IDF weighting scheme. TF-IDF is a numerical statistic that reflects the importance of a term (token) in a document within a larger corpus.

The TF-IDF value for a term in a document is calculated by multiplying the term frequency (TF) and inverse document frequency (IDF) components:

  • Term Frequency (TF) represents the frequency of a term in a document. It is typically calculated as the count of the term in the document divided by the total number of terms in the document.
  • Inverse Document Frequency (IDF) measures the rarity of a term in the corpus. It is calculated as the logarithm of the total number of documents divided by the number of documents that contain the term.

TfidfVectorizer tokenizes the text, counts the term frequencies, and applies the IDF transformation to obtain the TF-IDF representation. It creates a matrix where the rows represent the documents, and the columns represent the tokens. The cell values indicate the TF-IDF weights of each token in each document.

What is the difference between CountVectorizer and TfidfVectorizer?


  • CountVectorizer converts a collection of text documents into a matrix where the rows represent the documents, and the columns represent the tokens (words or n-grams).
  • It counts the occurrences of each token in each document, creating a “document-term matrix” with integer values representing the frequency of each token.
  • CountVectorizer does not consider the importance of tokens; it simply counts the occurrences.
  • It is helpful for tasks where the frequency of tokens is essential, such as text classification or clustering based on word frequency.
Countvectorizer is a simple techniques that counts the amount of times a word occurs

Countvectorizer is a simple technique that counts the number of times a word occurs


  • TfidfVectorizer stands for “Term Frequency-Inverse Document Frequency.”
  • Like CountVectorizer, it converts text documents into a matrix representation.
  • However, TfidfVectorizer considers the frequency of tokens in each document and incorporates the inverse document frequency.
  • The inverse document frequency component down weights the tokens that frequently appear across all documents, giving more weight to rare tokens in the corpus.
  • TfidfVectorizer computes a weight for each token in each document, considering both the term frequency (TF) and inverse document frequency (IDF) aspects.
  • It is helpful for tasks where the frequency and rarity of tokens are essential, such as information retrieval, document ranking, or text summarization.

Comparison Example

Here’s a comparison of CountVectorizer and TfidfVectorizer using the same example:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",

# CountVectorizer
count_vectorizer = CountVectorizer()
X_count = count_vectorizer.fit_transform(documents)

# TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)

# Get the feature names (tokens)
feature_names_count = count_vectorizer.get_feature_names()
feature_names_tfidf = tfidf_vectorizer.get_feature_names()

# Print the feature names
print("CountVectorizer feature names:", feature_names_count)
print("TfidfVectorizer feature names:", feature_names_tfidf)

# Print the document-term matrices
print("CountVectorizer document-term matrix:")

print("TfidfVectorizer document-term matrix:")

The output:

CountVectorizer feature names: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
TfidfVectorizer feature names: ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

CountVectorizer document-term matrix:
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
TfidfVectorizer document-term matrix:
[[0.         0.40993715 0.57496187 0.40993715 0.         0.         0.40993715 0.         0.40993715]
 [0.         0.8198743  0.         0.26710379 0.         0.         0.26710379 0.         0.26710379]
 [0.52863461 0.         0.         0.26710379 0.52863461 0.         0.26710379 0.52863461 0.26710379]
 [0.         0.40993715 0.57496187 0.40993715 0.         0.         0.40993715 0.         0.40993715]]

In the CountVectorizer document-term matrix, the values represent the frequency of each token in each document. For example, the value 2 in the second row and the second column indicates that the token “document” appears twice in the second document.

In the TfidfVectorizer document-term matrix, the values represent the TF-IDF weight of each token in each document. The TF-IDF weight combines the term frequency (TF) and the inverse document frequency (IDF). Tokens that are more frequent in a specific document and rare in the overall corpus tend to have higher weights. For example, the value 0.8198743 in the second row and the second column indicate a higher weight for the token “document” in the second document than other tokens.

While CountVectorizer focuses solely on token frequency, TfidfVectorizer considers the frequency and rarity of tokens using the TF-IDF weighting scheme. TfidfVectorizer is commonly used to emphasize the importance of rare words and downplay the influence of common words in a document collection.

Other alternatives to CountVectorizer

There are several other alternatives to CountVectorizer for text vectorization in NLP tasks. Here are a few popular ones:

  1. HashingVectorizer: HashingVectorizer is a memory-efficient alternative to CountVectorizer and TfidfVectorizer. Instead of building and storing a vocabulary, it uses a hashing function to convert tokens into numerical representations directly. This approach avoids the need to keep the entire vocabulary in memory but can lead to potential collisions where different tokens might be hashed to the same value.
  2. Word2Vec: Word2Vec is a word embedding technique representing words as dense vectors in a continuous vector space. It captures semantic relationships between words by considering their context in large text corpora. Word2Vec can be trained on large datasets, or pre-trained models can be used for transfer learning. It provides dense, low-dimensional representations that encode semantic information.
  3. GloVe: GloVe (Global Vectors for Word Representation) is another word embedding technique that learns word vectors by factorizing a word co-occurrence matrix. It combines the advantages of global context (capturing global word relationships) and local context (capturing local word relationships). Pretrained GloVe word vectors are available for various languages and can be used for various NLP tasks.
  4. BERT (Bidirectional Encoder Representations from Transformers): BERT is a state-of-the-art language model that uses a transformer architecture to capture contextual information from text. It generates word embeddings that consider both each word’s left and right context. BERT can be fine-tuned on specific tasks or used as a feature extractor to obtain contextualized word representations.

These alternatives offer different approaches and capabilities for text vectorization. The choice depends on the specific task, the available data, the importance of semantic information, and the computational resources at hand.


CountVectorizer is a simple and efficient text preprocessing technique that converts text documents into a numerical representation based on token frequency. It provides a document-term matrix that represents the occurrence of tokens in each document. CountVectorizer is easy to use, computationally efficient, and versatile regarding tokenization options.

However, CountVectorizer has some limitations. It lacks semantic understanding, treating each token separately without capturing semantic relationships. As a result, it can be biased towards frequent words, potentially ignoring rarer but more meaningful words. It does not consider document length, which may impact specific analyses. Additionally, it does not capture word order or context.

Alternative techniques such as TfidfVectorizer, HashingVectorizer, Word2Vec, GloVe, and BERT can address these limitations. In addition, these alternatives offer TF-IDF weighting, memory efficiency, semantic understanding, contextualized word embeddings, and more advanced language modelling capabilities.

The choice of text vectorization technique depends on the specific task, dataset, and requirements. It is essential to consider the trade-offs between simplicity, efficiency, interpretability, semantic understanding, and advanced language modelling capabilities to select the most appropriate technique for a given NLP task.

About the Author

Neri Van Otten

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.

Recent Articles

Factor analysis example of what is a variable and what is a factor

Factor Analysis Made Simple & How To Tutorial In Python

What is Factor Analysis? Factor analysis is a potent statistical method for comprehending complex datasets' underlying structure or patterns. Its primary objective is...

glove vector example "king" is to "queen" as "man" is to "woman"

How To Implement GloVe Embeddings In Python: 3 Tutorials & 9 Alternatives

What are GloVe Embeddings? GloVe, or Global Vectors for Word Representation, is an unsupervised learning algorithm that obtains vector word representations by analyzing...

q-learning explained witha a mouse navigating a maze and updating it's internal staate

Reinforcement Learning: Q-learning & Deep Q-Learning Made Simple

What is Q-learning in Machine Learning? In machine learning, Q-learning is a foundational reinforcement learning technique for decision-making in uncertain...

DALL-E the text description "A cat sitting on a beach chair wearing sunglasses,"

Generative Artificial Intelligence (AI) Made Simple [Complete Guide With Models & Examples]

What is Generative Artificial Intelligence (AI)? Generative artificial intelligence (GAI) is a type of AI that can create new and original content, such as text, music,...

5 key aspects of GPT prompt engineering

How To Guide To Chat-GPT, GPT-3 & GPT-4 Prompt Engineering [10 Types]

What is GPT prompt engineering? GPT prompt engineering is the process of crafting prompts to guide the behaviour of GPT language models, such as Chat-GPT, GPT-3,...

What is LLM Orchestration

How to manage Large Language Models (LLM) — Orchestration Made Simple [5 Frameworks]

What is LLM Orchestration? LLM orchestration is the process of managing and controlling large language models (LLMs) in a way that optimizes their performance and...

Content-Based Recommendation System where a user is recommended similar movies to those they have already watched

How To Build Content-Based Recommendation System Made Easy [Top 8 Algorithms & Python Tutorial]

What is a Content-Based Recommendation System? A content-based recommendation system is a sophisticated breed of algorithms designed to understand and cater to...

Nodes and edges in a knowledge graph

Knowledge Graph: How To Tutorial In Python, LLM Comparison & 23 Tools & Libraries

What is a Knowledge Graph? A Knowledge Graph is a structured representation of knowledge that incorporates entities, relationships, and attributes to create a...

The mixed signals and need to be reverse-engineer to get the original sources with ICA

Independent Component Analysis (ICA) Made Simple & How To Tutorial In Python

What is Independent Component Analysis (ICA)? Independent Component Analysis (ICA) is a powerful and versatile technique in data analysis, offering a unique perspective...


Submit a Comment

Your email address will not be published. Required fields are marked *

nlp trends

2024 NLP Expert Trend Predictions

Get a FREE PDF with expert predictions for 2024. How will natural language processing (NLP) impact businesses? What can we expect from the state-of-the-art models?

Find out this and more by subscribing* to our NLP newsletter.

You have Successfully Subscribed!