Understanding & Implementing The Continuous Bag-of-Words (CBOW) Model

by Neri Van Otten | Jul 27, 2023 | Artificial Intelligence, Machine Learning, Natural Language Processing

Introduction to word embeddings

Word embeddings have become a cornerstone of Natural Language Processing (NLP), transforming how machines process and understand human language. These vector representations of words capture the semantic meaning and relationships between words, enabling algorithms to work with text data effectively. Among various word embedding techniques, one that stands out for its simplicity and efficiency is the Continuous Bag-of-Words (CBOW) model. This comprehensive blog post will delve into CBOW to understand its theoretical foundations, working principles, applications in NLP, and more.

What is the Continuous Bag-of-Words model?

Continuous bag-of-words (CBOW) is a neural network model for learning word embeddings. Word embeddings are distributed representations of words that capture the semantic and syntactic relationships between words. CBOW predicts a target word given the context words in a sentence.

The CBOW model is a shallow neural network with three layers: an input layer, a hidden layer, and an output layer. The input layer represents the context words in a sentence. The hidden layer learns the word embeddings. The output layer predicts the target word.
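To make the three layers concrete, here is a minimal NumPy sketch of a single CBOW forward pass. The toy vocabulary, the embedding size, the weight names `W_in` and `W_out`, and the random initialisation are all assumptions made for illustration; a trained model would have learned these weights.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["the", "quick", "brown", "fox", "jumps"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 4  # vocabulary size and embedding dimension (toy values)

# Randomly initialised weights; training would adjust these.
W_in = rng.normal(scale=0.1, size=(V, D))   # input -> hidden (the word embeddings)
W_out = rng.normal(scale=0.1, size=(D, V))  # hidden -> output

def cbow_forward(context_words):
    """Return a probability distribution over the vocabulary for the target word."""
    ids = [word_to_id[w] for w in context_words]
    # Multiplying a one-hot vector by W_in selects one row, so we index directly.
    hidden = W_in[ids].mean(axis=0)      # average of the context embeddings
    scores = hidden @ W_out              # one raw score per vocabulary word
    exp = np.exp(scores - scores.max())  # numerically stable softmax
    return exp / exp.sum()

probs = cbow_forward(["quick", "fox"])
print(probs.shape, probs.sum())
```

Averaging the context vectors is what makes the model a "bag" of words: the order of the context words does not affect the prediction.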

The CBOW model is trained using a supervised learning algorithm. The training data consists of pairs of (context_window, target_word) where the context_window is a set of words surrounding the target word. The model is trained to predict the target word given the context_window.

CBOW is a simple and efficient model that can be trained on large datasets. It is a good choice for text classification and natural language understanding tasks.

Here is an example of how CBOW works:

The quick brown fox jumps over the lazy dog.

With a context window of two words on each side, the context_window for the word “fox” would be the words “quick”, “brown”, “jumps”, and “over”. The CBOW model would be trained to predict the word “fox” given the context_window.
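As a small illustration, the context of a target word can be extracted with list slicing. This sketch assumes a window of two words on each side; the helper name `context_for` is hypothetical.

```python
def context_for(tokens, index, window):
    """Return up to `window` tokens on each side of tokens[index]."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right

tokens = "the quick brown fox jumps over the lazy dog".split()
target_index = tokens.index("fox")
print(context_for(tokens, target_index, window=2))
# prints ['quick', 'brown', 'jumps', 'over']
```

Near the start or end of a sentence the window is simply truncated, so the first word only has context on its right.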

Repeating this prediction task over every position in a large corpus is what shapes the embeddings: words that appear in similar contexts end up with similar vectors.

How to train a Continuous Bag-of-Words (CBOW) model

Training the Continuous Bag-of-Words (CBOW) model is crucial in obtaining word embeddings that can effectively capture semantic relationships in a given corpus. In this section, we will explore the training process of the CBOW model, including data preprocessing, building the context window, creating input-output pairs, defining the neural network architecture, and optimizing the model’s parameters.

  1. Data Preprocessing: Before training the CBOW model, we need to preprocess the text data. The preprocessing steps typically involve tokenization, removing punctuation, converting text to lowercase, and handling special characters. Additionally, we may remove stopwords (common words with little semantic value) and perform stemming or lemmatization to reduce word variations to their base forms.
  2. Building the Context Window: The central idea behind CBOW is to predict a target word based on its surrounding context words. To achieve this, we define a context window size, which determines the number of words on either side of the target word that will be considered context words. A larger context window allows the model to capture more context but may lead to increased computational overhead.
  3. Creating Input-Output Pairs: Once the context window is defined, we slide it over the preprocessed text data. We extract the context words within the window for each target word to create input-output pairs. For example, if the context window size is set to 1 (one word on either side of the target), and the sentence is “The quick brown fox jumps,” the input-output pairs would be:

    – Input: [The, brown] Output: quick
    – Input: [quick, fox] Output: brown
    – Input: [brown, jumps] Output: fox
  4. Defining the Neural Network Architecture: With the input-output pairs ready, we can define the CBOW neural network architecture. The architecture usually consists of an input layer, a hidden layer, and an output layer. Each word in the context window is represented as a one-hot encoded vector in the input layer. The hidden layer acts as the embedding layer, where the word representations are learned, and the output layer predicts the target word.
  5. Training the CBOW Model: The training process involves feeding the input-output pairs into the CBOW model and adjusting the model’s parameters to minimize the prediction error. Common optimization algorithms, such as stochastic gradient descent (SGD) or Adam, update the model’s weights during training. The optimization process aims to find the word embeddings that best capture the semantic relationships between words in the corpus.
  6. Softmax Activation Function: The output layer of the CBOW model typically employs the softmax activation function. Softmax converts the raw output scores into a probability distribution, allowing the model to predict the most likely word for a given context. The target word’s one-hot encoded vector is compared to the predicted probability distribution, and the error is backpropagated through the network to update the model’s parameters.
  7. Training Epochs and Batch Size: During training, we iterate over the input-output pairs multiple times, known as epochs. The number of epochs determines how often the training dataset is processed. Additionally, input-output pairs are usually divided into batches to accelerate training and utilize parallelism. The batch size is a hyperparameter that controls the number of samples processed in each training step.
  8. Evaluation during Training: To monitor the CBOW model’s training progress and prevent overfitting, it is essential to evaluate the model’s performance on a validation set during training. The validation set contains input-output pairs that are distinct from the training set. By assessing the model’s performance on this set, we can determine if it is generalizing well to unseen data and whether it is appropriate to stop training or make adjustments.
  9. Hyperparameter Tuning: As mentioned previously, CBOW has several hyperparameters, including the context window size, embedding dimension, learning rate, and batch size. Hyperparameter tuning involves systematically experimenting with different combinations of hyperparameters to find the optimal configuration that yields the best performance on the validation set.
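The training steps above can be sketched end to end in plain NumPy. This is a minimal illustration under stated assumptions, not an efficient implementation: it uses a one-sentence toy corpus, a window of one word on each side, plain SGD with a batch size of one, and hypothetical names (`train_epoch`, `W_in`, `W_out`). It also skips the validation split that a real training run would need.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(tokens))
word_to_id = {w: i for i, w in enumerate(vocab)}
window, D, V = 1, 8, len(vocab)  # toy hyperparameters

# Build (context ids, target id) pairs for positions with a full window.
pairs = []
for i in range(window, len(tokens) - window):
    context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
    pairs.append(([word_to_id[w] for w in context], word_to_id[tokens[i]]))

W_in = rng.normal(scale=0.1, size=(V, D))   # embeddings (input -> hidden)
W_out = rng.normal(scale=0.1, size=(D, V))  # hidden -> output scores

def train_epoch(pairs, W_in, W_out, lr=0.2):
    """One pass of plain SGD over all pairs; returns the mean cross-entropy loss."""
    total = 0.0
    for context_ids, target_id in pairs:
        hidden = W_in[context_ids].mean(axis=0)   # average context embeddings
        scores = hidden @ W_out
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                      # softmax
        total += -np.log(probs[target_id])        # cross-entropy for this pair
        # Gradient of the loss w.r.t. the scores: probs - one_hot(target)
        d_scores = probs.copy()
        d_scores[target_id] -= 1.0
        d_hidden = W_out @ d_scores               # computed before W_out changes
        W_out -= lr * np.outer(hidden, d_scores)
        # Each context word receives an equal share of the hidden gradient.
        for c in context_ids:
            W_in[c] -= lr * d_hidden / len(context_ids)
    return total / len(pairs)

losses = [train_epoch(pairs, W_in, W_out) for _ in range(200)]
print(f"first epoch loss: {losses[0]:.3f}, last epoch loss: {losses[-1]:.3f}")
```

After training, the rows of `W_in` are the learned word embeddings. Real implementations process pairs in batches and usually replace the full softmax with a cheaper approximation.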

Training the CBOW model involves preprocessing the text data, creating input-output pairs, defining the neural network architecture, and optimizing the model’s parameters using an optimization algorithm. By training the CBOW model on a large corpus of text data, we obtain word embeddings that capture the contextual relationships between words, empowering us to leverage these embeddings for various downstream NLP tasks, such as word similarity, text classification, and sentiment analysis. The success of the CBOW model lies in its ability to efficiently produce meaningful word representations, facilitating better language understanding and enhancing the performance of NLP applications.

How can you evaluate word embeddings?

Word embedding evaluation is crucial in assessing the quality and effectiveness of word embeddings generated by models like Continuous Bag-of-Words (CBOW). The evaluation aims to determine how well the word embeddings capture semantic relationships, word similarities, and other linguistic properties. This section will explore common evaluation methods for word embeddings, including similarity tasks, analogy tasks, and word clustering.

1. Word Similarity Tasks

Word similarity tasks evaluate how well word embeddings represent the semantic similarity between pairs of words. The evaluation typically involves a set of word pairs, each associated with a human-assigned similarity score. These similarity scores can be obtained from human judgments or datasets where human subjects rate the similarity between words.

To evaluate word embeddings using word similarity tasks, the following steps are commonly followed:

a. Compute Similarity Scores: Calculate the cosine similarity or other distance metrics between the embeddings of each word pair in the evaluation set.

b. Correlation Analysis: Compare the computed and human-assigned similarity scores. Common correlation metrics used are Pearson correlation or Spearman rank correlation. Higher correlation values indicate that the word embeddings capture semantic similarity well.
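The two steps can be sketched as follows. The embeddings and the human ratings here are made up for illustration, and the rank function is a simplification that assumes no tied values.

```python
import numpy as np

# Made-up 3-d embeddings and made-up human ratings, for illustration only.
embeddings = {
    "cat":   np.array([0.9, 0.1, 0.0]),
    "dog":   np.array([0.8, 0.2, 0.1]),
    "car":   np.array([0.1, 0.9, 0.2]),
    "truck": np.array([0.0, 0.8, 0.3]),
}
human_scores = {
    ("cat", "dog"): 9.0,
    ("car", "truck"): 8.5,
    ("cat", "car"): 2.0,
    ("dog", "truck"): 1.5,
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def ranks(values):
    """Rank positions (0 = smallest); assumes there are no tied values."""
    return np.argsort(np.argsort(values))

word_pairs = list(human_scores)
model = np.array([cosine(embeddings[a], embeddings[b]) for a, b in word_pairs])
human = np.array([human_scores[p] for p in word_pairs])

# Spearman correlation is the Pearson correlation of the ranks.
rho = np.corrcoef(ranks(model), ranks(human))[0, 1]
print(f"Spearman correlation: {rho:.2f}")
```

For real evaluation sets, which often contain ties, `scipy.stats.spearmanr` is a more robust choice than this hand-rolled rank function.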

2. Analogy Tasks

Analogy tasks evaluate the model’s ability to perform linguistic reasoning by completing analogies of the form “A is to B as C is to ?”. The evaluation dataset consists of analogy questions, such as “king is to queen as man is to ?” or “Paris is to France as Rome is to ?”. The goal is to find the word whose embedding is closest to the vector obtained by arithmetic on the other three: for “A is to B as C is to ?”, the answer should lie near B - A + C (e.g., king - man + woman ≈ queen).

The steps to evaluate word embeddings using analogy tasks are as follows:

a. Find Analogies: For each analogy question, compute the vector differences between the word embeddings and identify the word whose embedding is closest to the resulting vector.

b. Accuracy Analysis: Measure the model’s accuracy in answering the analogy questions correctly. The higher the accuracy, the better the embeddings capture linguistic relationships and analogies.
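A minimal sketch of answering one analogy question is shown below. The 2-d vectors are hand-made so that one axis loosely stands for "royalty" and the other for "gender"; real embeddings are learned and far less tidy. The function name `solve_analogy` is hypothetical.

```python
import numpy as np

# Hand-made toy vectors, for illustration only.
embeddings = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}

def solve_analogy(a, b, c, embeddings):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c."""
    query = embeddings[b] - embeddings[a] + embeddings[c]
    best_word, best_score = None, -np.inf
    for word, vec in embeddings.items():
        if word in (a, b, c):  # the question words are excluded as answers
            continue
        score = vec @ query / (np.linalg.norm(vec) * np.linalg.norm(query))
        if score > best_score:
            best_word, best_score = word, score
    return best_word

print(solve_analogy("king", "queen", "man", embeddings))   # prints woman
print(solve_analogy("man", "woman", "king", embeddings))   # prints queen
```

Accuracy on an analogy dataset is then the fraction of questions for which the returned word matches the expected answer.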

3. Word Clustering

Word clustering evaluation assesses how well word embeddings group similar words together. Clustering similar words is an essential property for word embeddings, indicating that the embeddings capture meaningful semantic relationships.

To evaluate word embeddings using word clustering, the process typically involves:

a. Clustering: Apply a clustering algorithm, such as k-means or hierarchical clustering, to group word embeddings based on their similarities.

b. Cluster Validation: Evaluate the quality of the clusters using metrics like silhouette score or Davies-Bouldin index. A higher silhouette score indicates better-separated clusters, whereas for the Davies-Bouldin index lower values are better.
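The clustering and validation steps can be sketched with a small NumPy-only example. The 2-d word vectors are made up, and the k-means routine is deliberately simplified: it starts from the first `k` points instead of using a proper initialisation and assumes at least two clusters.

```python
import numpy as np

# Made-up 2-d embeddings: animals near (1, 0), vehicles near (0, 1).
words = ["cat", "car", "dog", "truck", "horse", "bus"]
X = np.array([[1.0, 0.1], [0.1, 1.0], [0.9, 0.2],
              [0.2, 0.9], [1.1, 0.0], [0.0, 1.1]])

def kmeans(X, k, n_iter=20):
    """Simplified k-means: the first k points serve as initial centroids."""
    centroids = X[:k].copy()
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)              # assign to nearest centroid
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

def silhouette(X, labels):
    """Mean silhouette score; values close to 1 mean well-separated clusters."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                           # exclude the point itself
        if not same.any():                        # singleton cluster scores 0
            scores.append(0.0)
            continue
        a = dist[i, same].mean()                  # mean distance within cluster
        b = min(dist[i, labels == c].mean()       # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

labels, _ = kmeans(X, k=2)
print(dict(zip(words, labels.tolist())))
print(f"silhouette score: {silhouette(X, labels):.2f}")
```

In practice, scikit-learn's `KMeans`, `silhouette_score`, and `davies_bouldin_score` cover the same steps with better initialisation and edge-case handling.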

Word embedding evaluation is an iterative process, and researchers may use different evaluation datasets and tasks depending on their specific goals. Additionally, evaluating word embeddings is not a one-size-fits-all approach, as the effectiveness of embeddings may vary based on the underlying corpus, domain, and specific NLP tasks.

What applications use Continuous Bag-of-Words (CBOW) in Natural Language Processing?

The CBOW model finds diverse applications across various NLP tasks. CBOW can calculate the similarity between word pairs in word similarity and relatedness tasks, aiding in information retrieval and question-answering systems. Additionally, CBOW plays a significant role in text classification and sentiment analysis tasks, where understanding the sentiment of a piece of text is essential for numerous applications, including social media monitoring and customer feedback analysis.

Furthermore, CBOW embeddings are useful as input features for Named Entity Recognition (NER) and part-of-speech tagging, where they help models identify entities and their corresponding categories in unstructured text.
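One common way to use word embeddings in tasks such as text classification or sentiment analysis is to average the vectors of the words in a text and feed the result to a classifier. The sketch below assumes made-up 3-d vectors and a hypothetical helper, `sentence_vector`; it shows only the feature-building step, not the classifier.

```python
import numpy as np

# Made-up 3-d embeddings, for illustration only.
embeddings = {
    "great":    np.array([1.0, 0.0, 0.0]),
    "movie":    np.array([0.0, 1.0, 0.0]),
    "terrible": np.array([-1.0, 0.0, 0.0]),
}

def sentence_vector(sentence, embeddings, dim=3):
    """Average the vectors of known words; unknown words are skipped."""
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vectors:
        return np.zeros(dim)  # no known words: fall back to a zero vector
    return np.mean(vectors, axis=0)

print(sentence_vector("Great movie", embeddings))
print(sentence_vector("Terrible movie", embeddings))
```

The resulting fixed-length vectors can be passed to any standard classifier, for example logistic regression.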

Continuous Bag-of-Words (CBOW) vs Other Word Embedding Models

CBOW vs skip-gram

CBOW and skip-gram are two neural network models used to learn word embeddings. Word embeddings are distributed representations of words that capture the semantic and syntactic relationships between words.


  • Continuous bag-of-words (CBOW) is a neural network model that predicts a target word given the context words in a sentence.
  • The CBOW model is a shallow neural network with three layers: an input layer, a hidden layer, and an output layer.
  • The input layer represents the context words in a sentence.
  • The hidden layer learns the word embeddings.
  • The output layer predicts the target word.


  • Continuous skip-gram (skip-gram) is a neural network model that predicts the context words given the target word.
  • The skip-gram model is also a shallow neural network with three layers: an input layer, a hidden layer, and an output layer.
  • The input layer represents the target word.
  • The hidden layer learns the word embeddings.
  • The output layer predicts the context words.


The main difference between CBOW and skip-gram is the way they predict words. CBOW predicts the target word given the context words, while skip-gram predicts the context words given the target word.
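The difference shows up directly in the training examples each model builds from the same text. In this sketch, the function names are hypothetical and the window is truncated at sentence edges, which is one of several possible conventions.

```python
def context_of(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i]."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

def cbow_examples(tokens, window):
    """One (context words, target) example per position."""
    return [(context_of(tokens, i, window), target)
            for i, target in enumerate(tokens)
            if context_of(tokens, i, window)]

def skipgram_examples(tokens, window):
    """One (target, context word) example per target-context pair."""
    return [(target, c)
            for i, target in enumerate(tokens)
            for c in context_of(tokens, i, window)]

tokens = "the quick brown fox".split()
cbow = cbow_examples(tokens, window=1)
skip = skipgram_examples(tokens, window=1)
print(len(cbow), cbow[1])   # prints 4 (['the', 'brown'], 'quick')
print(len(skip), skip[1])   # prints 6 ('quick', 'the')
```

CBOW produces one example per position, while skip-gram produces one per target-context pair, so skip-gram performs more prediction steps over the same text.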

[Figure: CBOW vs skip-gram architectures. Source: Exploiting Similarities among Languages for Machine Translation]

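CBOW is generally considered more efficient than skip-gram, as it makes a single prediction per context window, whereas skip-gram makes one prediction for every context word. However, skip-gram can learn more fine-grained word representations, particularly for infrequent words, because each (target, context) pair is a separate training example rather than being averaged with the rest of the window.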

Which one to use?

The choice of which model to use depends on the specific task. If efficiency is essential, then CBOW may be a better choice. If fine-grained word representations are meaningful, skip-gram may be a better choice.

CBOW is generally a good choice for text classification and natural language understanding tasks, while skip-gram is often preferred for natural language generation and machine translation tasks.

Here is a table summarizing the differences between CBOW and skip-gram:

Aspect | CBOW | Skip-gram
Predicts | Target word given context words | Context words given target word
Efficiency | More efficient | Less efficient
Fine-grained word representations | Less fine-grained | More fine-grained
Task | Text classification, natural language understanding | Natural language generation, machine translation

How can you overcome the challenges and limitations associated with the CBOW model?

Continuous Bag-of-Words (CBOW) is a powerful word embedding technique; however, it does have some challenges and limitations. This section will explore these challenges and discuss strategies to overcome them.

  1. Handling Out-of-Vocabulary (OOV) Words: One of the primary challenges of CBOW is dealing with words that are not present in the training corpus (OOV words). Since CBOW learns one fixed embedding for each word in its vocabulary, it has no representation for words that were unseen during training. To address this issue, researchers often use subword embeddings, such as FastText, which can handle unseen words by composing their embeddings from subword units. This way, even if a word is OOV, its subword components can still contribute to its representation.
  2. Polysemy and Homonymy: Polysemy refers to words that have multiple meanings, while homonymy refers to different words that share the same form. CBOW treats each word as a single entity and does not differentiate between different word senses. This limitation can lead to ambiguous word embeddings that do not capture the subtle nuances of word meanings. To tackle this challenge, contextualized word embedding models (e.g., BERT) have been developed, which learn representations based on the surrounding context of each word occurrence. Such models capture different meanings of a word based on its context, addressing the issue of polysemy and homonymy.
  3. Data Sparsity: CBOW, like other neural network models, requires a large amount of data to learn meaningful representations effectively. For languages with limited resources or specialized domains with scarce data, obtaining high-quality word embeddings with CBOW can be challenging. To mitigate data sparsity, researchers may resort to transfer learning, fine-tuning embeddings pre-trained on larger corpora on their specific dataset or domain. This approach leverages the knowledge captured by embeddings from the broader domain to improve performance in the target domain.
  4. Context Window Size Selection: The choice of context window size is critical in CBOW. A small window may not provide sufficient context information, while a large window may introduce noise and dilute the relevant information. The optimal context window size often depends on the specific NLP task and the dataset’s characteristics. Experimentation and hyperparameter tuning are essential to finding the appropriate context window size for a particular application.
  5. Scalability: CBOW, like other neural network-based models, can be computationally intensive, especially for large vocabularies and datasets. Training a CBOW model on massive text corpora may require significant computational resources and time. To overcome this challenge, researchers may consider using libraries such as Gensim, whose optimized Word2Vec implementation can train CBOW across multiple CPU cores. Approximations of the softmax, such as negative sampling and hierarchical softmax, also help by avoiding a score computation for the whole vocabulary at every training step.
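To illustrate the subword idea from the first item, the sketch below composes a vector for any word from its character n-grams. It is a simplified, FastText-style illustration only: the n-gram table is random rather than trained, only 3-grams are used, and the names `char_ngrams`, `subword_vector`, `NUM_BUCKETS`, and `DIM` are hypothetical. It also assumes non-empty words.

```python
import zlib
import numpy as np

def char_ngrams(word, n=3):
    """Character n-grams of the word wrapped in the boundary markers < and >."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n] for i in range(len(wrapped) - n + 1)]

rng = np.random.default_rng(0)
NUM_BUCKETS, DIM = 1000, 8
# Random stand-in for a trained n-gram embedding table.
ngram_table = rng.normal(size=(NUM_BUCKETS, DIM))

def subword_vector(word):
    """Average the vectors of the word's n-grams, hashed into the table."""
    ids = [zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in char_ngrams(word)]
    return ngram_table[ids].mean(axis=0)

print(char_ngrams("fox"))             # prints ['<fo', 'fox', 'ox>']
print(subword_vector("foxes").shape)  # prints (8,)
```

Because "foxes" shares the n-grams "<fo" and "fox" with "fox", the two words get related vectors even if "foxes" never appeared in the training corpus.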

Pre-trained Continuous Bag-of-Words (CBOW) embeddings

Pre-trained CBOW embeddings are word representations that have been pre-computed using the Continuous Bag-of-Words (CBOW) model on large-scale text corpora. These embeddings capture semantic relationships between words and are trained on vast amounts of text data, making them valuable resources for various Natural Language Processing (NLP) tasks. Pre-trained CBOW embeddings serve as a starting point for NLP projects, providing a foundation for word representations without the need to train a model from scratch.

Advantages of Pre-trained CBOW Embeddings

  1. Generalization: Pre-trained CBOW embeddings are trained on extensive and diverse text corpora, enabling them to capture general semantic relationships and context across different domains and languages. As a result, they can be effectively applied to a wide range of NLP tasks.
  2. Dimensionality Reduction: Word embeddings significantly reduce the dimensionality of the word space. Pre-training the CBOW model allows researchers to obtain word embeddings in a lower-dimensional space while preserving essential semantic information.
  3. Time and Resource Efficiency: Training word embeddings from scratch can be computationally expensive and time-consuming, especially on large datasets. Pre-trained embeddings save time and computational resources, allowing researchers to focus on specific NLP tasks.
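  4. OOV Handling: Out-of-vocabulary (OOV) words, i.e., words not seen during training, can be problematic for models trained from scratch. Because pre-trained embeddings are learned from very large corpora, their vocabularies cover many words that a small task-specific dataset would never contain, which reduces the number of OOV words. Words that are missing from the pre-trained vocabulary still need a fallback, such as subword-based embeddings.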

Popular Pre-trained CBOW Embeddings

  1. Word2Vec: Word2Vec is the toolkit, developed at Google, that introduced the CBOW and skip-gram models. Its pre-trained vectors, trained on a large Google News corpus, are 300-dimensional and have been widely used in various NLP applications.
  2. GloVe: Global Vectors for Word Representation (GloVe) is another popular pre-trained word embedding technique. Unlike Word2Vec, GloVe is not a CBOW model: it learns word embeddings from global word co-occurrence statistics rather than from a sliding prediction window. GloVe embeddings are available in various dimensions (e.g., 50, 100, 200, 300) and are known for their efficiency and ability to capture global context.
  3. FastText: FastText is an extension of Word2Vec that considers subword information. It represents each word through the embeddings of its character n-grams, which helps handle morphologically rich languages and OOV words more effectively.

Using Pre-trained CBOW Embeddings

Integrating pre-trained CBOW embeddings into an NLP project is straightforward. You can simply load the pre-trained embeddings, map the words in your vocabulary to the pre-trained embedding space, and use the resulting vectors as inputs for downstream NLP tasks. However, fine-tuning pre-trained embeddings on a domain-specific corpus is often recommended to adapt the embeddings to the specific task, domain, or context.
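The mapping step can be sketched as building an embedding matrix whose rows follow your own word indices. The `pretrained` dictionary below is a made-up stand-in for vectors loaded from a file, and `build_embedding_matrix` is a hypothetical helper; words missing from the pre-trained vocabulary get small random vectors here, which is only one possible fallback.

```python
import numpy as np

# Made-up stand-in for vectors loaded from a pre-trained embedding file.
pretrained = {
    "the":   np.array([0.1, 0.2, 0.3]),
    "quick": np.array([0.4, 0.5, 0.6]),
}
# Your own vocabulary; index 0 is reserved for padding.
word_index = {"the": 1, "quick": 2, "zyzzyva": 3}

def build_embedding_matrix(word_index, pretrained, dim, seed=0):
    """Return a (vocab_size + 1, dim) matrix and the list of missing words."""
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(word_index) + 1, dim))
    missing = []
    for word, idx in word_index.items():
        if word in pretrained:
            matrix[idx] = pretrained[word]
        else:
            matrix[idx] = rng.normal(scale=0.1, size=dim)  # fallback for OOV
            missing.append(word)
    return matrix, missing

matrix, missing = build_embedding_matrix(word_index, pretrained, dim=3)
print(matrix.shape, missing)  # prints (4, 3) ['zyzzyva']
```

Such a matrix could then initialise an embedding layer, for instance through a constant initializer in Keras, and be either frozen or fine-tuned during training.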

How to implement Continuous Bag-of-Words (CBOW) with Python and TensorFlow

Implementing Continuous Bag-of-Words (CBOW) with Python involves setting up the environment, preparing the data, creating the CBOW neural network architecture, training the model, and evaluating its performance. Below is a step-by-step guide to implementing CBOW using Python and TensorFlow, one of the popular deep learning frameworks for NLP.

1. Set Up the Environment: Make sure Python and TensorFlow are installed. You can install TensorFlow using pip:

pip install tensorflow 

2. Prepare the Data: Load your text corpus and preprocess it. Tokenize the sentences, remove punctuation, convert text to lowercase, and create a vocabulary with unique words. Assign an index to each word in the vocabulary.

import tensorflow as tf
# Tokenizer is a legacy Keras utility; newer releases recommend the
# TextVectorization layer instead.
from tensorflow.keras.preprocessing.text import Tokenizer

# Sample corpus
corpus = [
    "the quick brown fox jumps",
    "over the lazy dog",
    "hello world",
    # Add more sentences as needed
]

# Tokenize and create vocabulary
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)  # builds word_index from the corpus
word_index = tokenizer.word_index
vocab_size = len(word_index) + 1  # +1 because index 0 is reserved for padding

3. Create Input-Output Pairs: For CBOW, create input-output pairs by sliding a context window over the sentences. The context window size determines the number of words on either side of the target word to be considered context words.

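import numpy as np

context_window = 2

def generate_data(corpus, context_window, tokenizer):
    """Build (context, target) index pairs by sliding a window over each sentence."""
    sequences = tokenizer.texts_to_sequences(corpus)
    X, y = [], []
    for sequence in sequences:
        # Only positions with a full window on both sides are used, so sentences
        # shorter than 2 * context_window + 1 words yield no pairs.
        for i in range(context_window, len(sequence) - context_window):
            context = sequence[i - context_window : i] + sequence[i + 1 : i + context_window + 1]
            target = sequence[i]
            X.append(context)
            y.append(target)
    return np.array(X), np.array(y)

X_train, y_train = generate_data(corpus, context_window, tokenizer)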

4. Create CBOW Model Architecture: Define the CBOW neural network architecture using TensorFlow. The model consists of an embedding layer, followed by an average pooling layer, and a dense output layer.

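embedding_dim = 100

model = tf.keras.Sequential([
    # Each sample holds the indices of the 2 * context_window context words
    tf.keras.Input(shape=(context_window * 2,)),
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim),
    # Average the context word embeddings into a single vector
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])

# metrics=['accuracy'] makes fit() and evaluate() report accuracy alongside the loss
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])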

5. Train the CBOW Model: Train the CBOW model using the prepared input-output pairs.

epochs = 50
batch_size = 16

model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size)

6. Evaluate the CBOW Model: After training, you can evaluate the CBOW model’s performance on word similarity tasks, analogy tasks, or any other specific NLP evaluation task.

# Perform evaluation on test data if available. X_test and y_test are assumed to
# be held-out pairs built from a separate test corpus with generate_data(...).
results = model.evaluate(X_test, y_test, batch_size=batch_size, return_dict=True)
print(f"Test results: {results}")

Implementing Continuous Bag-of-Words (CBOW) with Python and TensorFlow involves data preprocessing, defining the CBOW neural network architecture, training the model, and evaluating its performance. Following these steps, you can create word embeddings using CBOW and utilize them for various NLP tasks, such as word similarity, sentiment analysis, and text classification. Remember to adjust hyperparameters, context window size, and other settings based on your specific NLP task and dataset for optimal results.


The Continuous Bag-of-Words (CBOW) model is a powerful word embedding technique that significantly contributes to various Natural Language Processing tasks. By grasping its theoretical foundations, exploring its practical implementation, and understanding its strengths and limitations, we can unlock the full potential of CBOW for improving language understanding, information retrieval, and other AI applications. As NLP research advances, CBOW and other word embedding models will continue to evolve, empowering machines to comprehend and interact with human language more effectively.

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.
