Word embedding is the term used in natural language processing (NLP) for the way words are represented for text analysis. Typically, this representation takes the form of a real-valued vector that encodes the meaning of the word. The expectation is that words close to one another in the vector space have similar meanings.
Word embeddings are created by mapping vocabulary words or phrases to vectors of real numbers. This is done using various language modelling and feature-learning techniques. Neural networks, dimensionality reduction on the word co-occurrence matrix, probabilistic models, and explicit representation in terms of word context are some of the techniques used to create this mapping.
Using word and phrase embeddings as the underlying input representation has been demonstrated to improve performance in many NLP tasks. As a result, word embedding is widely regarded as one of the key breakthroughs in the field. Combined with deep learning, it has allowed us to solve much more challenging NLP problems.
Word embedding adds context to words for better automatic language understanding applications.
Word embedding is one of the top ten most used NLP techniques.
What is word embedding?
A word embedding is a learned representation of text in which words with similar meanings have similar representations. This way of representing words and documents is one of the significant advances that deep learning has brought to complex natural language processing problems.
Individual words are represented as real-valued vectors in a predefined vector space. The technique is often grouped with deep learning because each word is mapped to a single vector and the vector values are learned in a way that resembles training a neural network.
A dense, distributed representation for each word is central to the method. Each word is represented by a real-valued vector, often with tens or hundreds of dimensions. In contrast, sparse word representations, such as one-hot encoding or TF-IDF, require thousands or millions of dimensions.
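The contrast between sparse and dense representations can be made concrete with a small sketch. The three-word vocabulary and the dense values below are hand-picked for illustration and are not learned from any data.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vocab = ["cat", "dog", "car"]

# Sparse: one dimension per vocabulary word, with a single 1 in each vector.
one_hot = {
    w: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, w in enumerate(vocab)
}

# Dense: illustrative, hand-picked values (an assumption, not trained vectors).
dense = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

# Distinct one-hot vectors are orthogonal, so every pair looks equally unrelated.
print(cosine(one_hot["cat"], one_hot["dog"]))

# Dense vectors can express that "cat" is closer to "dog" than to "car".
print(cosine(dense["cat"], dense["dog"]) > cosine(dense["cat"], dense["car"]))
```

With a real vocabulary the one-hot vectors would have one dimension per word, while the dense vectors would keep a fixed, much smaller size.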
The distributed representation is learned from how words are used. As a result, words that are used in similar ways end up with similar representations that naturally capture their meaning. Compare this with the “bag of words” model, where different words have different representations regardless of how they are used.
What are the 3 main word embedding algorithms?
Word2Vec is a statistical technique for efficiently learning a standalone word embedding from a text corpus. It was created by Tomas Mikolov and colleagues at Google in 2013 in an effort to make neural-network-based embedding training more efficient. It has since become a de facto standard. The work also included analysing the learned vectors and investigating how vector arithmetic can be applied to word representations.
A typical example used to explain word vectors is the phrase, “the king is to the queen as a man is to a woman.” If we take the male gender out of the word “king” and add the female gender, we would arrive at the word “queen.” In this way, we can start to reason with words through the relationships that they hold in regard to other words.
“The king is to the queen as a man is to a woman.” – Word2Vec understands the context.
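The vector arithmetic behind this example can be sketched with toy vectors. The two-dimensional values below are hand-made assumptions (one axis loosely standing for "royalty", the other for "gender"). Real embeddings are learned and have many more dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical toy vectors: axis 0 ~ "royalty", axis 1 ~ "gender".
vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, -0.9],
    "man":   [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.8, 0.0],
}

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# king - man + woman lands nearest to "queen" in this toy space.
print(analogy("king", "man", "woman"))
```

Excluding the input words from the candidates is a common convention, since the nearest vector to the result is otherwise often one of the inputs.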
Word2Vec introduced two learning models that can be used to learn the word embedding:
- Continuous Bag-of-Words (CBOW) model
- Continuous Skip-Gram Model
The CBOW model learns the embedding by predicting the current word from its context, that is, from the words surrounding it. The continuous skip-gram model works the other way round: it learns the embedding by predicting the surrounding words from the current word.
Both models learn about words from their local usage context, which is defined by a window of neighbouring words. The size of this window is a model parameter that can be adjusted for a given use case.
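The difference between the two models is easiest to see in the training examples each one generates. The sketch below only builds those examples and does not train anything. The function names and the sample sentence are made up for illustration.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding tokens[i] itself."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right

def cbow_examples(tokens, window=2):
    """(context words, target word): predict the centre word from its context."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window=2):
    """(centre word, context word): predict each surrounding word from the centre."""
    return [(tokens[i], c)
            for i in range(len(tokens))
            for c in context_window(tokens, i, window)]

tokens = "the quick brown fox jumps".split()

# CBOW: the neighbours of "quick" are the input, "quick" is the label.
print(cbow_examples(tokens, window=1)[1])

# Skip-gram: each centre word is paired with one neighbour at a time.
print(skipgram_examples(tokens, window=1)[:3])
```

Increasing `window` makes each word's context wider, which is the adjustable parameter mentioned above.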
The main advantage of the method is that high-quality word embeddings can be learned efficiently. This makes it practical to learn larger embeddings (more dimensions) from much larger corpora of text containing billions of words.
The Global Vectors for Word Representation (GloVe) algorithm, developed by Pennington and colleagues at Stanford, is often described as an extension of the word2vec approach. GloVe is based on word-context matrix factorisation techniques. It first builds a large matrix of (words x contexts) co-occurrence data, in which, for each “word” (the rows), you count the number of times it appears in a particular “context” (the columns).
Naturally, there are many “contexts” because their number is essentially combinatorial. When this matrix is factorised, a lower-dimensional (words x features) matrix is produced, with each row providing a vector representation for the corresponding word. Typically, this is accomplished by minimising a “reconstruction loss.” This loss looks for lower-dimensional representations that can account for most of the variance in the high-dimensional data.
Rather than learning only from a local context window, as Word2Vec does, GloVe builds an explicit word-context (word co-occurrence) matrix using statistics gathered across the entire text corpus. The result is a learning model that may lead to better word embeddings.
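The counting step can be sketched as follows. This is a minimal illustration under two assumptions: the three-sentence corpus is made up, and the "context" of a word is taken to be every other word in the same sentence. The factorisation step is not shown. GloVe itself fits word vectors to the logarithm of these counts with a weighted least-squares objective.

```python
from collections import Counter
from itertools import permutations

# Hypothetical tokenised corpus.
corpus = [
    "ice is cold".split(),
    "steam is hot".split(),
    "ice is solid".split(),
]

cooc = Counter()
for sentence in corpus:
    # Assumption: every other word in the sentence counts as context.
    for word, context in permutations(sentence, 2):
        cooc[(word, context)] += 1

# Rows are words, columns are contexts, both in sorted vocabulary order.
vocab = sorted({w for s in corpus for w in s})
matrix = [[cooc[(w, c)] for c in vocab] for w in vocab]

print(vocab)
for w, row in zip(vocab, matrix):
    print(f"{w:>6}", row)
```

The counts are accumulated over the whole corpus before any vectors are learned, which is the global statistic the paragraph above refers to.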
FastText, which is essentially an extension of the word2vec model, treats each word as being made up of character n-grams. The vector for a word is therefore the sum of the vectors of its character n-grams. For instance, the word vector for “orange” is the sum of the n-gram vectors:
"<or", "ora", "oran", "orang", "orange", "orange>", "ran", "rang", "range", "range>", "ang", "ange", "ange>", "nge", "nge>", "ge", "ge>"
The use of n-grams is the primary distinction between FastText and Word2Vec.
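A minimal sketch of how a word can be split into character n-grams is shown below. It wraps the word in the boundary markers "<" and ">" and extracts every n-gram with a length from 3 to 6. That range is an assumption chosen for this sketch, so the output will not match the illustrative list above item for item.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of '<word>' for every n between min_n and max_n."""
    marked = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    return [
        marked[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(marked) - n + 1)
    ]

grams = char_ngrams("orange")
print(grams[:6])   # the 3-grams come first
print(len(grams))  # total number of n-grams for lengths 3 to 6
```

Each of these n-grams would have its own vector, and the word vector is built by combining them.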
Word2Vec learns vectors only for complete words found in the training corpus. In contrast, FastText learns vectors for the n-grams found within each word as well as for each complete word. At each training step, the mean of the target word vector and its n-gram component vectors is used.
The adjustment calculated from the error is then applied uniformly to each of the vectors that were combined to create the target. Because every word has to sum and average its n-gram components at each step, this adds a significant amount of computation to the training phase.
On a range of evaluation measures, these vectors have been reported to be more accurate than Word2Vec vectors.
The most notable benefit of FastText's n-gram feature is that it addresses the OOV (out-of-vocabulary) problem. For instance, the word “aquarium” can be broken down into “<aq/aqu/qua/uar/ari/riu/ium/um>,” where “<” and “>” denote the beginning and end of the word, respectively. The word embedder may never have seen the word “Aquarius,” but it can still infer something about its meaning, because “aquarium” and “Aquarius” share a common root and therefore many n-grams.
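The idea can be sketched as follows. The n-gram vectors here are random placeholders standing in for vectors that would have been learned from a corpus containing “aquarium”. The 3-gram length, the dimension, and the function names are all assumptions made for this illustration.

```python
import random

def char_ngrams(word, n=3):
    """Character n-grams of the lowercased word wrapped in '<' and '>'."""
    marked = f"<{word.lower()}>"
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]

random.seed(0)
DIM = 8

# Placeholder vectors: pretend these were learned for the n-grams of "aquarium".
ngram_vectors = {
    g: [random.uniform(-1, 1) for _ in range(DIM)]
    for g in char_ngrams("aquarium")
}

def embed(word):
    """Average the vectors of the word's known n-grams, or None if none are known."""
    known = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]

# "Aquarius" was never seen, but it shares several n-grams with "aquarium".
shared = sorted(set(char_ngrams("Aquarius")) & set(ngram_vectors))
print(shared)
print(embed("Aquarius") is not None)

# A word with no known n-grams still cannot be embedded.
print(embed("xyz"))
```

A word-level model such as Word2Vec would have no vector at all for the unseen word, whereas the n-gram approach can still produce one from the overlapping pieces.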
How to use word embeddings in your projects
As discussed above, there are three popular word embedding algorithms. Which one you should use depends on your use case and on the data and processing power you have available. You can either train your own embeddings or use an existing model that has already been trained.
Learn your word embedding
Learning your own embeddings is a good option when you have a training data set and the computational resources to train the model. The advantage of this route is that the embedding is optimised for your use case. Done correctly, this can yield better results than a pre-trained model. When developing your own word embedding, you have two primary choices:
- Learn it independently, in which case a model is trained to learn the embedding, which is then saved and used later as a component of another model for your task. This is a good strategy if you want to use the same embedding in various models.
- Learn it jointly, where the embedding is learned as a component of a larger model tailored to a given task. This is a good strategy if you only plan to use the embedding for one task.
Reuse existing word embeddings
Pre-trained word embeddings are frequently made freely available by researchers under a permissive license so that you can use them in your research or business endeavours. For instance, word2vec and GloVe word embeddings can be downloaded without charge. So instead of creating your embeddings from scratch, you can use these in your project. When it comes to using pre-trained embeddings, you have two primary choices:
- The static option. The embedding is used as part of your model but is kept fixed during training. This strategy is appropriate if the embedding is a good fit for your problem and produces good results.
- The update option. The model is seeded with the pre-trained embedding, but the embedding is updated jointly during model training. This can be a good option if you want to get the most out of both the model and the embedding on your task.
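The two options can be illustrated with a toy sketch. It assumes a plain-text layout with one word per line followed by its vector values, similar to the layout commonly used for downloadable GloVe vectors. The sample values, the gradient, and the `sgd_step` helper are all hypothetical. In a real project, the deep learning framework you use would handle freezing or updating the embedding layer.

```python
import io

# Stand-in for a pre-trained vector file (made-up values for illustration).
PRETRAINED_TEXT = """\
king 0.50 0.70
queen 0.52 -0.68
"""

def load_vectors(handle):
    """Parse 'word v1 v2 ...' lines into a {word: [floats]} dictionary."""
    table = {}
    for line in handle:
        parts = line.split()
        if parts:
            table[parts[0]] = [float(x) for x in parts[1:]]
    return table

def sgd_step(table, word, grad, lr=0.1, trainable=True):
    """Apply one gradient step to a word's vector unless the embedding is frozen."""
    if trainable:
        table[word] = [v - lr * g for v, g in zip(table[word], grad)]

static = load_vectors(io.StringIO(PRETRAINED_TEXT))
updated = load_vectors(io.StringIO(PRETRAINED_TEXT))

grad = [1.0, -1.0]  # hypothetical gradient from a downstream task loss

sgd_step(static, "king", grad, trainable=False)  # static option: vector unchanged
sgd_step(updated, "king", grad, trainable=True)  # update option: vector fine-tuned

print(static["king"])
print(updated["king"])
```

Keeping the embedding static preserves what was learned from the large pre-training corpus, while updating it lets the vectors drift towards whatever the downstream task rewards.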
Key takeaways
- Word embeddings have revolutionised the world of natural language processing. We can now reason with text in a way that was impossible before with a bag of words or the TF-IDF word vectorisation technique.
- There are three main word embedding algorithms: word2vec, GloVe, and FastText. Each is implemented slightly differently and has its own advantages and disadvantages. Understanding these differences will let you choose the right algorithm for your task.
- Depending on your problem, your data, and the processing power available to train a model, you might train your own embeddings or use a pre-trained model instead.
- Check out this article on sentence embedding to take it one step further.
What word embeddings have you used, or are you interested in training? Let us know in the comments below.