Natural Language Processing (NLP) feature engineering involves transforming raw textual data into numerical features that can be input into machine learning models. Feature engineering is a crucial step in NLP, as it determines the effectiveness of the models built for the task.
In this article, we summarise the 8 most common NLP feature engineering techniques and outline each one’s advantages and disadvantages, with Python code examples to get you started.
We also cover NLP feature engineering techniques specifically for social media data, which is often messy and short-form and therefore requires a different approach.
To make feature engineering more efficient, we frequently combine multiple pre-processing techniques into a single pre-processing pipeline. Check out our previous blog post on how to create a functional pre-processing pipeline.
Feature engineering is like building a pipeline: many parts need to work together to produce the final overall result.
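As a taste of what such a pipeline can look like, here is a minimal sketch (the preprocess function and its choice of steps are our own illustration) that chains together three of the techniques covered below:
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def preprocess(text):
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(text.lower())                 # tokenization
    tokens = [t for t in tokens if t.isalpha()]          # drop punctuation and numbers
    tokens = [t for t in tokens if t not in stop_words]  # stop-word removal
    return [lemmatizer.lemmatize(t) for t in tokens]     # lemmatization

print(preprocess("The quick brown foxes jumped over the lazy dogs."))
# ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']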
Tokenization involves splitting text into individual words or tokens and is often the first step in NLP feature engineering.
Advantages
Disadvantages
Example
Here’s an example in Python using the NLTK library:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')  # one-time download of the tokenizer models (newer NLTK versions may also need 'punkt_tab')
text = "The quick brown fox jumped over the lazy dog."
tokens = word_tokenize(text)
print(tokens)
Output:
['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.']
In this example, the text is tokenized into individual words or tokens, which can be used for further analysis.
Stop words are common words that carry little meaning on their own, such as “the” or “and”. Removing stop words can reduce the dimensionality of the feature space and improve the model’s efficiency.
Advantages
Disadvantages
Example
Here’s an example of stop-word removal in Python using the NLTK library:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
text = "The quick brown fox jumped over the lazy dog."
stop_words = set(stopwords.words('english'))  # requires nltk.download('stopwords')
tokens = word_tokenize(text)
filtered_tokens = [token for token in tokens if token not in stop_words]
print(filtered_tokens)
Output:
['The', 'quick', 'brown', 'fox', 'jumped', 'lazy', 'dog', '.']
In this example, the English stop words are removed from the text, leaving only the more informative words for analysis. Note that the capitalised “The” survives because NLTK’s stop-word list is lowercase and the comparison is case-sensitive.
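If capitalised stop words such as “The” should also be removed, a small variation on the list comprehension above compares lowercased tokens against the list:
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
print(filtered_tokens)  # ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog', '.']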
Stemming reduces words to their base form by stripping suffixes, while lemmatization reduces words to their dictionary form (lemma) by mapping them through a vocabulary. Both techniques can help reduce the dimensionality of the feature space and improve model accuracy.
Advantages
Disadvantages
Example
Here’s an example of stemming and lemmatization in Python using the NLTK library:
Stemming
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
text = "The quick brown foxes jumped over the lazy dogs."
stemmer = PorterStemmer()
tokens = word_tokenize(text)
stemmed_tokens = [stemmer.stem(token) for token in tokens]
print(stemmed_tokens)
Output:
['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog', '.']
In this example, the words are reduced to their stem form, which can help identify patterns and relationships in the text. Note that stems such as 'lazi' are not valid English words: stemming applies crude suffix-stripping rules, trading linguistic accuracy for speed.
Lemmatization
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
text = "The quick brown foxes jumped over the lazy dogs."
lemmatizer = WordNetLemmatizer()  # requires nltk.download('wordnet')
tokens = word_tokenize(text)
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
print(lemmatized_tokens)
Output:
['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.']
In this example, the words are reduced to their base form, which can help identify patterns and relationships in the text. Compared to stemming, lemmatization generally produces more accurate results, since its output is always a valid dictionary word.
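By default, WordNetLemmatizer treats every word as a noun, which is why 'jumped' was left unchanged above. Passing a part-of-speech hint changes the result:
print(lemmatizer.lemmatize('jumped', pos='v'))  # 'jump'
print(lemmatizer.lemmatize('foxes', pos='n'))   # 'fox'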
N-grams are sequences of n adjacent words. N-grams can capture more context and help the model better understand the meaning of the text.
Advantages
Disadvantages
Example
Here’s an example of generating n-grams in Python using the NLTK library:
import nltk
from nltk.util import ngrams
text = "The quick brown foxes jumped over the lazy dogs."
tokens = nltk.word_tokenize(text)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(list(bigrams))
print(list(trigrams))
Output:
[('The', 'quick'), ('quick', 'brown'), ('brown', 'foxes'), ('foxes', 'jumped'), ('jumped', 'over'), ('over', 'the'), ('the', 'lazy'), ('lazy', 'dogs'), ('dogs', '.')]
[('The', 'quick', 'brown'), ('quick', 'brown', 'foxes'), ('brown', 'foxes', 'jumped'), ('foxes', 'jumped', 'over'), ('jumped', 'over', 'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dogs'), ('lazy', 'dogs', '.')]
In this example, bigrams and trigrams are generated from the text, which can be used to identify patterns and relationships between adjacent words or phrases in the data.
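If you want n-grams as model-ready features rather than Python tuples, scikit-learn’s vectorizers can generate them directly via the ngram_range parameter. A minimal sketch (get_feature_names_out assumes scikit-learn 1.0 or later):
from sklearn.feature_extraction.text import CountVectorizer
# ngram_range=(1, 2) produces unigram and bigram features in a single matrix
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(["The quick brown foxes jumped over the lazy dogs."])
print(vectorizer.get_feature_names_out())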
Part-of-speech (POS) tagging involves labelling each word in a sentence with its corresponding part of speech, such as noun, verb, or adjective. This information can be used to build more advanced features for the model.
Advantages
Disadvantages
Example
Here’s an example of POS tagging in Python using the NLTK library:
import nltk
text = "The quick brown foxes jumped over the lazy dogs."
tokens = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)  # requires nltk.download('averaged_perceptron_tagger')
print(pos_tags)
Output:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('foxes', 'NNS'), ('jumped', 'VBD'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dogs', 'NNS'), ('.', '.')]
In this example, each word in the text is assigned a part-of-speech tag based on its context and usage. The resulting POS tags can be used to identify patterns and relationships between different parts of speech in the text.
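One simple way to turn POS tags into features is to keep only content-bearing words. The sketch below (our own illustration, continuing the example above) keeps nouns, verbs, and adjectives, whose Penn Treebank tags start with 'NN', 'VB', or 'JJ':
content_tokens = [word for word, tag in pos_tags if tag.startswith(('NN', 'VB', 'JJ'))]
print(content_tokens)  # ['quick', 'brown', 'foxes', 'jumped', 'lazy', 'dogs']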
Named Entity Recognition (NER) involves identifying and categorizing named entities in text, such as people, organizations, and locations. This information can also be used to build more advanced features for the model.
Advantages
Disadvantages
Example
Here’s an example of NER in Python using the NLTK library:
import nltk
text = "John Smith works at Google in New York City."
tokens = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)
ner_tags = nltk.ne_chunk(pos_tags)  # requires nltk.download('maxent_ne_chunker') and nltk.download('words')
print(ner_tags)
Output:
(S
  (PERSON John/NNP)
  (PERSON Smith/NNP)
  works/VBZ
  at/IN
  (ORGANIZATION Google/NNP)
  in/IN
  (GPE New/NNP York/NNP City/NNP)
  ./.)
In this example, named entities such as PERSON, ORGANIZATION, and GPE (geo-political entity) are identified and classified. The resulting NER tags can be used to extract important information from the text and to identify patterns and relationships between named entities.
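The result returned by ne_chunk is an NLTK Tree. To get a flat list of (entity, label) pairs, you can walk its subtrees; a minimal sketch, continuing the example above:
entities = []
for subtree in ner_tags:
    if hasattr(subtree, 'label'):  # named entities are nested Tree objects; plain tokens are (word, tag) tuples
        entity = " ".join(word for word, tag in subtree.leaves())
        entities.append((entity, subtree.label()))
print(entities)  # [('John', 'PERSON'), ('Smith', 'PERSON'), ('Google', 'ORGANIZATION'), ('New York City', 'GPE')]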
Term Frequency-Inverse Document Frequency (TF-IDF) is a technique that assigns weights to words based on how often they appear in a document and how rare they are across the corpus. This can help the model identify important words or phrases in the text.
Advantages
Disadvantages
Example
Here’s an example of TF-IDF in Python using the scikit-learn library:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [ "The quick brown fox jumps over the lazy dog.",
"The quick brown foxes jump over the lazy dogs and cats.",
"The lazy dogs and cats watch the quick brown foxes jump over the moon."]
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(corpus)
print(tfidf_matrix.toarray())
Output:
[[0. 0. 0. 0.51785612 0. 0. 0. 0. 0. 0. 0.68091856 0.51785612 0. ]
[0. 0. 0. 0.46519584 0. 0. 0.59817854 0. 0. 0. 0. 0.46519584 0.59817854]
[0.33682422 0.33682422 0.33682422 0.30794004 0.33682422 0.33682422 0. 0.33682422 0.33682422 0.33682422 0. 0.30794004 0. ]]
In this example, TF-IDF represents the importance of each word in a corpus of three documents. The resulting TF-IDF matrix can be used to identify important keywords and concepts in the corpus, measure the relevance of a document to a query or search term, and cluster similar documents based on the similarity of their content.
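To make the matrix easier to interpret, you can map column indices back to vocabulary terms. A minimal sketch, continuing the example above (get_feature_names_out assumes scikit-learn 1.0 or later; older versions use get_feature_names):
import numpy as np
terms = tfidf_vectorizer.get_feature_names_out()
for i, row in enumerate(tfidf_matrix.toarray()):
    top = np.argmax(row)  # column index of the highest-weighted term
    print(f"Document {i}: '{terms[top]}' (weight {row[top]:.3f})")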
Word embeddings are dense vector representations of words, typically learned with neural networks, that capture semantic meaning. Word embeddings can be used as input features for deep learning models.
Advantages
Disadvantages
Example
Here’s an example of word embeddings in Python using the Gensim library:
from gensim.models import Word2Vec
sentences = ["The quick brown fox jumps over the lazy dog".split(),
             "The lazy dog watches the quick brown fox".split(),
             "The quick brown cat jumps over the lazy dog".split(),
             "The lazy dog watches the quick brown cat".split()]
model = Word2Vec(sentences, min_count=1)  # min_count=1 keeps every word in this tiny corpus
print(model.wv['quick'])  # the learned vector, 100 dimensions by default
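In Gensim 4 and later, word vectors are accessed through the model’s wv attribute, which also supports similarity queries. For example (results will vary between runs on such a tiny corpus):
print(model.wv.most_similar('quick', topn=3))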
NLP feature engineering for social media data presents some unique challenges due to the informal language used and the abundance of noise in the data. Common adjustments include normalising or removing URLs, @mentions, and hashtags, converting emojis and emoticons into text, expanding slang and abbreviations, and collapsing repeated characters (e.g. "soooo").
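As one illustration, here is a minimal regex-based cleaning sketch (the clean_tweet function and its patterns are our own illustration, not from a specific library):
import re

def clean_tweet(text):
    text = re.sub(r'http\S+', '', text)         # strip URLs
    text = re.sub(r'@\w+', '', text)            # strip @mentions
    text = re.sub(r'#(\w+)', r'\1', text)       # keep hashtag text, drop the '#'
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)  # squeeze characters repeated 3+ times
    text = re.sub(r'\s+', ' ', text)            # collapse whitespace
    return text.strip()

print(clean_tweet("soooo happy!! check https://t.co/abc @friend #MachineLearning"))
# 'soo happy!! check MachineLearning'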
NLP feature engineering techniques such as tokenization, stop-word removal, stemming and lemmatization, n-grams, POS tagging, Named Entity Recognition, TF-IDF, and word embeddings are essential for processing and analyzing text data in natural language processing.
Each method has pros and cons, and which one to use depends on the specific use case and the kind of text data being analyzed.
For example, tokenization and stop-word removal are basic techniques that can help simplify text data and make it more manageable. Stemming and lemmatization can help reduce the number of words to analyze while still capturing their essence. N-grams can help capture context and relationships between words, while POS tagging and Named Entity Recognition can help identify the grammatical structure and named entities in text data. TF-IDF can help identify the most important words in a document or corpus, while Word Embeddings can capture more complex relationships between words.
Overall, NLP feature engineering is an important part of natural language processing that can help you extract meaningful insights and information from text data.