Keyword extraction is figuring out which words and phrases in a piece of text are the most important. These keywords can be used to summarise the content of the text. A common use case is using keywords to improve search engine optimization (SEO) and make content more easily discoverable online.
Natural language processing (NLP) methods like part-of-speech tagging and phrase chunking are used in many keyword extraction methods. These methods can help you find the most important ideas and objects in a text and the most common words and phrases.
Another popular keyword extraction method is term frequency-inverse document frequency (TF-IDF) analysis. This method scores how important each word in a document is by comparing how often it appears in that document with how often it appears across a collection of documents. Words that appear frequently in one document but rarely in the rest of the collection score highest and make the best candidate keywords.
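To make that intuition concrete, here is a minimal sketch using scikit-learn's TfidfVectorizer on a made-up three-document corpus (the documents are invented purely for illustration):

from sklearn.feature_extraction.text import TfidfVectorizer
# "keyword" appears in every document, so it earns a lower IDF weight
# than words like "python" that appear in only one document
corpus = [
    "python makes keyword extraction easy",
    "keyword research improves search ranking",
    "keyword tools summarise long reports",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
# Print each term's TF-IDF score in the first document
for term, index in sorted(vectorizer.vocabulary_.items()):
    print(term, round(tfidf[0, index], 3))

Running this shows "keyword" scoring lower in the first document than rarer words such as "python" and "extraction", which is exactly the behaviour that makes TF-IDF useful for keyword ranking.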
Keyword extraction is vital for several reasons, from summarising a text's content to making it easier to search and discover. It involves using natural language processing (NLP) techniques to identify the essential words and phrases in a text automatically, and this can be done with a variety of methods, including part-of-speech tagging, phrase chunking, and TF-IDF analysis.
Overall, keyword extraction is a way to automatically find the most important words and phrases in a text by using NLP. This information is then used to summarise the text’s content and make it easier to find.
Many different machine learning algorithms can be used for keyword extraction, from simple statistical scoring such as TF-IDF to large neural language models such as BERT. The appropriate algorithm will depend on the specific characteristics and goals of the task.
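As one concrete illustration beyond TF-IDF, a well-known graph-based algorithm is TextRank, which scores words by running PageRank over a co-occurrence graph. Here is a heavily simplified sketch; the toy sentence and the sliding window of two words are arbitrary choices for illustration:

import networkx as nx
# Build a co-occurrence graph: connect each word to the words
# that follow it within a small sliding window
text = "keyword extraction finds the most important words and phrases in a text"
words = text.split()
graph = nx.Graph()
for i, word in enumerate(words):
    for neighbour in words[i + 1 : i + 3]:
        graph.add_edge(word, neighbour)
# PageRank scores each word by how central it is in the graph
scores = nx.pagerank(graph)
top_words = sorted(scores, key=scores.get, reverse=True)[:3]
print(top_words)

A full TextRank implementation would also filter by part of speech and merge adjacent keywords into phrases, but the graph-plus-PageRank core is what this sketch shows.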
Here is an example of keyword extraction using the NLTK (Natural Language Toolkit) library in Python:
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
# Download the tokenizer and tagger models (only needed once)
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
# Preprocess the text by removing punctuation and converting to lowercase
text = "This is a sample text for keyword extraction."
text = text.lower().replace(".", "")
# Tokenize the text into words
tokens = nltk.word_tokenize(text)
# Use part-of-speech tagging to identify the nouns in the text
tags = nltk.pos_tag(tokens)
nouns = [word for (word, tag) in tags if tag == "NN"]
# Use term frequency-inverse document frequency (TF-IDF) analysis to rank the nouns
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform([text])
# Get the top 3 nouns by TF-IDF score
top_nouns = sorted(nouns, key=lambda noun: tfidf[0, vectorizer.vocabulary_[noun]], reverse=True)[:3]
# Print the top 3 keywords
print(top_nouns)
This example preprocesses the text by removing punctuation and converting it to lowercase. Then, part-of-speech tagging is used to find the nouns in the text, and TF-IDF analysis is used to rank those nouns by importance, so the three highest-scoring nouns, such as "text", "keyword", and "extraction", are printed. One caveat: when the vectorizer is fitted on a single document, every term gets the same IDF weight, so terms that each appear once end up tied.
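In practice you would fit the vectorizer on more than one document so that the IDF term actually varies. Here is a hedged variant of that step, with the extra documents invented purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
# Fit TF-IDF on a small corpus so document frequencies differ between terms
corpus = [
    "this is a sample text for keyword extraction",
    "this text is about something else entirely",
    "a short note about nothing in particular",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
# Rank the first document's terms by their TF-IDF scores
scores = tfidf[0].toarray()[0]
terms = vectorizer.get_feature_names_out()
top_terms = sorted(zip(terms, scores), key=lambda pair: pair[1], reverse=True)[:3]
print(top_terms)

With a real corpus behind it, words that are distinctive to the target document rise to the top instead of tying with everything else.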
Here is an example of keyword extraction using the Spacy library in Python:
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
# Load the Spacy model and create a new document
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sample text for keyword extraction.")
# Use the noun_chunks property of the document to identify the noun phrases in the text
noun_phrases = [chunk.text.lower() for chunk in doc.noun_chunks]
# Use term frequency-inverse document frequency (TF-IDF) analysis to rank the noun phrases;
# the ngram range and token pattern let multi-word phrases receive scores
vectorizer = TfidfVectorizer(ngram_range=(1, 3), token_pattern=r"(?u)\b\w+\b")
tfidf = vectorizer.fit_transform([doc.text.lower()])
# Keep only the phrases that made it into the vocabulary, then rank them
candidates = [phrase for phrase in noun_phrases if phrase in vectorizer.vocabulary_]
top_phrases = sorted(candidates, key=lambda p: tfidf[0, vectorizer.vocabulary_[p]], reverse=True)[:3]
# Print the top 3 keywords
print(top_phrases)
This example first loads the Spacy model and creates a new document from the input text. Then, it uses the noun_chunks property of the document to identify the noun phrases in the text, and uses TF-IDF analysis (with an ngram range, so multi-word phrases can be scored) to rank those phrases according to their importance. Finally, it prints the top 3 ranked noun phrases, such as "keyword extraction" and "a sample text". As with the NLTK example, the ranking only becomes meaningful once the vectorizer is fitted on a larger corpus.
BERT (Bidirectional Encoder Representations from Transformers) is a powerful language model that can be used for various natural language processing tasks, including keyword extraction. It is trained on a large corpus of text data and learns to encode the meaning and context of words and phrases in a text, allowing it to accurately identify the most important words and phrases in a document.
Here is an example of keyword extraction using BERT in Python:
import torch
import transformers
# Load the BERT model and tokenizer; output_attentions=True makes the model return its attention weights
model = transformers.BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
tokenizer = transformers.BertTokenizer.from_pretrained("bert-base-uncased")
# Tokenize and encode the text
input_ids = tokenizer.encode("This is a sample text for keyword extraction.", add_special_tokens=True)
# Use BERT to encode the meaning and context of the words in the text
with torch.no_grad():
    outputs = model(torch.tensor([input_ids]))
# Average the last layer's attention over heads and query positions
# to get a rough importance score for each token
attention = outputs.attentions[-1]            # shape: (1, num_heads, seq_len, seq_len)
scores = attention.mean(dim=1)[0].mean(dim=0) # attention received by each token
# Rank token positions by score, skipping the special [CLS] and [SEP] tokens
positions = sorted(range(1, len(input_ids) - 1), key=lambda i: scores[i].item(), reverse=True)[:3]
# Decode the top tokens and print the top 3 keywords
top_keywords = [tokenizer.decode([input_ids[i]]) for i in positions]
print(top_keywords)
This example loads the BERT model and tokenizer and then uses the tokenizer to tokenize and encode the input text. Next, it runs the text through BERT to obtain the attention weights, averages them over heads and query positions to get a rough importance score for each token, and ranks the tokens by that score. Finally, it decodes the top-scoring tokens and prints them as the top 3 keywords. Bear in mind that attention weights are only a rough proxy for importance; the exact ranking depends on the model's attention patterns, and punctuation can score surprisingly high without further filtering.
Keyword extraction is the process of finding the important information in a text. This can be done in various ways with many different algorithms. Which algorithm you use will mostly depend on your use case, but a good place to get started is the TF-IDF algorithm. Then, depending on the results, you can either spend more time on pre-processing to remove unwanted keywords (a quick sketch of stop-word filtering follows below) or switch to a different method that puts more weight on a certain type of keyword.
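For that pre-processing step, a common first move is filtering out stop words before ranking anything. Here is a minimal sketch using NLTK's English stop-word list:

import nltk
from nltk.corpus import stopwords
# Download the stop-word list (only needed once)
nltk.download("stopwords")
text = "this is a sample text for keyword extraction"
stop_words = set(stopwords.words("english"))
# Drop common function words so they cannot surface as keywords
tokens = [word for word in text.split() if word not in stop_words]
print(tokens)  # ['sample', 'text', 'keyword', 'extraction']

The same filter can be applied before any of the methods above, or handled inside TfidfVectorizer via its stop_words parameter.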
The whole process can be straightforward or super complicated, depending on the keywords you want and the data you have. You might want to look at our NER article if you need specific named entities extracted.
What is your favourite keyword extracting algorithm or library? Let us know in the comments.