Part-of-speech (POS) Tagging In NLP: 4 Python How To Tutorials

by | Jan 24, 2023 | Data Science, Natural Language Processing

What is Part-of-speech (POS) tagging?

Part-of-speech (POS) tagging is fundamental in natural language processing (NLP) and can be done in Python. It involves labelling words in a sentence with their corresponding POS tags. POS tags indicate the grammatical category of a word, such as noun, verb, adjective, adverb, etc. The goal of POS tagging is to determine a sentence’s syntactic structure and identify each word’s role in the sentence.

There are two main types of POS tagging in NLP, and several Python libraries can be used for POS tagging, including NLTK, spaCy, and TextBlob. This article discusses the different types of POS taggers, the advantages and disadvantages of each, and provides code examples for the three most commonly used libraries in Python.

Several libraries do POS tagging in Python in NLP

Several libraries do POS tagging in Python.

Types of Part-of-speech (POS) tagging in NLP

There are two main types of part-of-speech (POS) tagging in natural language processing (NLP):

  1. Rule-based POS tagging uses a set of linguistic rules and patterns to assign POS tags to words in a sentence. This method relies on a predefined set of grammatical rules, a dictionary of words, and their POS tags. The NLTK library’s pos_tag() function is an example of a rule-based POS tagger that uses the Penn Treebank POS tag set.
  2. Statistical POS tagging uses machine learning algorithms, such as Hidden Markov Models (HMM) or Conditional Random Fields (CRF), to predict POS tags based on the context of the words in a sentence. This method requires a large amount of training data to create models. The SpaCy library’s POS tagger is an example of a statistical POS tagger that uses a neural network-based model trained on the OntoNotes 5 corpus.

Both rule-based and statistical POS tagging have their advantages and disadvantages. Rule-based taggers are simpler to implement and understand but less accurate than statistical taggers. Statistical taggers, however, are more accurate but require a large amount of training data and computational resources.

Advantages and disadvantages of the different types of Part-of-speech (POS) tagging for NLP in Python

Rule-based part-of-speech (POS) taggers and statistical POS taggers are two different approaches to POS tagging in natural language processing (NLP). Each method has its advantages and disadvantages.

The benefits of rule-based Part-of-speech (POS) tagging:

  • Simple to implement and understand
  • It doesn’t require a lot of computational resources or training data
  • It can be easily customized to specific domains or languages

Disadvantages of rule-based Part-of-speech (POS) tagging:

  • Less accurate than statistical taggers
  • Limited by the quality and coverage of the rules
  • It can be difficult to maintain and update

The Benefits of Statistical Part-of-speech (POS) Tagging:

  • More accurate than rule-based taggers
  • Don’t require a lot of human-written rules
  • Can learn from large amounts of training data

Disadvantages of statistical Part-of-speech (POS) Tagging:

  • Requires more computational resources and training data
  • It can be difficult to interpret and debug
  • Can be sensitive to the quality and diversity of the training data

In general, for most of the real-world use cases, it’s recommended to use statistical POS taggers, which are more accurate and robust. However, in some cases, the rule-based POS tagger is still useful, for example, for small or specific domains where the training data is unavailable or for specific languages that are not well-supported by existing statistical models.

Rule-based Part-of-speech (POS) tagging for NLP in Python code

1. NLTK Part-of-speech (POS) tagging

One common way to perform POS tagging in Python using the NLTK library is to use the pos_tag() function, which uses the Penn Treebank POS tag set. For example:

import nltk
nltk.download('averaged_perceptron_tagger')

sentence = "I am learning NLP in Python"
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)

This will make a list of tuples, each with a word and the POS tag that goes with it.

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('in', 'IN'), ('Python', 'NNP')]

It’s also possible to use other POS taggers, like Stanford POS Tagger, or others with better performance, like SpaCy POS Tagger, but they require additional setup and processing.

NLTK POS tagger abbreviations

Here is a list of the available abbreviations and their meaning.

AbbreviationMeaning
CCcoordinating conjunction
CDcardinal digit
DTdeterminer
EXexistential there
FWforeign word
INpreposition/subordinating conjunction
JJThis NLTK POS Tag is an adjective (large)
JJRadjective, comparative (larger)
JJSadjective, superlative (largest)
LSlist market
MDmodal (could, will)
NNnoun, singular (cat, tree)
NNSnoun plural (desks)
NNPproper noun, singular (sarah)
NNPSproper noun, plural (indians or americans)
PDTpredeterminer (all, both, half)
POSpossessive ending (parent\ ‘s)
PRPpersonal pronoun (hers, herself, him, himself)
PRP$possessive pronoun (her, his, mine, my, our )
RBadverb (occasionally, swiftly)
RBRadverb, comparative (greater)
RBSadverb, superlative (biggest)
RPparticle (about)
TOinfinite marker (to)
UHinterjection (goodbye)
VBverb (ask)
VBGverb gerund (judging)
VBDverb past tense (pleaded)
VBNverb past participle (reunified)
VBPverb, present tense not 3rd person singular(wrap)
VBZverb, present tense with 3rd person singular (bases)
WDTwh-determiner (that, what)
WPwh- pronoun (who)
WRBwh- adverb (how)

2. TextBlob Part-of-speech (POS) tagging

Here is an example of how to use the part-of-speech (POS) tagging functionality in the TextBlob library in Python:

from textblob import TextBlob

# Define a sentence
sentence = "I am learning NLP in Python"

# Create a TextBlob object
text_blob = TextBlob(sentence)

# Use the 'tags' property to get the POS tags
pos_tags = text_blob.tags
print(pos_tags)

This will output a list of tuples, where each tuple contains a word and its corresponding POS tag, using the pattern-based POS tagger.

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('in', 'IN'), ('Python', 'NNP')]

TextBlob also can tag using a statistical POS tagger. To use the NLTK POS Tagger, you can pass pos_tagger attribute to TextBlob, like this:

from textblob import TextBlob

# Define a sentence
sentence = "I am learning NLP in Python"

# Create a TextBlob object
text_blob = TextBlob(sentence, pos_tagger=nltk.pos_tag)

# Use the 'tags' property to get the POS tags
pos_tags = text_blob.tags
print(pos_tags)

Keep in mind that when using the NLTK POS Tagger, the NLTK library needs to be installed and the pos tagger downloaded.

TextBlob is a useful library for conveniently performing everyday NLP tasks, such as POS tagging, noun phrase extraction, sentiment analysis, etc. It is built on top of NLTK and provides a simple and easy-to-use API.

Statistical Part-of-speech (POS) tagging for NLP in Python code

3. Spacy Part-of-speech (POS) tagging

Here is an example of how to use the part-of-speech (POS) tagging functionality in the spaCy library in Python:

import spacy

# Load the 'en_core_web_sm' model
nlp = spacy.load('en_core_web_sm')

# Define a sentence
sentence = "I am learning NLP in Python"

# Process the sentence using spaCy's NLP pipeline
doc = nlp(sentence)

# Iterate through the token and print the token text and POS tag
for token in doc:
    print(token.text, token.pos_)

This will output the token text and the POS tag for each token in the sentence:

I PRON
am AUX
learning VERB
NLP PROPN
in ADP
Python PROPN

The spaCy library’s POS tagger is based on a statistical model trained on the OntoNotes 5 corpus, and it can tag the text with high accuracy. It also can tag other features, like lemma, dependency, ner, etc.

Note that before running the code, you need to download the model you want to use, in this case, en_core_web_sm . You can do this by running !python -m spacy download en_core_web_sm on your command line.

4. NLTK Part-of-speech (POS) tagging

The Averaged Perceptron Tagger in NLTK is a statistical part-of-speech (POS) tagger that uses a machine learning algorithm called Averaged Perceptron. Here is an example of how to use it in Python:

import nltk
nltk.download('averaged_perceptron_tagger')

sentence = "I am learning NLP in Python"

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# POS tagging using the Averaged Perceptron Tagger
pos_tags = nltk.pos_tag(tokens, tagset='universal', tagger='averaged_perceptron')
print(pos_tags)

This will output a list of tuples, where each tuple contains a word and its corresponding POS tag, using the Averaged Perceptron Tagger

[('I', 'PRON'), ('am', 'VERB'), ('learning', 'VERB'), ('NLP', 'NOUN'), ('in', 'ADP'), ('Python', 'NOUN')]

You can see that the output tags are different from the previous example because the Averaged Perceptron Tagger uses the universal POS tagset, which is different from the Penn Treebank POS tagset.

The averaged perceptron tagger is trained on a large corpus of text, which makes it more robust and accurate than the default rule-based tagger provided by NLTK. It also allows you to specify the tagset, which is the set of POS tags that can be used for tagging; in this case, it’s using the ‘universal’ tagset, which is a cross-lingual tagset, useful for many NLP tasks in Python.

It’s important to note that the Averaged Perceptron Tagger requires loading the model before using it, which is why it’s necessary to download it using the nltk.download() function.

Conclusion

In conclusion, part-of-speech (POS) tagging is essential in natural language processing (NLP) and can be easily implemented using Python. The process involves labelling words in a sentence with their corresponding POS tags. There are two main types of POS tagging: rule-based and statistical.

Rule-based POS taggers use a set of linguistic rules and patterns to assign POS tags to words in a sentence. They are simple to implement and understand but less accurate than statistical taggers. The NLTK library’s pos_tag() function is an example of a rule-based POS tagger that uses the Penn Treebank POS tag set.

Statistical POS taggers use machine learning algorithms, such as Hidden Markov Models (HMM) or Conditional Random Fields (CRF), to predict POS tags based on the context of the words in a sentence. They are more accurate but require much training data and computational resources. The SpaCy library’s POS tagger is an example of a statistical POS tagger that uses a neural network-based model trained on the OntoNotes 5 corpus.

Both rule-based and statistical POS tagging have their advantages and disadvantages. Rule-based taggers are simpler to implement and understand but less accurate than statistical taggers. Statistical taggers, however, are more accurate but require a large amount of training data and computational resources. In general, for most of the real-world use cases, it’s recommended to use statistical POS taggers, which are more accurate and robust.

About the Author

Neri Van Otten

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.

Recent Articles

Factor analysis example of what is a variable and what is a factor

Factor Analysis Made Simple & How To Tutorial In Python

What is Factor Analysis? Factor analysis is a potent statistical method for comprehending complex datasets' underlying structure or patterns. Its primary objective is...

glove vector example "king" is to "queen" as "man" is to "woman"

How To Implement GloVe Embeddings In Python: 3 Tutorials & 9 Alternatives

What are GloVe Embeddings? GloVe, or Global Vectors for Word Representation, is an unsupervised learning algorithm that obtains vector word representations by analyzing...

q-learning explained witha a mouse navigating a maze and updating it's internal staate

Reinforcement Learning: Q-learning & Deep Q-Learning Made Simple

What is Q-learning in Machine Learning? In machine learning, Q-learning is a foundational reinforcement learning technique for decision-making in uncertain...

DALL-E the text description "A cat sitting on a beach chair wearing sunglasses,"

Generative Artificial Intelligence (AI) Made Simple [Complete Guide With Models & Examples]

What is Generative Artificial Intelligence (AI)? Generative artificial intelligence (GAI) is a type of AI that can create new and original content, such as text, music,...

5 key aspects of GPT prompt engineering

How To Guide To Chat-GPT, GPT-3 & GPT-4 Prompt Engineering [10 Types]

What is GPT prompt engineering? GPT prompt engineering is the process of crafting prompts to guide the behaviour of GPT language models, such as Chat-GPT, GPT-3,...

What is LLM Orchestration

How to manage Large Language Models (LLM) — Orchestration Made Simple [5 Frameworks]

What is LLM Orchestration? LLM orchestration is the process of managing and controlling large language models (LLMs) in a way that optimizes their performance and...

Content-Based Recommendation System where a user is recommended similar movies to those they have already watched

How To Build Content-Based Recommendation System Made Easy [Top 8 Algorithms & Python Tutorial]

What is a Content-Based Recommendation System? A content-based recommendation system is a sophisticated breed of algorithms designed to understand and cater to...

Nodes and edges in a knowledge graph

Knowledge Graph: How To Tutorial In Python, LLM Comparison & 23 Tools & Libraries

What is a Knowledge Graph? A Knowledge Graph is a structured representation of knowledge that incorporates entities, relationships, and attributes to create a...

The mixed signals and need to be reverse-engineer to get the original sources with ICA

Independent Component Analysis (ICA) Made Simple & How To Tutorial In Python

What is Independent Component Analysis (ICA)? Independent Component Analysis (ICA) is a powerful and versatile technique in data analysis, offering a unique perspective...

0 Comments

Submit a Comment

Your email address will not be published. Required fields are marked *

nlp trends

2024 NLP Expert Trend Predictions

Get a FREE PDF with expert predictions for 2024. How will natural language processing (NLP) impact businesses? What can we expect from the state-of-the-art models?

Find out this and more by subscribing* to our NLP newsletter.

You have Successfully Subscribed!