Part-of-speech (POS) Tagging In NLP: 4 Python How To Tutorials

by Neri Van Otten | Jan 24, 2023 | Data Science, Natural Language Processing

What is Part-of-speech (POS) tagging?

Part-of-speech (POS) tagging is a fundamental task in natural language processing (NLP) and is straightforward to do in Python. It involves labelling each word in a sentence with its corresponding POS tag. POS tags indicate the grammatical category of a word, such as noun, verb, adjective or adverb. The goal of POS tagging is to determine a sentence’s syntactic structure and identify each word’s role in the sentence.

There are two main types of POS tagging in NLP, and several Python libraries can be used for POS tagging, including NLTK, spaCy, and TextBlob. This article discusses the different types of POS taggers, the advantages and disadvantages of each, and provides code examples for the three most commonly used libraries in Python.

Several libraries can perform POS tagging in Python.

Types of Part-of-speech (POS) tagging in NLP

There are two main types of part-of-speech (POS) tagging in natural language processing (NLP):

  1. Rule-based POS tagging uses a set of linguistic rules and patterns to assign POS tags to words in a sentence. This method relies on a predefined set of grammatical rules and a dictionary of words with their POS tags. NLTK’s RegexpTagger, which assigns tags from hand-written regular-expression rules, is an example of a rule-based POS tagger (a minimal sketch follows below).
  2. Statistical POS tagging uses machine learning algorithms, such as Hidden Markov Models (HMM) or Conditional Random Fields (CRF), to predict POS tags based on the context of the words in a sentence. This method requires a large amount of training data to build its models. The spaCy library’s POS tagger is an example of a statistical POS tagger that uses a neural network-based model trained on the OntoNotes 5 corpus.

Both rule-based and statistical POS tagging have their advantages and disadvantages. Rule-based taggers are simpler to implement and understand but less accurate than statistical taggers. Statistical taggers, however, are more accurate but require a large amount of training data and computational resources.
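
To make the rule-based idea concrete, here is a minimal sketch of a rule-based tagger built with NLTK’s RegexpTagger. The regular-expression patterns are illustrative assumptions rather than a complete grammar; a real rule set would be far larger.

import nltk
from nltk.tag import RegexpTagger

# Tokenizer data for word_tokenize
nltk.download('punkt')

# A few illustrative patterns; the final catch-all rule defaults to 'NN' (noun)
patterns = [
    (r'.*ing$', 'VBG'),                # gerunds, e.g. "learning"
    (r'.*ed$', 'VBD'),                 # simple past, e.g. "walked"
    (r'.*s$', 'NNS'),                  # plural nouns, e.g. "desks"
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN'),                     # default: tag everything else as a noun
]

rule_based_tagger = RegexpTagger(patterns)

tokens = nltk.word_tokenize("I am learning NLP in Python")
print(rule_based_tagger.tag(tokens))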

Advantages and disadvantages of the different types of Part-of-speech (POS) tagging for NLP in Python

Rule-based part-of-speech (POS) taggers and statistical POS taggers are two different approaches to POS tagging in natural language processing (NLP). Each method has its advantages and disadvantages.

The benefits of rule-based Part-of-speech (POS) tagging:

  • Simple to implement and understand
  • Doesn’t require a lot of computational resources or training data
  • Can be easily customized to specific domains or languages

Disadvantages of rule-based Part-of-speech (POS) tagging:

  • Less accurate than statistical taggers
  • Limited by the quality and coverage of the rules
  • Can be difficult to maintain and update

The benefits of statistical Part-of-speech (POS) tagging:

  • More accurate than rule-based taggers
  • Don’t require a lot of human-written rules
  • Can learn from large amounts of training data

Disadvantages of statistical Part-of-speech (POS) tagging:

  • Require more computational resources and training data
  • Can be difficult to interpret and debug
  • Can be sensitive to the quality and diversity of the training data

In general, statistical POS taggers are recommended for most real-world use cases because they are more accurate and robust. However, rule-based POS taggers are still useful in some cases, for example in small or specialised domains where training data is unavailable, or for languages that are not well supported by existing statistical models.

Rule-based Part-of-speech (POS) tagging for NLP in Python code

1. NLTK Part-of-speech (POS) tagging

One common way to perform POS tagging in Python with the NLTK library is the pos_tag() function, which labels tokens with the Penn Treebank POS tag set. (By default, pos_tag() is backed by the statistical Averaged Perceptron Tagger covered in tutorial 4 below; for a purely rule-based NLTK tagger, see the RegexpTagger sketch earlier in this article.) For example:

import nltk

# Download the tokenizer data and the default tagger model
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "I am learning NLP in Python"

# Split the sentence into tokens, then tag each token
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)

This returns a list of tuples, each containing a word and its corresponding POS tag:

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('in', 'IN'), ('Python', 'NNP')]

It’s also possible to use other POS taggers, such as the Stanford POS Tagger, or taggers with better performance, such as the spaCy POS tagger, but these require additional setup and processing; a hedged example of calling the Stanford tagger through NLTK is shown below.
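
The sketch below only illustrates how NLTK’s StanfordPOSTagger wrapper is typically used; the model and jar paths are placeholders you would replace with the files from your own Stanford POS Tagger download, and a working Java installation is assumed. Depending on your NLTK version, the wrapper may emit a deprecation warning in favour of the CoreNLP interface.

from nltk.tag import StanfordPOSTagger
from nltk import word_tokenize

# Placeholder paths: point these at your local Stanford POS Tagger files
model_path = '/path/to/english-bidirectional-distsim.tagger'
jar_path = '/path/to/stanford-postagger.jar'

stanford_tagger = StanfordPOSTagger(model_path, jar_path)

tokens = word_tokenize("I am learning NLP in Python")
print(stanford_tagger.tag(tokens))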

NLTK POS tagger abbreviations

Here is a list of the available abbreviations and their meaning.

  • CC: coordinating conjunction
  • CD: cardinal digit
  • DT: determiner
  • EX: existential there
  • FW: foreign word
  • IN: preposition/subordinating conjunction
  • JJ: adjective (large)
  • JJR: adjective, comparative (larger)
  • JJS: adjective, superlative (largest)
  • LS: list item marker
  • MD: modal (could, will)
  • NN: noun, singular (cat, tree)
  • NNS: noun, plural (desks)
  • NNP: proper noun, singular (Sarah)
  • NNPS: proper noun, plural (Indians or Americans)
  • PDT: predeterminer (all, both, half)
  • POS: possessive ending (parent’s)
  • PRP: personal pronoun (hers, herself, him, himself)
  • PRP$: possessive pronoun (her, his, mine, my, our)
  • RB: adverb (occasionally, swiftly)
  • RBR: adverb, comparative (greater)
  • RBS: adverb, superlative (biggest)
  • RP: particle (about)
  • TO: infinitive marker (to)
  • UH: interjection (goodbye)
  • VB: verb, base form (ask)
  • VBG: verb, gerund (judging)
  • VBD: verb, past tense (pleaded)
  • VBN: verb, past participle (reunified)
  • VBP: verb, present tense, not 3rd person singular (wrap)
  • VBZ: verb, present tense, 3rd person singular (bases)
  • WDT: wh-determiner (that, what)
  • WP: wh-pronoun (who)
  • WRB: wh-adverb (how)
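
You don’t need to memorise these abbreviations; NLTK can print the definition of any Penn Treebank tag for you. A small sketch, assuming the 'tagsets' help resource has been downloaded:

import nltk
nltk.download('tagsets')

# Print the description and examples for a single tag
nltk.help.upenn_tagset('JJ')

# Regular expressions also work, e.g. all noun tags
nltk.help.upenn_tagset('NN.*')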

2. TextBlob Part-of-speech (POS) tagging

Here is an example of how to use the part-of-speech (POS) tagging functionality in the TextBlob library in Python:

from textblob import TextBlob

# Define a sentence
sentence = "I am learning NLP in Python"

# Create a TextBlob object
text_blob = TextBlob(sentence)

# Use the 'tags' property to get the POS tags
pos_tags = text_blob.tags
print(pos_tags)

This will output a list of tuples, where each tuple contains a word and its corresponding POS tag, using TextBlob’s default POS tagger.

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('in', 'IN'), ('Python', 'NNP')]

TextBlob also lets you swap in a different POS tagger. To use NLTK’s tagger explicitly, pass a pos_tagger argument when creating the TextBlob, like this:

from textblob import TextBlob
from textblob.taggers import NLTKTagger

# Define a sentence
sentence = "I am learning NLP in Python"

# Create a TextBlob object that uses NLTK's tagger
text_blob = TextBlob(sentence, pos_tagger=NLTKTagger())

# Use the 'tags' property to get the POS tags
pos_tags = text_blob.tags
print(pos_tags)

Keep in mind that when using the NLTK tagger, the NLTK library needs to be installed and its tagger model downloaded (for example with nltk.download('averaged_perceptron_tagger')).

TextBlob is a useful library for conveniently performing everyday NLP tasks, such as POS tagging, noun phrase extraction, sentiment analysis, etc. It is built on top of NLTK and provides a simple and easy-to-use API.
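
For example, the same TextBlob object also exposes noun phrases and sentiment scores. This is a minimal sketch; noun phrase extraction needs a couple of extra NLTK corpora (such as brown and punkt) the first time you run it:

from textblob import TextBlob

text_blob = TextBlob("I am learning NLP in Python. TextBlob makes everyday NLP tasks easy.")

# Noun phrase extraction
print(text_blob.noun_phrases)

# Sentiment analysis: polarity in [-1, 1], subjectivity in [0, 1]
print(text_blob.sentiment)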

Statistical Part-of-speech (POS) tagging for NLP in Python code

3. spaCy Part-of-speech (POS) tagging

Here is an example of how to use the part-of-speech (POS) tagging functionality in the spaCy library in Python:

import spacy

# Load the 'en_core_web_sm' model
nlp = spacy.load('en_core_web_sm')

# Define a sentence
sentence = "I am learning NLP in Python"

# Process the sentence using spaCy's NLP pipeline
doc = nlp(sentence)

# Iterate through the tokens and print the token text and POS tag
for token in doc:
    print(token.text, token.pos_)

This will output the token text and the POS tag for each token in the sentence:

I PRON
am AUX
learning VERB
NLP PROPN
in ADP
Python PROPN

The spaCy library’s POS tagger is based on a statistical model trained on the OntoNotes 5 corpus, and it can tag text with high accuracy. The same pipeline also provides other annotations, such as lemmas, dependency labels, and named entities.

Note that before running the code, you need to download the model you want to use, in this case en_core_web_sm. You can do this by running python -m spacy download en_core_web_sm on your command line (prefix it with ! if you are working in a Jupyter notebook). A short sketch of the extra annotations mentioned above follows.
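
Building on the same doc object, this sketch prints the coarse POS tag, the fine-grained Penn Treebank tag (token.tag_), the lemma and the dependency label for each token, plus any named entities spaCy detects:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("I am learning NLP in Python")

# Coarse POS tag, fine-grained Penn Treebank tag, lemma and dependency label
for token in doc:
    print(token.text, token.pos_, token.tag_, token.lemma_, token.dep_)

# Named entities detected in the sentence
for ent in doc.ents:
    print(ent.text, ent.label_)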

4. NLTK Averaged Perceptron Part-of-speech (POS) tagging

The Averaged Perceptron Tagger in NLTK is a statistical part-of-speech (POS) tagger that uses a machine learning algorithm called Averaged Perceptron. Here is an example of how to use it in Python:

import nltk

# Download the tokenizer data, the tagger model and the universal tagset mapping
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

sentence = "I am learning NLP in Python"

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# POS tagging with the Averaged Perceptron Tagger (NLTK's default),
# mapping the output to the universal tagset
pos_tags = nltk.pos_tag(tokens, tagset='universal')
print(pos_tags)

This will output a list of tuples, where each tuple contains a word and its corresponding POS tag, as produced by the Averaged Perceptron Tagger:

[('I', 'PRON'), ('am', 'VERB'), ('learning', 'VERB'), ('NLP', 'NOUN'), ('in', 'ADP'), ('Python', 'NOUN')]

You can see that the output tags are different from the previous NLTK example because the tagset='universal' argument maps the Penn Treebank tags to the coarser universal POS tagset.
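
A quick way to see the difference is to tag the same tokens with and without the tagset='universal' argument and compare the two outputs side by side:

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

tokens = nltk.word_tokenize("I am learning NLP in Python")

penn_tags = nltk.pos_tag(tokens)                           # Penn Treebank tags
universal_tags = nltk.pos_tag(tokens, tagset='universal')  # universal tags

# Print each word with both of its tags
for (word, penn), (_, universal) in zip(penn_tags, universal_tags):
    print(f"{word:10} {penn:6} {universal}")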

The averaged perceptron tagger is trained on a large corpus of text, which makes it more robust and accurate than simple rule-based taggers such as NLTK’s RegexpTagger. It also allows you to specify the tagset, i.e. the set of POS tags used for the output; in this case, it’s using the ‘universal’ tagset, a cross-lingual tagset useful for many NLP tasks in Python.

It’s important to note that the Averaged Perceptron Tagger requires loading the model before using it, which is why it’s necessary to download it using the nltk.download() function.
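
If you would rather train your own statistical tagger than rely on a pre-trained one, NLTK also includes a Hidden Markov Model (HMM) trainer. The sketch below trains it on a slice of the Penn Treebank sample corpus shipped with NLTK; this is a toy setup for illustration only, and a production tagger would need far more data and careful evaluation.

import nltk
from nltk.tag import hmm

# Download the tokenizer data and the Penn Treebank sample corpus
nltk.download('punkt')
nltk.download('treebank')

# Train a supervised HMM tagger on part of the Penn Treebank sample,
# with add-0.1 smoothing so unseen words don't get zero probability
train_sents = nltk.corpus.treebank.tagged_sents()[:3000]
trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(
    train_sents,
    estimator=lambda fd, bins: nltk.LidstoneProbDist(fd, 0.1, bins),
)

tokens = nltk.word_tokenize("I am learning NLP in Python")
print(hmm_tagger.tag(tokens))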

Conclusion

In conclusion, part-of-speech (POS) tagging is essential in natural language processing (NLP) and can be easily implemented using Python. The process involves labelling words in a sentence with their corresponding POS tags. There are two main types of POS tagging: rule-based and statistical.

Rule-based POS taggers use a set of linguistic rules and patterns to assign POS tags to words in a sentence. They are simple to implement and understand but less accurate than statistical taggers. NLTK’s RegexpTagger is an example of a rule-based POS tagger.

Statistical POS taggers use machine learning algorithms, such as Hidden Markov Models (HMM) or Conditional Random Fields (CRF), to predict POS tags based on the context of the words in a sentence. They are more accurate but require large amounts of training data and computational resources. NLTK’s default Averaged Perceptron Tagger and the spaCy POS tagger, a neural network-based model trained on the OntoNotes 5 corpus, are examples of statistical POS taggers.

Both rule-based and statistical POS tagging have their advantages and disadvantages. For most real-world use cases, statistical POS taggers are recommended because they are more accurate and robust, while rule-based taggers remain useful when training data or language support is limited.

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.
