Part-of-speech (POS) Tagging In NLP: 4 Python How To Tutorials

by Neri Van Otten | Jan 24, 2023 | Data Science, Natural Language Processing

What is Part-of-speech (POS) tagging?

Part-of-speech (POS) tagging is fundamental in natural language processing (NLP) and can be done in Python. It involves labelling words in a sentence with their corresponding POS tags. POS tags indicate the grammatical category of a word, such as noun, verb, adjective, adverb, etc. The goal of POS tagging is to determine a sentence’s syntactic structure and identify each word’s role in the sentence.

There are two main types of POS tagging in NLP, and several Python libraries can be used for POS tagging, including NLTK, spaCy, and TextBlob. This article discusses the different types of POS taggers, the advantages and disadvantages of each, and provides code examples for the three most commonly used libraries in Python.

Types of Part-of-speech (POS) tagging in NLP

There are two main types of part-of-speech (POS) tagging in natural language processing (NLP):

  1. Rule-based POS tagging uses a set of linguistic rules and patterns to assign POS tags to words in a sentence. This method relies on a predefined set of grammatical rules and a dictionary of words with their POS tags. NLTK’s RegexpTagger, which assigns tags from hand-written regular-expression patterns, is an example of a rule-based POS tagger.
  2. Statistical POS tagging uses machine learning algorithms, such as Hidden Markov Models (HMM), Conditional Random Fields (CRF), or neural networks, to predict POS tags based on the context of the words in a sentence. This method requires a large amount of annotated training data to create models. The spaCy library’s POS tagger is an example of a statistical POS tagger that uses a neural network-based model trained on the OntoNotes 5 corpus. NLTK’s pos_tag() function is another: by default it uses a pre-trained averaged perceptron model and returns Penn Treebank tags.

Both rule-based and statistical POS tagging have their advantages and disadvantages. Rule-based taggers are simpler to implement and understand but less accurate than statistical taggers. Statistical taggers, however, are more accurate but require a large amount of training data and computational resources.
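As a rough illustration of the rule-based approach, the sketch below combines a small word-to-tag dictionary with a few suffix and capitalisation patterns. The lexicon, the patterns, and the function name rule_based_tag are invented for this example and are nowhere near a complete grammar; real rule-based taggers rely on much larger lexicons and rule sets.

```python
import re

# Tiny hand-written lexicon: word -> Penn Treebank tag (illustrative only).
LEXICON = {"i": "PRP", "am": "VBP", "in": "IN", "the": "DT"}

# Fallback patterns, tried in order; the first match wins.
PATTERNS = [
    (re.compile(r".*ing$"), "VBG"),  # words ending in -ing
    (re.compile(r".*ly$"), "RB"),    # words ending in -ly
    (re.compile(r"^[A-Z]"), "NNP"),  # capitalised words
    (re.compile(r"^\d+$"), "CD"),    # numbers
]


def rule_based_tag(tokens):
    """Tag tokens with the lexicon first, then the patterns, then 'NN'."""
    tagged = []
    for token in tokens:
        tag = LEXICON.get(token.lower())
        if tag is None:
            tag = next((t for rx, t in PATTERNS if rx.match(token)), "NN")
        tagged.append((token, tag))
    return tagged


print(rule_based_tag("I am learning NLP in Python".split()))
# [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('in', 'IN'), ('Python', 'NNP')]
```

Every decision here is traceable to one rule, which is what makes this style easy to debug, but any word the rules do not anticipate simply falls through to the default tag.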

Advantages and disadvantages of the different types of Part-of-speech (POS) tagging for NLP in Python

Rule-based part-of-speech (POS) taggers and statistical POS taggers are two different approaches to POS tagging in natural language processing (NLP). Each method has its advantages and disadvantages.

The benefits of rule-based Part-of-speech (POS) tagging:

  • Simple to implement and understand
  • Doesn’t require a lot of computational resources or training data
  • Can be easily customized to specific domains or languages

Disadvantages of rule-based Part-of-speech (POS) tagging:

  • Generally less accurate than statistical taggers
  • Limited by the quality and coverage of the rules
  • Can be difficult to maintain and update

The benefits of statistical Part-of-speech (POS) tagging:

  • Generally more accurate than rule-based taggers
  • Doesn’t require many hand-written rules
  • Can learn from large amounts of training data

Disadvantages of statistical Part-of-speech (POS) tagging:

  • Requires more computational resources and training data
  • Can be difficult to interpret and debug
  • Can be sensitive to the quality and diversity of the training data

In general, statistical POS taggers are the recommended choice for most real-world use cases because they are more accurate and robust. Rule-based taggers are still useful in some cases, for example, in small or specialised domains where training data is unavailable, or for languages that existing statistical models do not support well.
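To make "learning from training data" concrete, here is a minimal sketch of the simplest statistical tagger: a most-frequent-tag (unigram) baseline. The three training sentences and the function names are invented for this example; real taggers are trained on corpora with many thousands of annotated sentences and also use the surrounding context.

```python
from collections import Counter, defaultdict

# Toy hand-labelled training data (Penn Treebank tags); invented for this example.
TRAINING_DATA = [
    [("I", "PRP"), ("am", "VBP"), ("reading", "VBG"), ("a", "DT"), ("book", "NN")],
    [("They", "PRP"), ("book", "VBP"), ("flights", "NNS"), ("in", "IN"), ("Python", "NNP")],
    [("The", "DT"), ("book", "NN"), ("is", "VBZ"), ("in", "IN"), ("a", "DT"), ("bag", "NN")],
]


def train_unigram_tagger(sentences):
    """Count how often each word carries each tag and keep the most frequent tag."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def unigram_tag(model, tokens, default="NN"):
    """Tag each token with its most frequent training tag, or a default if unseen."""
    return [(token, model.get(token.lower(), default)) for token in tokens]


model = train_unigram_tagger(TRAINING_DATA)
print(unigram_tag(model, "I book a flight".split()))
# [('I', 'PRP'), ('book', 'NN'), ('a', 'DT'), ('flight', 'NN')]
```

Note that "book" is tagged NN even though it is a verb in this sentence, because the baseline ignores context. Models such as HMMs, CRFs, and the averaged perceptron add information about neighbouring words and tags to resolve exactly this kind of ambiguity.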

Rule-based Part-of-speech (POS) tagging for NLP in Python code

1. NLTK Part-of-speech (POS) tagging

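One common way to perform POS tagging in Python is NLTK’s pos_tag() function, which returns tags from the Penn Treebank POS tag set. Note that pos_tag() is not itself rule-based: by default it uses a pre-trained averaged perceptron model, which section 4 covers in more detail. For example: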

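import nltk

# Download the tokenizer and tagger models (only needed once).
# Recent NLTK releases name these resources 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "I am learning NLP in Python"
tokens = nltk.word_tokenize(sentence)  # split the sentence into word tokens
pos_tags = nltk.pos_tag(tokens)        # tag each token with a Penn Treebank tag
print(pos_tags)

This prints a list of tuples, each containing a word and its POS tag.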

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('in', 'IN'), ('Python', 'NNP')]

It’s also possible to use other POS taggers, such as the Stanford POS Tagger or spaCy’s tagger, which can perform better but require additional setup and processing.

NLTK POS tagger abbreviations

The table below lists the Penn Treebank tag abbreviations that NLTK returns and their meanings.

Abbreviation  Meaning
CC            coordinating conjunction
CD            cardinal number
DT            determiner
EX            existential there
FW            foreign word
IN            preposition/subordinating conjunction
JJ            adjective (large)
JJR           adjective, comparative (larger)
JJS           adjective, superlative (largest)
LS            list marker
MD            modal (could, will)
NN            noun, singular (cat, tree)
NNS           noun, plural (desks)
NNP           proper noun, singular (Sarah)
NNPS          proper noun, plural (Indians, Americans)
PDT           predeterminer (all, both, half)
POS           possessive ending (parent’s)
PRP           personal pronoun (hers, herself, him, himself)
PRP$          possessive pronoun (her, his, mine, my, our)
RB            adverb (occasionally, swiftly)
RBR           adverb, comparative (greater)
RBS           adverb, superlative (biggest)
RP            particle (about)
TO            infinitive marker (to)
UH            interjection (goodbye)
VB            verb, base form (ask)
VBG           verb, gerund (judging)
VBD           verb, past tense (pleaded)
VBN           verb, past participle (reunified)
VBP           verb, present tense, not 3rd person singular (wrap)
VBZ           verb, present tense, 3rd person singular (bases)
WDT           wh-determiner (that, what)
WP            wh-pronoun (who)
WRB           wh-adverb (how)
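Because related tags share a prefix (all noun tags start with NN, all verb tags with VB), tagged output is easy to filter and count. The sketch below works on the output of the NLTK example above, copied in as a literal so that it runs without NLTK installed; the variable names are only for illustration.

```python
from collections import Counter

# Output of the NLTK example above, copied in so the snippet runs on its own.
pos_tags = [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'),
            ('NLP', 'NNP'), ('in', 'IN'), ('Python', 'NNP')]

# Count how many tokens received each tag.
tag_counts = Counter(tag for _, tag in pos_tags)

# Noun tags all start with 'NN' (NN, NNS, NNP, NNPS); verb tags start with 'VB'.
nouns = [word for word, tag in pos_tags if tag.startswith('NN')]
verbs = [word for word, tag in pos_tags if tag.startswith('VB')]

print(tag_counts.most_common(1))  # [('NNP', 2)]
print(nouns)                      # ['NLP', 'Python']
print(verbs)                      # ['am', 'learning']
```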

2. TextBlob Part-of-speech (POS) tagging

Here is an example of how to use the part-of-speech (POS) tagging functionality in the TextBlob library in Python:

from textblob import TextBlob

# Define a sentence
sentence = "I am learning NLP in Python"

# Create a TextBlob object
text_blob = TextBlob(sentence)

# Use the 'tags' property to get the POS tags
pos_tags = text_blob.tags
print(pos_tags)

This will output a list of tuples, where each tuple contains a word and its Penn Treebank POS tag, produced by TextBlob’s default tagger.

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('in', 'IN'), ('Python', 'NNP')]

TextBlob also lets you choose the tagger explicitly. It ships two implementations in textblob.taggers: PatternTagger, which follows the pattern library’s implementation, and NLTKTagger, which wraps NLTK’s statistical tagger. To select one, pass an instance as the pos_tagger argument, like this:

from textblob import TextBlob
from textblob.taggers import NLTKTagger

# Define a sentence
sentence = "I am learning NLP in Python"

# Create a TextBlob object that tags with NLTK's tagger
text_blob = TextBlob(sentence, pos_tagger=NLTKTagger())

# Use the 'tags' property to get the POS tags
pos_tags = text_blob.tags
print(pos_tags)

Keep in mind that NLTKTagger needs the NLTK tokenizer and tagger data to be downloaded first, for example by running python -m textblob.download_corpora.

TextBlob is a useful library for conveniently performing everyday NLP tasks, such as POS tagging, noun phrase extraction, sentiment analysis, etc. It is built on top of NLTK and provides a simple and easy-to-use API.

Statistical Part-of-speech (POS) tagging for NLP in Python code

3. spaCy Part-of-speech (POS) tagging

Here is an example of how to use the part-of-speech (POS) tagging functionality in the spaCy library in Python:

import spacy

# Load the 'en_core_web_sm' model
nlp = spacy.load('en_core_web_sm')

# Define a sentence
sentence = "I am learning NLP in Python"

# Process the sentence using spaCy's NLP pipeline
doc = nlp(sentence)

# Iterate through the tokens and print each token's text and POS tag
for token in doc:
    print(token.text, token.pos_)

This will output the token text and the POS tag for each token in the sentence:

I PRON
am AUX
learning VERB
NLP PROPN
in ADP
Python PROPN

The spaCy library’s POS tagger is based on a statistical model trained on the OntoNotes 5 corpus, and it can tag text with high accuracy. The same pipeline also provides other annotations, such as lemmas, dependency labels, and named entities.

Note that before running the code, you need to download the model you want to use, in this case, en_core_web_sm. You can do this by running python -m spacy download en_core_web_sm on your command line (in a Jupyter notebook, prefix the command with !).

4. NLTK Averaged Perceptron Part-of-speech (POS) tagging

The Averaged Perceptron Tagger in NLTK is a statistical part-of-speech (POS) tagger that uses a machine learning algorithm called Averaged Perceptron. Here is an example of how to use it in Python:

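import nltk

# Download the tokenizer, the tagger model, and the tagset mapping (only needed once).
# Recent NLTK releases name the first two 'punkt_tab' and
# 'averaged_perceptron_tagger_eng'.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')

sentence = "I am learning NLP in Python"

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# pos_tag() uses the Averaged Perceptron Tagger by default;
# tagset='universal' maps its Penn Treebank tags to the universal tagset
pos_tags = nltk.pos_tag(tokens, tagset='universal')
print(pos_tags)

This will output a list of tuples, where each tuple contains a word and its corresponding POS tag, produced by the Averaged Perceptron Tagger and mapped to the universal tagset: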

[('I', 'PRON'), ('am', 'VERB'), ('learning', 'VERB'), ('NLP', 'NOUN'), ('in', 'ADP'), ('Python', 'NOUN')]

You can see that the output tags are different from the first NLTK example. The tagger is the same, but tagset='universal' maps its Penn Treebank tags onto the coarser universal POS tagset.

The averaged perceptron tagger is trained on a corpus of annotated text and is the tagger that pos_tag() uses by default for English. The tagset argument lets you choose the set of POS tags in the output; here it is the ‘universal’ tagset, a small cross-lingual tagset that is useful for many NLP tasks in Python.
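The relationship between the Penn Treebank and universal tagsets can be sketched as a lookup table. The dictionary below covers only a handful of tags and is hand-written for this example; NLTK ships its own complete mapping, which this sketch does not reproduce.

```python
# Partial, hand-written Penn Treebank -> universal mapping (illustrative only).
PENN_TO_UNIVERSAL = {
    "PRP": "PRON", "VBP": "VERB", "VBG": "VERB", "VBZ": "VERB",
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "IN": "ADP", "JJ": "ADJ", "RB": "ADV", "DT": "DET",
}


def to_universal(tagged, unknown="X"):
    """Replace each Penn Treebank tag with its coarse universal tag."""
    return [(word, PENN_TO_UNIVERSAL.get(tag, unknown)) for word, tag in tagged]


penn_tags = [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'),
             ('NLP', 'NNP'), ('in', 'IN'), ('Python', 'NNP')]
print(to_universal(penn_tags))
# [('I', 'PRON'), ('am', 'VERB'), ('learning', 'VERB'), ('NLP', 'NOUN'), ('in', 'ADP'), ('Python', 'NOUN')]
```

Several fine-grained tags collapse into one coarse tag, so the mapping loses detail such as tense and number. It also explains why NLTK reports NOUN for "Python" while spaCy reports PROPN: the universal tagset used here has no separate proper-noun category, whereas spaCy’s coarse tags follow the larger Universal Dependencies set.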

It’s important to note that the Averaged Perceptron Tagger requires loading the model before using it, which is why it’s necessary to download it using the nltk.download() function.

Conclusion

In conclusion, part-of-speech (POS) tagging is essential in natural language processing (NLP) and can be easily implemented using Python. The process involves labelling words in a sentence with their corresponding POS tags. There are two main types of POS tagging: rule-based and statistical.

Rule-based POS taggers use a set of linguistic rules and patterns to assign POS tags to words in a sentence. They are simple to implement and understand but generally less accurate than statistical taggers. NLTK’s RegexpTagger is an example of a rule-based POS tagger.

Statistical POS taggers use machine learning algorithms, such as Hidden Markov Models (HMM) or Conditional Random Fields (CRF), to predict POS tags based on the context of the words in a sentence. They are more accurate but require a lot of training data and computational resources. The spaCy library’s POS tagger, a neural network-based model trained on the OntoNotes 5 corpus, and NLTK’s pos_tag() function, which uses an averaged perceptron model, are both examples of statistical POS taggers.

In general, statistical POS taggers are the recommended choice for most real-world use cases because they are more accurate and robust. Rule-based taggers remain useful for narrow domains or languages where training data is unavailable.

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.
