Top 4 Easy Ways To Implement POS Tagging In NLP Using Python

by Neri Van Otten | Jan 24, 2023 | Data Science, Natural Language Processing

What is POS tagging?

Part-of-speech (POS) tagging is fundamental in natural language processing (NLP) and can be carried out in Python. It involves labelling words in a sentence with their corresponding POS tags. POS tags indicate the grammatical category of a word, such as noun, verb, adjective, adverb, etc. The goal of POS tagging is to determine a sentence’s syntactic structure and identify each word’s role in the sentence.

There are two main types of POS tagging in NLP, and several Python libraries can be used for POS tagging, including NLTK, spaCy, and TextBlob. This article discusses the different types of POS taggers, the advantages and disadvantages of each, and provides code examples for the three most commonly used libraries in Python.

Several libraries can be used for POS tagging in Python.

Types of POS tagging in NLP

There are two main types of part-of-speech (POS) tagging in natural language processing (NLP):

  1. Rule-based POS tagging uses a set of linguistic rules and patterns to assign POS tags to words in a sentence. This method relies on a predefined set of grammatical rules and a dictionary of words with their POS tags. The NLTK library’s RegexpTagger, which assigns tags from hand-written regular-expression patterns, is an example of a rule-based POS tagger (a short sketch follows below).
  2. Statistical POS tagging uses machine learning algorithms, such as Hidden Markov Models (HMM) or Conditional Random Fields (CRF), to predict POS tags based on the context of the words in a sentence. This method requires a large amount of training data to create models. The spaCy library’s POS tagger is an example of a statistical POS tagger that uses a neural network-based model trained on the OntoNotes 5 corpus.

Both rule-based and statistical POS tagging have their advantages and disadvantages. Rule-based taggers are simpler to implement and understand but less accurate than statistical taggers. Statistical taggers, however, are more accurate but require a large amount of training data and computational resources.
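
To make the rule-based approach concrete, here is a minimal sketch using NLTK’s RegexpTagger; the handful of patterns shown is purely illustrative, not a complete grammar:

from nltk.tag import RegexpTagger

# Each (regular expression, tag) pair is a hand-written rule; the first match wins
patterns = [
    (r'.*ing$', 'VBG'),                        # gerunds, e.g. "learning"
    (r'.*ed$', 'VBD'),                         # simple past, e.g. "pleaded"
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),          # cardinal numbers
    (r'^(I|you|he|she|it|we|they)$', 'PRP'),   # a few personal pronouns
    (r'.*', 'NN'),                             # default: tag everything else as a noun
]

rule_based_tagger = RegexpTagger(patterns)
print(rule_based_tagger.tag("I am learning NLP in Python".split()))

Most words fall through to the default NN rule here, which illustrates the coverage limitation of hand-written rules discussed below.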

Advantages and disadvantages of the different types of POS taggers for NLP in Python

Rule-based part-of-speech (POS) taggers and statistical POS taggers are two different approaches to POS tagging in natural language processing (NLP). Each method has its advantages and disadvantages.

The benefits of rule-based POS taggers:

  • Simple to implement and understand
  • Don’t require a lot of computational resources or training data
  • Can be easily customized to specific domains or languages

Disadvantages of rule-based POS taggers:

  • Less accurate than statistical taggers
  • Limited by the quality and coverage of the rules
  • Can be difficult to maintain and update

Benefits of statistical POS taggers:

  • More accurate than rule-based taggers
  • Don’t require a lot of human-written rules
  • Can learn from large amounts of training data

Disadvantages of statistical POS taggers:

  • Require more computational resources and training data
  • Can be difficult to interpret and debug
  • Can be sensitive to the quality and diversity of the training data

In general, statistical POS taggers are recommended for most real-world use cases because they are more accurate and robust. However, rule-based POS taggers are still useful in some cases, for example in small or specialised domains where training data is unavailable, or for languages that are not well supported by existing statistical models.

POS tagging for NLP in Python code with NLTK and TextBlob

1. NLTK

One common way to perform POS tagging in Python is the NLTK library’s pos_tag() function, which tags tokens with the Penn Treebank POS tag set. Under the hood, pos_tag() uses the pretrained averaged perceptron model described in section 4 below, so it is statistical rather than strictly rule-based, but it is the simplest place to start. For example:

import nltk
nltk.download('punkt')  # tokenizer models used by word_tokenize
nltk.download('averaged_perceptron_tagger')  # the default pos_tag model

sentence = "I am learning NLP in Python"
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)

This returns a list of tuples, each containing a word and its corresponding POS tag:

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('in', 'IN'), ('Python', 'NNP')]

It’s also possible to use other POS taggers, such as the Stanford POS Tagger, or higher-accuracy options such as the spaCy POS tagger, but these require additional setup and processing.

NLTK POS tagger abbreviations

Here is a list of the available tag abbreviations and their meanings.

  • CC: coordinating conjunction
  • CD: cardinal digit
  • DT: determiner
  • EX: existential there
  • FW: foreign word
  • IN: preposition/subordinating conjunction
  • JJ: adjective (large)
  • JJR: adjective, comparative (larger)
  • JJS: adjective, superlative (largest)
  • LS: list item marker
  • MD: modal (could, will)
  • NN: noun, singular (cat, tree)
  • NNS: noun, plural (desks)
  • NNP: proper noun, singular (Sarah)
  • NNPS: proper noun, plural (Indians, Americans)
  • PDT: predeterminer (all, both, half)
  • POS: possessive ending (parent's)
  • PRP: personal pronoun (hers, herself, him, himself)
  • PRP$: possessive pronoun (her, his, mine, my, our)
  • RB: adverb (occasionally, swiftly)
  • RBR: adverb, comparative (greater)
  • RBS: adverb, superlative (biggest)
  • RP: particle (about)
  • TO: infinitive marker (to)
  • UH: interjection (goodbye)
  • VB: verb, base form (ask)
  • VBG: verb, gerund (judging)
  • VBD: verb, past tense (pleaded)
  • VBN: verb, past participle (reunified)
  • VBP: verb, present tense, not 3rd person singular (wrap)
  • VBZ: verb, present tense, 3rd person singular (bases)
  • WDT: wh-determiner (that, what)
  • WP: wh-pronoun (who)
  • WRB: wh-adverb (how)
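
If you would rather look these tags up programmatically, NLTK ships descriptions of the Penn Treebank tagset. A small sketch, assuming the 'tagsets' resource is available through nltk.download():

import nltk
from nltk.help import upenn_tagset

nltk.download('tagsets')  # documentation used by nltk.help

upenn_tagset('JJ')   # definition and examples for a single tag
upenn_tagset()       # the full Penn Treebank tagset with descriptions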

2. TextBlob

Here is an example of how to use the part-of-speech (POS) tagging functionality in the TextBlob library in Python:

from textblob import TextBlob

# Define a sentence
sentence = "I am learning NLP in Python"

# Create a TextBlob object
text_blob = TextBlob(sentence)

# Use the 'tags' property to get the POS tags
pos_tags = text_blob.tags
print(pos_tags)

This will output a list of tuples, where each tuple contains a word and its corresponding POS tag, using TextBlob’s default POS tagger.

[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('in', 'IN'), ('Python', 'NNP')]

TextBlob also lets you choose the tagger implementation explicitly. To use the NLTK-based tagger, pass a pos_tagger argument to TextBlob, like this:

from textblob import TextBlob
from textblob.taggers import NLTKTagger

# Define a sentence
sentence = "I am learning NLP in Python"

# Create a TextBlob object that uses the NLTK-based tagger
text_blob = TextBlob(sentence, pos_tagger=NLTKTagger())

# Use the 'tags' property to get the POS tags
pos_tags = text_blob.tags
print(pos_tags)

Keep in mind that when using the NLTK-based tagger, the NLTK library needs to be installed and its tagger model downloaded (for example via nltk.download('averaged_perceptron_tagger')).
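
TextBlob also bundles a pattern-based implementation, PatternTagger. Here is a minimal sketch of selecting it explicitly:

from textblob import TextBlob
from textblob.taggers import PatternTagger

sentence = "I am learning NLP in Python"

# Create a TextBlob object that uses the bundled pattern-based tagger
text_blob = TextBlob(sentence, pos_tagger=PatternTagger())
print(text_blob.tags)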

TextBlob is a useful library for conveniently performing everyday NLP tasks, such as POS tagging, noun phrase extraction, sentiment analysis, etc. It is built on top of NLTK and provides a simple and easy-to-use API.
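
As a quick illustration of how those tasks sit alongside POS tagging (noun phrase extraction assumes the extra corpora have been installed with python -m textblob.download_corpora):

from textblob import TextBlob

blob = TextBlob("I am learning NLP in Python because TextBlob makes common tasks easy")

print(blob.tags)          # part-of-speech tags
print(blob.noun_phrases)  # noun phrase extraction
print(blob.sentiment)     # sentiment: polarity and subjectivity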

Statistical POS tagging for NLP in Python code

3. spaCy

Here is an example of how to use the part-of-speech (POS) tagging functionality in the spaCy library in Python:

import spacy

# Load the 'en_core_web_sm' model
nlp = spacy.load('en_core_web_sm')

# Define a sentence
sentence = "I am learning NLP in Python"

# Process the sentence using spaCy's NLP pipeline
doc = nlp(sentence)

# Iterate through the tokens and print each token's text and POS tag
for token in doc:
    print(token.text, token.pos_)

This will output the token text and the POS tag for each token in the sentence:

I PRON
am AUX
learning VERB
NLP PROPN
in ADP
Python PROPN

The spaCy library’s POS tagger is based on a statistical model trained on the OntoNotes 5 corpus, and it tags text with high accuracy. It also produces other annotations, such as lemmas, dependency labels, and named entities.

Note that before running the code, you need to download the model you want to use, in this case en_core_web_sm. You can do this by running python -m spacy download en_core_web_sm on the command line (prefix the command with ! if you are working in a notebook).
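
If you also want Penn Treebank-style tags from spaCy, each token carries a fine-grained tag alongside the coarse universal tag shown above, and spacy.explain() returns a human-readable description:

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp("I am learning NLP in Python")

for token in doc:
    # pos_ is the coarse universal tag, tag_ is the fine-grained Penn Treebank tag
    print(token.text, token.pos_, token.tag_, spacy.explain(token.tag_))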

4. NLTK (Averaged Perceptron Tagger)

The Averaged Perceptron Tagger in NLTK is a statistical part-of-speech (POS) tagger that uses a machine learning algorithm called Averaged Perceptron. Here is an example of how to use it in Python:

import nltk
nltk.download('punkt')  # tokenizer models used by word_tokenize
nltk.download('averaged_perceptron_tagger')  # the averaged perceptron model
nltk.download('universal_tagset')  # mapping used when tagset='universal'

sentence = "I am learning NLP in Python"

# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)

# POS tagging: pos_tag() uses the Averaged Perceptron Tagger by default;
# tagset='universal' maps the Penn Treebank tags to the universal tagset
pos_tags = nltk.pos_tag(tokens, tagset='universal')
print(pos_tags)

This will output a list of tuples, where each tuple contains a word and its corresponding POS tag, this time drawn from the universal tagset:

[('I', 'PRON'), ('am', 'VERB'), ('learning', 'VERB'), ('NLP', 'NOUN'), ('in', 'ADP'), ('Python', 'NOUN')]

You can see that the output tags are different from the previous example because tagset='universal' maps the Penn Treebank tags to the coarser universal POS tagset.

The averaged perceptron tagger is trained on a large corpus of text, which makes it more robust and accurate than simple rule-based taggers such as the RegexpTagger shown earlier. The tagset argument lets you choose the set of POS tags used in the output; here it is the ‘universal’ tagset, a cross-lingual tagset useful for many NLP tasks in Python.

It’s important to note that the Averaged Perceptron Tagger requires loading the model before using it, which is why it’s necessary to download it using the nltk.download() function.
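
Equivalently, you can instantiate the underlying tagger class directly, which avoids reloading the model when tagging many sentences. A minimal sketch using NLTK’s PerceptronTagger:

import nltk
from nltk.tag.perceptron import PerceptronTagger

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Load the pretrained averaged perceptron model once
tagger = PerceptronTagger()

sentences = ["I am learning NLP in Python", "POS tagging is a common preprocessing step"]
for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    print(tagger.tag(tokens))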

Conclusion

In conclusion, part-of-speech (POS) tagging is essential in natural language processing (NLP) and can be easily implemented using Python. The process involves labelling words in a sentence with their corresponding POS tags. There are two main types of POS tagging: rule-based and statistical.

Rule-based POS taggers use a set of linguistic rules and patterns to assign POS tags to words in a sentence. They are simple to implement and understand but less accurate than statistical taggers. The NLTK library’s RegexpTagger is an example of a rule-based POS tagger.

Statistical POS taggers use machine learning algorithms, such as Hidden Markov Models (HMM) or Conditional Random Fields (CRF), to predict POS tags based on the context of the words in a sentence. They are more accurate but require large amounts of training data and computational resources. The spaCy library’s POS tagger is an example of a statistical POS tagger that uses a neural network-based model trained on the OntoNotes 5 corpus.

Both rule-based and statistical POS tagging have their advantages and disadvantages. Rule-based taggers are simpler to implement and understand but less accurate than statistical taggers. Statistical taggers, however, are more accurate but require a large amount of training data and computational resources. In general, for most of the real-world use cases, it’s recommended to use statistical POS taggers, which are more accurate and robust.

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence and a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation, dedicated to making your projects succeed.
