What is Part-of-speech (POS) tagging?
Part-of-speech (POS) tagging is fundamental in natural language processing (NLP) and can be done in Python. It involves labelling words in a sentence with their corresponding POS tags. POS tags indicate the grammatical category of a word, such as noun, verb, adjective, adverb, etc. The goal of POS tagging is to determine a sentence’s syntactic structure and identify each word’s role in the sentence.
There are two main types of POS tagging in NLP, and several Python libraries can be used for POS tagging, including NLTK, spaCy, and TextBlob. This article discusses the different types of POS taggers, the advantages and disadvantages of each, and provides code examples for the three most commonly used libraries in Python.
Types of Part-of-speech (POS) tagging in NLP
There are two main types of part-of-speech (POS) tagging in natural language processing (NLP):
- Rule-based POS tagging uses a set of linguistic rules and patterns to assign POS tags to words in a sentence. This method relies on a predefined set of grammatical rules and a dictionary of words with their POS tags. NLTK’s RegexpTagger and TextBlob’s PatternTagger are examples of rule-based taggers.
- Statistical POS tagging uses machine learning algorithms, such as Hidden Markov Models (HMM) or Conditional Random Fields (CRF), to predict POS tags based on the context of the words in a sentence. This method requires a large amount of annotated training data to create models. The spaCy library’s POS tagger is a statistical tagger that uses a neural network-based model trained on the OntoNotes 5 corpus, and the NLTK library’s pos_tag() function uses an averaged perceptron model that outputs tags from the Penn Treebank POS tag set.
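To make the rule-based idea concrete, here is a minimal sketch in plain Python: a small dictionary lookup followed by ordered suffix and shape rules, with a default tag as the last resort. The lexicon, the rules, and the function name rule_based_tag are made up for illustration and are not taken from any library.

```python
import re

# Hypothetical mini-lexicon: words whose tag is looked up directly.
LEXICON = {
    "i": "PRP",
    "am": "VBP",
    "in": "IN",
    "the": "DT",
}

# Hypothetical suffix/shape rules, tried in order; the first match wins.
RULES = [
    (re.compile(r"^\d+(\.\d+)?$"), "CD"),  # numbers
    (re.compile(r".+ing$"), "VBG"),        # gerunds / present participles
    (re.compile(r".+ed$"), "VBD"),         # simple past
    (re.compile(r".+ly$"), "RB"),          # adverbs
    (re.compile(r"^[A-Z].*$"), "NNP"),     # capitalized words
]

DEFAULT_TAG = "NN"  # fall back to "noun" when nothing else applies


def rule_based_tag(tokens):
    """Tag each token using the lexicon first, then the ordered rules."""
    tagged = []
    for token in tokens:
        tag = LEXICON.get(token.lower())
        if tag is None:
            tag = next(
                (t for pattern, t in RULES if pattern.match(token)),
                DEFAULT_TAG,
            )
        tagged.append((token, tag))
    return tagged


tagged = rule_based_tag("I am learning NLP in Python".split())
print(tagged)
```

Every decision here can be traced to a single lexicon entry or rule, which is what makes this style easy to understand, and also why its accuracy is capped by how many rules someone is willing to write.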
Both rule-based and statistical POS tagging have their advantages and disadvantages. Rule-based taggers are simpler to implement and understand but less accurate than statistical taggers. Statistical taggers, however, are more accurate but require a large amount of training data and computational resources.
Advantages and disadvantages of the different types of Part-of-speech (POS) tagging for NLP in Python
Rule-based part-of-speech (POS) taggers and statistical POS taggers are two different approaches to POS tagging in natural language processing (NLP). Each method has its advantages and disadvantages.
Advantages of rule-based Part-of-speech (POS) tagging:
- Simple to implement and understand
- Doesn’t require a lot of computational resources or training data
- Can be easily customized to specific domains or languages
Disadvantages of rule-based Part-of-speech (POS) tagging:
- Less accurate than statistical taggers
- Limited by the quality and coverage of the rules
- Can be difficult to maintain and update
Advantages of statistical Part-of-speech (POS) tagging:
- More accurate than rule-based taggers
- Doesn’t require a lot of human-written rules
- Can learn from large amounts of training data
Disadvantages of statistical Part-of-speech (POS) tagging:
- Requires more computational resources and training data
- Can be difficult to interpret and debug
- Can be sensitive to the quality and diversity of the training data
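As a rough illustration of how a statistical tagger learns from data instead of hand-written rules, the sketch below trains a most-frequent-tag baseline on a toy corpus. The training sentences and function names are hypothetical, and this baseline ignores context entirely, unlike the HMM or CRF models mentioned earlier.

```python
from collections import Counter, defaultdict

# Toy hand-labelled training data (hypothetical, far too small for real use).
TRAINING_DATA = [
    [("I", "PRP"), ("book", "VBP"), ("flights", "NNS")],
    [("the", "DT"), ("book", "NN"), ("is", "VBZ"), ("good", "JJ")],
    [("a", "DT"), ("good", "JJ"), ("book", "NN")],
]


def train_most_frequent_tag(sentences):
    """Learn, for every word, the tag it carries most often in the data."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {
        word: tag_counts.most_common(1)[0][0]
        for word, tag_counts in counts.items()
    }


def tag(tokens, model, default="NN"):
    """Tag tokens with the learned model; unseen words get the default tag."""
    return [(tok, model.get(tok.lower(), default)) for tok in tokens]


model = train_most_frequent_tag(TRAINING_DATA)
print(tag(["the", "book", "is", "good"], model))
```

Because "book" appears twice as a noun and once as a verb, the model always tags it NN, even in "I book flights". That mistake shows both the sensitivity to training data and the reason real statistical taggers also model the surrounding words.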
In general, statistical POS taggers are recommended for most real-world use cases because they are more accurate and robust. However, a rule-based POS tagger is still useful in some cases, for example, in small or specialized domains where training data is unavailable, or for languages that are not well supported by existing statistical models.
Rule-based Part-of-speech (POS) tagging for NLP in Python code
1. NLTK Part-of-speech (POS) tagging
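One common way to perform POS tagging in Python with the NLTK library is to call the pos_tag() function, which returns tags from the Penn Treebank POS tag set. Note that pos_tag() is not strictly rule-based: by default it loads a pretrained averaged perceptron model, which is why the example downloads the averaged_perceptron_tagger resource. For example: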
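import nltk
# Download the tokenizer and tagger resources (only needed once).
# Newer NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng' instead.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
sentence = "I am learning NLP in Python"
# Split the sentence into word tokens
tokens = nltk.word_tokenize(sentence)
# Tag each token with its Penn Treebank POS tag
pos_tags = nltk.pos_tag(tokens)
print(pos_tags)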
This prints a list of tuples, each containing a word and its POS tag:
[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('in', 'IN'), ('Python', 'NNP')]
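Because the result is an ordinary list of (word, tag) tuples, it is easy to filter. As a small sketch, the snippet below reuses the output shown above and relies on the fact that Penn Treebank noun tags start with "NN" and verb tags start with "VB":

```python
# Output copied from the example above.
pos_tags = [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'),
            ('NLP', 'NNP'), ('in', 'IN'), ('Python', 'NNP')]

# Keep only the words whose tag belongs to the noun or verb family.
nouns = [word for word, tag in pos_tags if tag.startswith("NN")]
verbs = [word for word, tag in pos_tags if tag.startswith("VB")]

print(nouns)  # ['NLP', 'Python']
print(verbs)  # ['am', 'learning']
```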
It’s also possible to use other POS taggers, such as the Stanford POS Tagger or spaCy’s tagger, which can perform better but require additional setup.
NLTK POS tagger abbreviations
Here is a list of the most common Penn Treebank abbreviations returned by NLTK and their meanings.
Abbreviation | Meaning |
---|---|
CC | coordinating conjunction |
CD | cardinal number |
DT | determiner |
EX | existential there |
FW | foreign word |
IN | preposition/subordinating conjunction |
JJ | adjective (large) |
JJR | adjective, comparative (larger) |
JJS | adjective, superlative (largest) |
LS | list item marker |
MD | modal (could, will) |
NN | noun, singular (cat, tree) |
NNS | noun, plural (desks) |
NNP | proper noun, singular (Sarah) |
NNPS | proper noun, plural (Indians, Americans) |
PDT | predeterminer (all, both, half) |
POS | possessive ending (parent’s) |
PRP | personal pronoun (hers, herself, him, himself) |
PRP$ | possessive pronoun (her, his, my, our) |
RB | adverb (occasionally, swiftly) |
RBR | adverb, comparative (faster) |
RBS | adverb, superlative (fastest) |
RP | particle (about) |
TO | infinitival marker (to) |
UH | interjection (goodbye) |
VB | verb, base form (ask) |
VBG | verb, gerund or present participle (judging) |
VBD | verb, past tense (pleaded) |
VBN | verb, past participle (reunified) |
VBP | verb, present tense, not 3rd person singular (wrap) |
VBZ | verb, present tense, 3rd person singular (bases) |
WDT | wh-determiner (that, what) |
WP | wh-pronoun (who) |
WRB | wh-adverb (how) |
2. TextBlob Part-of-speech (POS) tagging
Here is an example of how to use the part-of-speech (POS) tagging functionality in the TextBlob library in Python:
from textblob import TextBlob
# Define a sentence
sentence = "I am learning NLP in Python"
# Create a TextBlob object
text_blob = TextBlob(sentence)
# Use the 'tags' property to get the POS tags
pos_tags = text_blob.tags
print(pos_tags)
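This will output a list of tuples, where each tuple contains a word and its corresponding POS tag. Which tagger produces them depends on the TextBlob version; to be explicit, pass a tagger such as PatternTagger() or NLTKTagger() through the pos_tagger argument.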
[('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'), ('NLP', 'NNP'), ('in', 'IN'), ('Python', 'NNP')]
TextBlob can also tag with a statistical POS tagger. To use the NLTK POS tagger, pass an NLTKTagger instance as the pos_tagger argument to TextBlob, like this:
from textblob import TextBlob
from textblob.taggers import NLTKTagger
# Define a sentence
sentence = "I am learning NLP in Python"
# Create a TextBlob object that explicitly uses the NLTK-backed tagger
text_blob = TextBlob(sentence, pos_tagger=NLTKTagger())
# Use the 'tags' property to get the POS tags
pos_tags = text_blob.tags
print(pos_tags)
Keep in mind that the NLTK POS tagger only works if the NLTK library is installed and its tagger model has been downloaded, for example with nltk.download('averaged_perceptron_tagger').
TextBlob is a useful library for conveniently performing everyday NLP tasks, such as POS tagging, noun phrase extraction, sentiment analysis, etc. It is built on top of NLTK and provides a simple and easy-to-use API.
Statistical Part-of-speech (POS) tagging for NLP in Python code
3. spaCy Part-of-speech (POS) tagging
Here is an example of how to use the part-of-speech (POS) tagging functionality in the spaCy library in Python:
import spacy
# Load the 'en_core_web_sm' model
nlp = spacy.load('en_core_web_sm')
# Define a sentence
sentence = "I am learning NLP in Python"
# Process the sentence using spaCy's NLP pipeline
doc = nlp(sentence)
# Iterate through the tokens and print each token's text and POS tag
for token in doc:
    print(token.text, token.pos_)
This will output the token text and the POS tag for each token in the sentence:
I PRON
am AUX
learning VERB
NLP PROPN
in ADP
Python PROPN
The spaCy library’s POS tagger is based on a statistical model trained on the OntoNotes 5 corpus, and it tags text with high accuracy. The same pipeline also provides other annotations, such as lemmas, dependency labels, and named entities.
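As a sketch of how those extra annotations can be read, the helper below collects several attributes per token. The attributes pos_ (coarse universal tag), tag_ (fine-grained tag), lemma_, and dep_ are standard spaCy token attributes; the function names token_attributes and format_rows are hypothetical, and spaCy is imported inside the function so that the module itself loads even where spaCy is not installed.

```python
def token_attributes(sentence, model_name="en_core_web_sm"):
    """Return (text, pos, tag, lemma, dep) tuples for each token.

    Assumes spaCy and the named model are installed.
    """
    import spacy  # imported lazily; requires spaCy to be installed

    nlp = spacy.load(model_name)
    doc = nlp(sentence)
    return [(t.text, t.pos_, t.tag_, t.lemma_, t.dep_) for t in doc]


def format_rows(rows):
    """Join each attribute tuple into one tab-separated line."""
    return ["\t".join(row) for row in rows]
```

With the model installed, calling format_rows(token_attributes("I am learning NLP in Python")) and printing each line gives one row per token. The exact tags and labels depend on the model version, so no output is shown here.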
Note that before running the code, you need to download the model you want to use, in this case en_core_web_sm. You can do this by running python -m spacy download en_core_web_sm on your command line (in a Jupyter notebook, prefix the command with !).
4. NLTK Averaged Perceptron Part-of-speech (POS) tagging
The Averaged Perceptron Tagger in NLTK is a statistical part-of-speech (POS) tagger that uses a machine learning algorithm called Averaged Perceptron. Here is an example of how to use it in Python:
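import nltk
# Download the tokenizer, the tagger model, and the tagset mapping (only needed once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
sentence = "I am learning NLP in Python"
# Tokenize the sentence
tokens = nltk.word_tokenize(sentence)
# pos_tag() uses the Averaged Perceptron Tagger by default;
# tagset='universal' maps its Penn Treebank tags to the universal tagset
pos_tags = nltk.pos_tag(tokens, tagset='universal')
print(pos_tags)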
This will output a list of tuples, where each tuple contains a word and its corresponding POS tag, using the Averaged Perceptron Tagger:
[('I', 'PRON'), ('am', 'VERB'), ('learning', 'VERB'), ('NLP', 'NOUN'), ('in', 'ADP'), ('Python', 'NOUN')]
You can see that the output tags differ from the first NLTK example. The tagger itself is the same; the difference comes from tagset='universal', which maps the Penn Treebank tags onto the coarser universal POS tagset.
The averaged perceptron tagger is trained on a large annotated corpus, which generally makes it more robust and accurate than hand-written rule-based taggers. It is also the model that NLTK’s pos_tag() function uses by default. The tagset argument lets you choose the set of POS tags in the output; in this case, it’s the ‘universal’ tagset, a cross-lingual tagset that is useful for many NLP tasks in Python.
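The universal tagset is coarser than the Penn Treebank set, so moving from one to the other is essentially a lookup. The sketch below uses a small hand-written mapping to show the idea. The mapping is partial and for illustration only (NLTK ships its own complete one), and the name to_universal is hypothetical.

```python
# Partial, hand-written mapping for illustration only.
PENN_TO_UNIVERSAL = {
    "PRP": "PRON",
    "VB": "VERB", "VBP": "VERB", "VBG": "VERB",
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "IN": "ADP",
    "JJ": "ADJ",
    "RB": "ADV",
    "DT": "DET",
    "CD": "NUM",
    "CC": "CONJ",
}


def to_universal(tagged, unknown="X"):
    """Replace each Penn Treebank tag with its universal counterpart."""
    return [(word, PENN_TO_UNIVERSAL.get(tag, unknown)) for word, tag in tagged]


# Penn Treebank output from the first NLTK example.
penn_tags = [('I', 'PRP'), ('am', 'VBP'), ('learning', 'VBG'),
             ('NLP', 'NNP'), ('in', 'IN'), ('Python', 'NNP')]

universal_tags = to_universal(penn_tags)
print(universal_tags)
```

Several fine-grained tags collapse into one coarse tag (for example, VBP and VBG both become VERB), which is why the universal output carries less detail but is easier to compare across languages and tools.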
It’s important to note that the Averaged Perceptron Tagger must load a pretrained model before it can tag text, which is why the model has to be downloaded first with the nltk.download() function.
Conclusion
In conclusion, part-of-speech (POS) tagging is essential in natural language processing (NLP) and can be easily implemented using Python. The process involves labelling words in a sentence with their corresponding POS tags. There are two main types of POS tagging: rule-based and statistical.
Rule-based POS taggers use a set of linguistic rules and patterns to assign POS tags to words in a sentence. They are simple to implement and understand but generally less accurate than statistical taggers. NLTK’s RegexpTagger and TextBlob’s PatternTagger are examples of this approach.
Statistical POS taggers use machine learning algorithms, such as Hidden Markov Models (HMM) or Conditional Random Fields (CRF), to predict POS tags based on the context of the words in a sentence. They are more accurate but require large amounts of training data and computational resources. The spaCy library’s POS tagger, which uses a neural network-based model trained on the OntoNotes 5 corpus, and the averaged perceptron model behind NLTK’s pos_tag() function are examples of statistical POS taggers.
Both approaches have trade-offs, but for most real-world use cases statistical POS taggers are the recommended choice because they are more accurate and robust.