Natural language processing (NLP) for Arabic text involves tokenization, stemming, lemmatization, part-of-speech tagging, and named entity recognition, among others. These tasks can be challenging due to the complex morphological structure of Arabic, which includes a rich system of prefixes, suffixes, and infixes that can change the form and meaning of words. Moreover, Arabic is a heavily inflected language, meaning the same word can have different forms depending on its syntactic role in a sentence.
To deal with these problems, researchers have developed tools and methods designed to work with Arabic text. In this guide, we will cover the most commonly used preprocessing methods for Arabic in Python so that you can start building a machine learning model for Arabic text.
Challenges of NLP tasks with Arabic text
Arabic text is in many ways different from English text.
There are a few problems to solve when working with natural language processing (NLP) tasks in Arabic:
- Orthographic variations: Arabic is usually written without short-vowel diacritics, several letters have interchangeable written forms (for example, the different forms of alef), and particles such as conjunctions, prepositions, and the definite article attach to the following word without a space. This can make it challenging to normalise and tokenize Arabic text accurately.
- Morphological complexity: Arabic has a highly inflected, root-and-pattern morphology, meaning that words can take many inflexions and affixes. This can make it difficult to accurately identify the base form of a word and its part-of-speech tag.
- Syntactic ambiguity: Because short vowels are usually left unwritten, the same written form can correspond to several words. For example, كتب can be read as "he wrote", "it was written", or "books", and a single verb can show different moods, voices, and tenses. This can make it difficult to determine the syntactic structure of a sentence.
- Dialectal variations: Arabic is spoken by more than 400 million people across a wide geographic area, and many different dialects of Arabic are spoken throughout the Middle East and North Africa. These dialects’ vocabulary, grammar, and pronunciation can be very different, making it hard to build NLP systems that work well with them.
- Limited resources: Few annotated Arabic language corpora and NLP tools are available compared to English or French. This can make it challenging to develop high-quality NLP systems for Arabic.
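The orthographic variation described in the list above is often reduced with a normalisation pass before any other step. The following is a minimal sketch using only the standard library; the function name `normalize_arabic` and the exact set of rules are assumptions chosen for illustration, not a standard. Note that these rules are lossy (for example, they merge ة and ه), so whether to apply them depends on the task.

```python
import re

# Alef variants (آ أ إ ٱ) that are commonly folded into bare alef (ا)
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625\u0671]")
TATWEEL = "\u0640"  # elongation character used for justification


def normalize_arabic(text: str) -> str:
    """Apply a small, lossy set of orthographic normalisation rules."""
    text = ALEF_VARIANTS.sub("\u0627", text)   # alef variants -> ا
    text = text.replace("\u0649", "\u064A")    # alef maqsura ى -> ي
    text = text.replace("\u0629", "\u0647")    # ta marbuta ة -> ه
    text = text.replace(TATWEEL, "")           # drop tatweel
    return text


print(normalize_arabic("إلى المدرســة"))
```

Applying the same normalisation to both training data and incoming text keeps spelling variants of one word from being treated as different tokens.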
Should we translate Arabic to English for NLP tasks?
There are several advantages to working with Arabic text natively in natural language processing (NLP) tasks as opposed to translating it first:
- Improved accuracy: Working with text in its original language can often lead to more accurate results since nuances and cultural references may be lost in translation.
- Greater efficiency: Translating a text from one language to another can be time-consuming and resource-intensive. By working with the text natively, you can save time and resources.
- Better representation of context: When you work with the original text, you preserve the context in which it was written. This is important for some NLP tasks, such as sentiment analysis and sarcasm detection.
- Access to a larger dataset: If you work natively with Arabic text, you may be able to access a larger dataset of Arabic text, which can be helpful for training machine learning models.
It’s worth noting that working with Arabic text natively can also present some challenges, such as the need to deal with right-to-left text and the presence of diacritics and special characters. However, with proper tools and resources, these challenges can be overcome, and you can reap the benefits mentioned above.
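As a rough illustration of the two challenges just mentioned, the sketch below detects right-to-left characters and strips diacritics using only Python's `unicodedata` module. The helper names `contains_rtl` and `strip_diacritics` are hypothetical. Removing every combining mark (Unicode category `Mn`) is a simplifying assumption: it covers the Arabic short-vowel marks, but it would also strip accents in other scripts.

```python
import unicodedata


def contains_rtl(text: str) -> bool:
    """Return True if any character has a right-to-left bidi class."""
    return any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in text)


def strip_diacritics(text: str) -> str:
    """Remove combining marks, which include the Arabic short-vowel diacritics."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


vocalised = "\u0645\u064F\u062F\u064E\u0631\u0651\u0650\u0633\u064E\u0629"  # fully diacritised word
print(contains_rtl(vocalised))
print(strip_diacritics(vocalised))
```

Stripping diacritics makes vocalised and unvocalised spellings of a word compare equal, at the cost of discarding information that could disambiguate it.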
NLP Preprocessing Arabic text in Python
Preprocessing is the process of transforming raw data into a more usable format. Read our article on the 14 most commonly used preprocessing techniques in English; here we will only discuss a handful of these techniques as they apply to Arabic.
Here is a list of the kinds of NLP tools and resources available for Arabic in Python:
- Tokenization: tools for splitting Arabic text into individual tokens or words.
- Stemming: stemmers designed explicitly for Arabic, which can handle the complex morphological structure of the language.
- Lemmatization: the process of reducing words to their base form, or lemma, while accounting for the part of speech and context in which the word is used.
- Part-of-speech tagging: tools for labelling words with their part of speech, such as nouns, verbs, adjectives, etc.
- Named entity recognition: tools for finding and labelling named entities such as people, organisations, and places.
- Corpus and lexical resources: annotated corpora and lexical databases that can be used for tasks like language modelling and information retrieval.
Tokenization in Arabic
Tokenization is the process of breaking a sequence of text into smaller units called tokens, such as words, phrases, symbols, and other elements. For Arabic, tokenization is more involved than splitting on whitespace, because conjunctions, prepositions, the definite article, and pronouns attach to neighbouring words. As a result, thorough Arabic tokenization involves both morphological and syntactic analysis of the text in order to identify and divide it into appropriate tokens.
Here is an example of how to tokenize Arabic text using NLTK:
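    import nltk

    # word_tokenize needs the Punkt models
    # (newer NLTK releases name this resource 'punkt_tab')
    nltk.download('punkt')

    # input Arabic text
    text = "المعالجة الطبيعية لللغة هي مجال من الذكاء الاصطناعي الذي يتعامل مع التفاعل بين الحاسوبات واللغة الطبيعية."

    # tokenize the text
    tokens = nltk.word_tokenize(text)
    print("Tokens:", tokens)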
This will output the following list of tokens:
Tokens: ['المعالجة', 'الطبيعية', 'لللغة', 'هي', 'مجال', 'من', 'الذكاء', 'الاصطناعي', 'الذي', 'يتعامل', 'مع', 'التفاعل', 'بين', 'الحاسوبات', 'واللغة', 'الطبيعية', '.']
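For cases where installing a tokenizer model is not an option, a whitespace-and-punctuation tokenizer can be sketched with a regular expression. The pattern and the function name `simple_tokenize` below are illustrative assumptions, not part of NLTK. The sketch keeps diacritics attached to their word and separates Arabic punctuation such as ، but it does not split attached particles, which still requires morphological analysis.

```python
import re

# Order matters: Arabic letters with their diacritics, then other letters,
# then digits, then any remaining single non-space character (punctuation).
TOKEN_RE = re.compile(r"[\u0621-\u0652]+|[^\W\d_]+|\d+|\S")


def simple_tokenize(text: str) -> list:
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


sentence = "ذهب أحمد إلى المدرسة، ثم عاد."
print(simple_tokenize(sentence))
```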
Stemming in Arabic
Stemming reduces words to their base form or root by removing common morphological affixes such as suffixes and prefixes. It is often used in natural language processing (NLP) to prepare text for processing and to help with tasks like information retrieval and text classification.
To perform stemming on Arabic text in Python, you can use NLTK, which ships stemmers for Arabic, alongside specialised libraries such as PyArabic and arabic_nlp for related Arabic text processing. Stemmers designed explicitly for Arabic can handle the complex morphological structure of the language.
Here is an example of how to stem Arabic text using NLTK:
    from nltk.stem.isri import ISRIStemmer

    # The ISRI stemmer is rule-based, so no extra download is needed
    stemmer = ISRIStemmer()

    # Stem the word "تعلمت" (meaning "I learned")
    stemmed_word = stemmer.stem("تعلمت")
    print(stemmed_word)  # Output: علم

    # Stem the word "تعلموا" (meaning "they learned")
    stemmed_word = stemmer.stem("تعلموا")
    print(stemmed_word)  # Output: علم

In this example, we create an instance of the ISRIStemmer class, which needs no separate download, and use the stem method to stem two different words in Arabic. The stemmer employs a set of rules to remove inflexions and affixes from words before returning the root.
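Note that this example uses the ISRI stemmer, a rule-based stemmer designed explicitly for Arabic that reduces words to their root. Other stemmers are available for Arabic, such as the Arabic Snowball stemmer (SnowballStemmer("arabic") in NLTK), which uses a different algorithm and may produce different results. The Porter stemmer, by contrast, is designed for English and is not suitable for Arabic.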
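To make the idea of affix stripping concrete, here is a toy light stemmer written from scratch. It is not the ISRI or Snowball algorithm: the prefix and suffix lists, the `min_len` threshold, and the function name `light_stem` are all assumptions chosen for illustration. Unlike a root-based stemmer, it removes at most one prefix and one suffix and leaves the rest of the word intact.

```python
# Small, illustrative affix lists; real stemmers use larger, carefully ordered rules.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ات", "ون", "ين", "ان", "ها", "ة", "ه", "ي")


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip one prefix and one suffix, keeping at least min_len letters."""
    # Try longer affixes first so that "وال" wins over "و"
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


for w in ["المدرسة", "والمعلمون", "كتاب"]:
    print(w, "->", light_stem(w))
```

The length check is a crude guard against over-stemming short words; production stemmers use more careful conditions.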
Part-of-speech tagging in Arabic
Part-of-speech tagging is used to identify the part of speech for each word in a sentence, such as nouns, verbs, adjectives, adverbs, and more. Different parts of speech have different endings in Arabic, so an accurate tagger requires knowledge of the language’s morphology.
NLTK's built-in pos_tag function only ships models for English and Russian, and its word_tokenize function has no Arabic model, so neither accepts Arabic as a language. The example below therefore swaps in Stanza, a Python library that provides a pretrained Arabic pipeline:

    import stanza

    # Download the pretrained Arabic models (only needed once)
    stanza.download('ar')

    # Build a pipeline that tokenizes, splits attached particles (mwt) and tags parts of speech
    nlp = stanza.Pipeline('ar', processors='tokenize,mwt,pos')

    sentence = "وقد تعلمت اللغة العربية في المدرسة."
    doc = nlp(sentence)

    # Collect (word, tag) pairs using Universal POS tags
    tagged_tokens = [(word.text, word.upos) for sent in doc.sentences for word in sent.words]
    print(tagged_tokens)

In this example, we first download Stanza's Arabic models and build a pipeline with the tokenize, mwt, and pos processors. The mwt processor can split attached particles into separate words before tagging, so the number of tagged words may be larger than the number of whitespace-separated tokens. We then run the pipeline on an Arabic sentence and collect a list of tuples, each consisting of a word and its part-of-speech tag.
Note that the tagger is a machine learning model trained on annotated Arabic text. So, the tags made by the tagger might not always be completely accurate, especially for words that aren’t used very often or have more than one meaning.
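Whichever tagger produces them, (token, tag) tuples are straightforward to post-process. The sketch below counts tags and filters tokens by tag; the tagged list is hand-written for illustration using Universal POS labels and is not real tagger output, and the helper name `tokens_with_tags` is hypothetical.

```python
from collections import Counter

# Hand-written (token, tag) pairs for illustration only; not real tagger output.
tagged_tokens = [
    ("تعلمت", "VERB"),
    ("اللغة", "NOUN"),
    ("العربية", "ADJ"),
    ("في", "ADP"),
    ("المدرسة", "NOUN"),
    (".", "PUNCT"),
]


def tokens_with_tags(tagged, wanted):
    """Return the tokens whose tag is in the wanted set."""
    return [token for token, tag in tagged if tag in wanted]


tag_counts = Counter(tag for _, tag in tagged_tokens)
nouns = tokens_with_tags(tagged_tokens, {"NOUN"})

print(tag_counts.most_common())
print(nouns)
```

Filtering by tag in this way is a common first step for tasks such as keyword extraction, where only nouns and adjectives are kept.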
Named entity recognition (NER) in Arabic
Named Entity Recognition (NER) identifies and classifies named entities in text, such as people, organisations, and locations. NLTK's ne_chunk function relies on a chunker and a part-of-speech tagger trained on English, so it does not give meaningful results for Arabic. The example below instead uses Stanza, assuming its pretrained Arabic NER model is available for your installed version:

    import stanza

    # Download the pretrained Arabic models (only needed once)
    stanza.download('ar')

    # Build a pipeline that tokenizes the text and recognises named entities
    nlp = stanza.Pipeline('ar', processors='tokenize,mwt,ner')

    # Illustrative sentence: "Ahmed studied Arabic at Cairo University."
    sentence = "درس أحمد اللغة العربية في جامعة القاهرة."
    doc = nlp(sentence)

    # Print each recognised entity with its type
    for ent in doc.ents:
        print(ent.text, ent.type)

In this example, we first download Stanza's Arabic models and build a pipeline with the tokenize, mwt, and ner processors. We then run the pipeline on an Arabic sentence. The recognised entities are available in doc.ents, where each entity exposes the matched text and its type, such as a person, location, or organisation label.
Note that the NER model is a machine learning model trained on annotated Arabic text. As a result, the named entities it identifies may not always be completely accurate, especially for uncommon or ambiguous words. This highlights one of the most important things to keep in mind when working with natural language processing: the model used is only as good as the data it was trained on.
Deep learning for Arabic
Word embedding is a technique in natural language processing (NLP) that represents words as numerical vectors in a continuous, low-dimensional space. Word embeddings can capture the semantic and syntactic relationships between words. This makes them useful for tasks like language modelling, information retrieval, and machine translation.
To create word embeddings for Arabic text, you can use a pre-trained word embedding model or train your own word embedding model using a large dataset of Arabic text. Many libraries and tools, like Gensim, fastText, and Flair, can be used with Python to make word embeddings.
Here is an example of how to use the fastText library to make word embeddings for Arabic text:

    import fasttext

    # input Arabic text
    text = "المعالجة الطبيعية لللغة هي مجال من الذكاء الاصطناعي الذي يتعامل مع التفاعل بين الحاسوبات واللغة الطبيعية."

    # train a word embedding model on a plain-text Arabic corpus
    # ('arabic.txt' must exist and be large enough to train on)
    model = fasttext.train_unsupervised('arabic.txt', epoch=25)

    # get one embedding vector per word in the text
    words = text.replace(".", "").split()
    word_embeddings = {word: model.get_word_vector(word) for word in words}
    print("Word embeddings:", word_embeddings)

This will output one numerical vector for each word in the input text.
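Once vectors are available, the relationship between two words is commonly measured with cosine similarity. The sketch below uses only the standard library; the three-dimensional vectors are invented numbers for illustration, not output from a trained model, and the function name `cosine_similarity` is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors
    return dot / (norm_a * norm_b)


# Invented toy vectors; real embeddings typically have 100 or more dimensions.
toy_vectors = {
    "كتاب": [0.9, 0.1, 0.3],
    "مكتبة": [0.8, 0.2, 0.4],
    "سيارة": [0.1, 0.9, 0.2],
}

print(cosine_similarity(toy_vectors["كتاب"], toy_vectors["مكتبة"]))
print(cosine_similarity(toy_vectors["كتاب"], toy_vectors["سيارة"]))
```

Values close to 1 indicate vectors pointing in nearly the same direction, which for trained embeddings usually corresponds to related words.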
Closing thoughts on NLP in Arabic
Working with other languages can be difficult as fewer tools and data sets are available than in English. However, Arabic still has quite a few specific tools you can use to process your text. What tools did you end up using? Let us know in the comments.