The rector of the Universidad Internacional Iberoamericana de México (International Iberoamerican University of Mexico, UNINI Mexico), Dr. Luis Dzul López, collaborates with the Universidad Internacional Iberoamericana (International Iberoamerican University, UNIB) on a study that presents a lemmatization algorithm for the Urdu language.
In the field of natural language processing (NLP), machine translation (MT) optimizes communication between people by bridging the language gap. In machine translation, normalization and morphological analysis are important modules for information retrieval (IR).
Stemming and lemmatization are the two techniques commonly used to find the root of a word. However, studies on IR systems for Urdu show that lemmatization is more effective than stemming, given the infixes present in Urdu words. The goal of lemmatization is to group the inflected forms of a word under a common form so that they can be analyzed as a single base term. In other words, it removes the inflectional endings of words to return them to their base form.
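To make the contrast concrete, the following is a minimal sketch of rule-based suffix stripping (stemming) next to dictionary-based lemmatization. It uses English words for readability. The suffix list, the lookup table, and the function names `stem` and `lemmatize` are illustrative assumptions, not the algorithm presented in the study.

```python
# Toy illustration only: hypothetical rules and lookup table,
# not the algorithm presented in the study.
SUFFIXES = ("ing", "es", "ed", "s")  # checked in this order

# Hypothetical lemma dictionary mapping inflected forms to base forms.
LEMMA_TABLE = {
    "mice": "mouse",
    "went": "go",
    "studies": "study",
    "running": "run",
}


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word: str) -> str:
    """Look the word up in the lemma table; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


if __name__ == "__main__":
    for w in ("studies", "running", "mice", "went"):
        print(f"{w}: stem={stem(w)}, lemma={lemmatize(w)}")
```

Suffix stripping can leave fragments that are not real words and cannot handle changes inside the word, whereas a lemmatizer maps each form to a valid base form. That limitation is one reason internal changes such as infixes are hard for purely suffix-based approaches.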
There are few studies on Urdu lemmatization, and those that exist tend to focus on rules while neglecting basic aspects such as noun identification and the handling of stop words and loanwords, among others. The aim of this research is therefore to present an improved lemmatization algorithm for Urdu based on ordinary neural network models, focusing mainly on the detection of proper names and the lemmatization of Urdu morphological, inflectional, and derivational words.
Research results
The results showed that the proposed model can address gaps in Urdu lemmatization, such as the handling of loanwords and stop words, noun identification, and Urdu words with diacritical marks. The model also efficiently handles the lemmatization of Urdu morphological, inflectional, and derivational words.
The integration of the AFED model greatly improved the system's performance, achieving accuracy, precision, recall, and F-score of 0.96, 0.95, 0.95, and 0.95, respectively.
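For readers unfamiliar with these four metrics, here is a minimal sketch of their standard definitions for a binary labelling task. The article does not say how the study computed or averaged its scores, so the function name `classification_metrics` and the counts in the usage example are hypothetical and are not data from the study.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall, and F-score from confusion counts.

    tp/fp/fn/tn are true positives, false positives, false negatives,
    and true negatives. Ratios with a zero denominator are reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-score here is the harmonic mean of precision and recall (F1).
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
    }


if __name__ == "__main__":
    # Hypothetical counts chosen only to exercise the formulas.
    print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
```

Because the F-score is the harmonic mean of precision and recall, it equals both whenever they are equal, which is consistent with the reported precision, recall, and F-score all being 0.95.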