Abstract
Roman Urdu is widely used on social media, news websites, and in text messages across the subcontinent, making it a valuable data source for social media and text analytics, particularly in the Indo-Pak context. Despite this potential, little work has been done on Roman Urdu text analytics because of several complexities: the lack of a standard lexicon, the informal nature of the text, and the scarcity of text processing tools. A Roman Urdu Part-of-Speech (POS) dataset and a robust tagger are therefore important building blocks for text analytics in Roman Urdu. In this work, we created a comprehensive, large-scale Roman Urdu POS dataset and developed a Roman Urdu POS tagger, laying the foundation for more advanced text analysis. As baselines, we used Hidden Markov Models, neural networks, state-of-the-art transformer models, and Large Language Models. We curated two distinct test datasets, one with lexical variation and one without, which allowed us to test the model’s robustness to the linguistic challenges posed by lexical variation. Our tagger achieves an accuracy of 96% on test data without lexical variation and 86% on test data with lexical variation. We also evaluated state-of-the-art Large Language Models (GPT-4o and Llama-3-8B) in zero-shot and few-shot settings; GPT-4o achieved up to 53.78% accuracy in the few-shot configuration, a substantial performance gap compared to specialized models. This work establishes a comprehensive framework for Roman Urdu POS tagging that effectively addresses the challenges of lexical variation, providing essential resources and benchmarks for advancing Roman Urdu natural language processing research.
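To make the Hidden Markov Model baseline named in the abstract concrete, the following is a minimal sketch of a bigram HMM tagger with Viterbi decoding. It is not the authors' implementation: the three training sentences, the spelling variant `kitaab`, the coarse tag labels, the add-one smoothing, and all function names are assumptions chosen for illustration only.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; the sentences and tags are illustrative only
# and are not taken from the paper's dataset or tagset.
TRAIN = [
    [("main", "PRON"), ("ghar", "NOUN"), ("ja", "VERB"), ("raha", "AUX"), ("hun", "AUX")],
    [("wo", "PRON"), ("kitab", "NOUN"), ("parh", "VERB"), ("raha", "AUX"), ("hai", "AUX")],
    [("main", "PRON"), ("kitab", "NOUN"), ("parh", "VERB"), ("raha", "AUX"), ("hun", "AUX")],
]


def train_hmm(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sent in sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (an assumed smoothing choice)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence for a list of words."""
    if not words:
        return []
    tags = sorted(emit)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot for unseen words
    # best[i][tag] = (log score of best path ending in tag, previous tag)
    best = [{
        t: (log_prob(trans["<s>"], t, n_tags) + log_prob(emit[t], words[0], n_words), None)
        for t in tags
    }]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            score, prev = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, n_tags), p) for p in tags
            )
            column[t] = (score + log_prob(emit[t], words[i], n_words), prev)
        best.append(column)
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


def accuracy(sentences, trans, emit, vocab):
    """Token-level accuracy over a list of gold-tagged sentences."""
    correct = total = 0
    for sent in sentences:
        pred = viterbi([w for w, _ in sent], trans, emit, vocab)
        correct += sum(p == t for p, (_, t) in zip(pred, sent))
        total += len(sent)
    return correct / total


trans, emit, vocab = train_hmm(TRAIN)
print(viterbi("wo ghar ja raha hai".split(), trans, emit, vocab))
print(viterbi("wo kitaab parh raha hai".split(), trans, emit, vocab))  # unseen spelling
```

In this toy setting the unseen spelling `kitaab` receives only a smoothed emission probability, so its tag is decided by the surrounding transitions. On real data, where many tokens in a sentence may be spelling variants, that context is much weaker, which is consistent with the accuracy gap between the two test sets reported above.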
| Original language | English |
|---|---|
| Journal | Language Resources and Evaluation |
| Publication status | Accepted/In press - 2025 |
| Externally published | Yes |
Keywords
- Low resource
- Part of speech
- Roman Urdu
Title
Part of speech (POS) tagging in Roman Urdu: datasets and models