TY - GEN
T1 - Detecting Cybercrimes in Accordance with Pakistani Law
T2 - Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024
AU - Ullah, Faizad
AU - Faheem, Ali
AU - Azam, Ubaid
AU - Ayub, Muhammad Sohaib
AU - Kamiran, Faisal
AU - Karim, Asim
N1 - Publisher Copyright: © 2024 ELRA Language Resource Association: CC BY-NC 4.0.
PY - 2024
Y1 - 2024
AB - Cybercrime is a serious and growing threat affecting millions of people worldwide. Detecting cybercrimes from text messages is challenging, as it requires understanding the linguistic and cultural nuances of different languages and regions. Roman Urdu is a widely used language in Pakistan and other South Asian countries; however, it lacks sufficient resources and tools for natural language processing and cybercrime detection. To address this problem, we make three main contributions in this paper. (1) We create and release CRU, a benchmark dataset for text-based cybercrime detection in Roman Urdu, which covers a number of cybercrimes as defined by the Prevention of Electronic Crimes Act (PECA) of Pakistan. This dataset is annotated by experts following a standardized procedure based on Pakistan's legal framework. (2) We perform experiments on four pre-trained language models (PLMs) for cybercrime text classification in Roman Urdu. Our results show that xlm-roberta-base is the best model for this task, achieving the highest performance on all metrics. (3) We explore the utility of prompt engineering techniques, namely prefix and cloze prompts, for enhancing the performance of PLMs for low-resource languages such as Roman Urdu. We analyze the impact of different prompt shapes and k-shot settings on the performance of xlm-roberta-base and bert-base-multilingual-cased. We find that prefix prompts are more effective than cloze prompts for Roman Urdu classification tasks, as they provide more contextually relevant completions for the models. Our work provides useful insights and resources for future research on cybercrime detection and text classification in low-resource languages.
KW - Cybercrime Detection
KW - Pakistani Cybercrimes Law
KW - Prompt-Based Classification
KW - Roman Urdu
UR - https://www.scopus.com/pages/publications/85195914770
M3 - Conference Publication
AN - SCOPUS:85195914770
T3 - 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings
SP - 4717
EP - 4728
BT - 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings
A2 - Calzolari, Nicoletta
A2 - Kan, Min-Yen
A2 - Hoste, Veronique
A2 - Lenci, Alessandro
A2 - Sakti, Sakriani
A2 - Xue, Nianwen
PB - European Language Resources Association (ELRA)
Y2 - 20 May 2024 through 25 May 2024
ER -