
Detecting abusive comments at a fine-grained level in a low-resource language

Research output: Contribution to a Journal (Peer & Non Peer), Article, peer-review

Abstract

YouTube is a video-sharing and social media platform where users create profiles and share videos for their followers to view, like, and comment on. Abusive comments on videos, or abusive replies to other comments, may be offensive and detrimental to the mental health of users on the platform. The language used in these comments is often informal and does not necessarily adhere to the formal syntactic and lexical structure of the language, which makes a rule-based system for filtering out abusive comments challenging to build. This article introduces four datasets of abusive comments in Tamil and code-mixed Tamil-English extracted from YouTube. Comment-level annotation has been carried out for each dataset by assigning polarities to the comments. We hope these datasets can be used to train effective machine learning-based comment filters for these languages, mitigating the challenges associated with rule-based systems. To establish baselines on these datasets, we carried out experiments with various machine learning classifiers and reported the results using F1-score, precision, and recall. Furthermore, we employed a t-test to analyze the statistical significance of the classifiers' results and SHAP values to analyze and explain them. The primary contribution of this paper is the construction of a publicly accessible dataset of social media messages annotated with fine-grained abusive speech labels in the low-resource Tamil language. Overall, we found that MuRIL performed well on the binary abusive comment detection task, showing the applicability of multilingual transformers for this work. Nonetheless, fine-grained annotation for fine-grained abusive comment detection (FGACD) resulted in a significantly lower number of samples per class, and on this task classical machine learning models outperformed deep learning models, which require extensive training datasets. To our knowledge, this was the first Tamil-language study on FGACD focused on diverse ethnicities. The methodology for detecting abusive messages described in this work may aid in the creation of comment filters for other under-resourced languages on social media.
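
As an illustration of the evaluation metrics named in the abstract, the following minimal sketch computes per-class precision, recall, and F1-score from gold and predicted labels, plus a macro-averaged F1. It is not the authors' evaluation code: the function names, the label names, and the toy data are hypothetical, and the abstract does not state which averaging scheme was used, so macro averaging here is an assumption.

```python
from collections import Counter


def per_class_prf(gold, pred):
    """Return {label: (precision, recall, f1)} for every label seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # predicted p, but it was not p
            fn[g] += 1          # missed an instance of g
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = per_class_prf(gold, pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


# Hypothetical toy labels, not taken from the paper's datasets.
gold = ["abusive", "abusive", "not_abusive", "not_abusive", "abusive", "not_abusive"]
pred = ["abusive", "not_abusive", "not_abusive", "abusive", "abusive", "not_abusive"]

for label, (p, r, f) in per_class_prf(gold, pred).items():
    print(f"{label}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

Macro averaging weights every class equally, which matters when fine-grained annotation leaves some classes with few samples; a weighted average would instead be dominated by the largest classes.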
Original language: English (Ireland)
Number of pages: 100006
Journal: Natural Language Processing Journal
Volume: 3
Publication status: Published - 1 Jan 2023

Authors (Note for portal: view the doc link for the full list of authors)

  • Bharathi Raja Chakravarthi, Ruba Priyadharshini, Shubanker Banerjee, Manoj Balaji Jagadeeshan, Prasanna Kumar Kumaresan, Rahul Ponnusamy, Sean Benhur, and John Philip McCrae
