Abstract
YouTube is a video-sharing and social media platform where users create profiles and share videos for their followers to view, like, and comment on. Abusive comments on videos or replies to other comments may be offensive and detrimental to the mental health of users on the platform. The language used in these comments is often informal and does not necessarily adhere to the formal syntactic and lexical structure of the language. Therefore, creating a rule-based system for filtering out abusive comments is challenging. This article introduces four datasets of abusive comments in Tamil and code-mixed Tamil-English extracted from YouTube. Comment-level annotation has been carried out for each dataset by assigning polarities to the comments. We hope these datasets can be used to train effective machine learning-based comment filters for these languages by mitigating the challenges associated with rule-based systems. To establish baselines on these datasets, we carried out experiments with various machine learning classifiers and report the results using F1-score, precision, and recall. Furthermore, we employed a t-test to analyze the statistical significance of the results generated by the machine learning classifiers, and SHAP values to analyze and explain those results. The primary contribution of this paper is the construction of a publicly accessible dataset of social media messages annotated for fine-grained abusive speech in the low-resource Tamil language. Overall, we found that MuRIL performed well on the binary abusive comment detection task, showing the applicability of multilingual transformers for this work.
Nonetheless, annotation for fine-grained abusive comment detection (FGACD) resulted in a significantly lower number of samples per class, and classical machine learning models outperformed deep learning models, which require extensive training datasets, on this task. To our knowledge, this is the first Tamil-language study on FGACD focused on diverse ethnicities. The methodology for detecting abusive messages described in this work may aid in the creation of comment filters for other under-resourced languages on social media.
| Original language | English (Ireland) |
|---|---|
| Article number | 100006 |
| Journal | Natural Language Processing Journal |
| Volume | 3 |
| Publication status | Published - 1 Jan 2023 |
Authors
- Chakravarthi, Bharathi Raja and Priyadharshini, Ruba and Banerjee, Shubanker and Jagadeeshan, Manoj Balaji and Kumaresan, Prasanna Kumar and Ponnusamy, Rahul and Benhur, Sean and McCrae, John Philip
Article title: Detecting abusive comments at a fine-grained level in a low-resource language