TY - GEN
T1 - Contrastive Learning-Enhanced BERT Models for Hate Speech Detection in Marathi and Telugu
AU - Kayande, Devendra
AU - Ponnusamy, Kishore Kumar
AU - Kumaresan, Prasanna Kumar
AU - Buitelaar, Paul
AU - Chakravarthi, Bharathi Raja
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s).
PY - 2025/7/24
Y1 - 2025/7/24
N2 - Homophobia and transphobia are pervasive issues on online platforms, manifesting as hate speech directed towards LGBTQ+ individuals. Identifying and mitigating such toxic language is crucial for creating safer online spaces, especially in low-resource Indic languages. This work focuses on detecting homophobia, transphobia, and non-anti-LGBT+ content in YouTube comments, annotated at the comment/post level, in Marathi and Telugu. We employed BERT models pre-trained and fine-tuned on Marathi and Telugu data. These models were further fine-tuned with contrastive learning objectives to enhance their discriminative power. For Marathi, the MahaBERT model combined with Supervised Contrastive Learning (SupCon) achieved an accuracy of 70.53%, a precision of 52.48%, a recall of 59.21%, and an F1-score of 54.52%. For Telugu, the TeluguBERT model with SupCon achieved superior performance, with an accuracy of 96.90%, a precision of 96.95%, a recall of 96.98%, and an F1-score of 96.96%.
AB - Homophobia and transphobia are pervasive issues on online platforms, manifesting as hate speech directed towards LGBTQ+ individuals. Identifying and mitigating such toxic language is crucial for creating safer online spaces, especially in low-resource Indic languages. This work focuses on detecting homophobia, transphobia, and non-anti-LGBT+ content in YouTube comments, annotated at the comment/post level, in Marathi and Telugu. We employed BERT models pre-trained and fine-tuned on Marathi and Telugu data. These models were further fine-tuned with contrastive learning objectives to enhance their discriminative power. For Marathi, the MahaBERT model combined with Supervised Contrastive Learning (SupCon) achieved an accuracy of 70.53%, a precision of 52.48%, a recall of 59.21%, and an F1-score of 54.52%. For Telugu, the TeluguBERT model with SupCon achieved superior performance, with an accuracy of 96.90%, a precision of 96.95%, a recall of 96.98%, and an F1-score of 96.96%.
KW - Contrastive Learning
KW - Hate Speech
KW - Homophobia
KW - Large Language Models
KW - Transphobia
UR - https://www.scopus.com/pages/publications/105013055149
U2 - 10.1145/3734947.3734950
DO - 10.1145/3734947.3734950
M3 - Conference Publication
AN - SCOPUS:105013055149
T3 - ACM International Conference Proceeding Series
SP - 48
EP - 54
BT - FIRE 2024 - Proceedings of the 16th Annual Meeting of the Forum for Information Retrieval Evaluation
A2 - Ganguly, Debasis
A2 - Sanyal, Debarshi Kumar
A2 - Majumder, Prasenjit
A2 - Majumdar, Srijoni
A2 - Gangopadhyay, Surupendu
PB - Association for Computing Machinery
T2 - 16th Annual Meeting of the Forum for Information Retrieval Evaluation, FIRE 2024
Y2 - 12 December 2024 through 15 December 2024
ER -