Developing a Multilingual Annotated Corpus of Misogyny and Aggression

Research output: Chapter in Book or Conference Publication/Proceeding › Conference Publication › peer-review

Abstract

In this paper, we discuss the development of a multilingual annotated corpus of misogyny and aggression in Indian English, Hindi, and Indian Bangla as part of a project on studying and automatically identifying misogyny and communalism on social media (the ComMA Project). The dataset is collected from comments on YouTube videos and currently contains a total of over 20,000 comments. The comments are annotated at two levels - aggression (overtly aggressive, covertly aggressive, and non-aggressive) and misogyny (gendered and non-gendered). We describe the process of data collection, the tagset used for annotation, and issues and challenges faced during the process of annotation. Finally, we discuss the results of the baseline experiments conducted to develop a classifier for misogyny in the three languages.
Original language: English (Ireland)
Title of host publication: Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying
Place of publication: Marseille, France
Publication status: Published - 1 May 2020

Authors

  • Bhattacharya, Shiladitya; Singh, Siddharth; Kumar, Ritesh; Bansal, Akanksha; Bhagat, Akash; Dawer, Yogesh; Lahiri, Bornini; Ojha, Atul Kr.
