Multi-task learning in under-resourced Dravidian languages

  • Adeep Hande
  • Siddhanth U. Hegde
  • Bharathi Raja Chakravarthi

Research output: Contribution to a Journal (Peer & Non Peer) · Article · peer-review

19 Citations (Scopus)

Abstract

It is challenging to obtain extensive annotated data for under-resourced languages, so we investigate whether it is beneficial to train models using multi-task learning. Sentiment analysis and offensive language identification share similar discourse properties. We select these two tasks because large labelled datasets of user-generated code-mixed text are lacking. This paper works with code-mixed YouTube comments in Tamil, Malayalam, and Kannada. Our framework is applicable to other sequence classification problems irrespective of the size of the datasets. Experiments show that our multi-task learning model achieves higher scores than single-task learning while reducing the time and space required to train separate models for the individual tasks. Analysis of the fine-tuned models shows that multi-task learning is preferable to single-task learning, yielding a higher weighted F1 score on all three languages. We apply two multi-task learning approaches to three Dravidian languages: Kannada, Malayalam, and Tamil. The highest scores on Kannada and Malayalam were achieved by mBERT trained with cross-entropy loss and hard parameter sharing. The best scores on Tamil were achieved by DistilBERT trained with cross-entropy loss and soft parameter sharing. The best-performing models achieved weighted F1 scores on sentiment analysis and offensive language identification, respectively, of (66.8%, 90.5%) for Kannada, (59%, 70%) for Malayalam, and (62.1%, 75.3%) for Tamil.
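
The abstract reports every result as a weighted F1 score. As a minimal sketch of how that metric is commonly defined (per-class F1 averaged with weights proportional to each class's number of gold examples), the Python below may help; it is not the authors' evaluation code, and the function name `weighted_f1` and the toy labels are hypothetical, chosen only for illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)  # number of gold examples per class
    total = len(y_true)
    score = 0.0
    for label, n_gold in support.items():
        # True positives: gold and predicted label both equal this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_gold
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        # Weight each class by its share of the gold labels.
        score += f1 * n_gold / total
    return score


# Hypothetical toy labels, not taken from the paper's datasets.
gold = ["positive", "positive", "negative", "neutral"]
pred = ["positive", "negative", "negative", "neutral"]
print(round(weighted_f1(gold, pred), 3))
```

Because each class is weighted by its support, frequent classes dominate the score, which matters when label distributions are imbalanced. In a multi-task setting the same function would be applied once per task, giving one score for sentiment analysis and one for offensive language identification.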

Original language: English
Pages (from-to): 137-165
Number of pages: 29
Journal: Journal of Data, Information and Management
Volume: 4
Issue number: 2
Publication status: Published - Jun 2022
Externally published: Yes

Keywords

  • Code-mixing
  • Dravidian languages
  • Multi-task learning
  • Offensive language identification
  • Sentiment analysis
  • Under-resourced languages
