Universal Dependency Treebanks for Low-Resource Indian Languages: The Case of Bhojpuri

    Research output: Chapter in Book or Conference Publication/Proceeding, Conference Publication, peer-review

    Abstract

    This paper presents the first dependency treebank for Bhojpuri, a resource-poor language that belongs to the Indo-Aryan language family. The objective of the Bhojpuri Treebank (BHTB) project is to create a substantial, syntactically annotated treebank that not only serves as a valuable resource for building language technology tools but also supports cross-lingual learning and typological research. Currently, the treebank consists of 4,881 tokens annotated in accordance with the Universal Dependencies (UD) annotation scheme. A Bhojpuri tagger and parser were created using a machine learning approach. The model achieves 57.49% UAS, 45.50% LAS, 79.69% UPOS accuracy, and 77.64% XPOS accuracy. The paper describes the details of the project, including a discussion of the linguistic analysis and the annotation process of the Bhojpuri UD treebank.
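
    For readers unfamiliar with the parsing metrics quoted in the abstract: UAS (unlabeled attachment score) is the percentage of tokens assigned the correct head, and LAS (labeled attachment score) is the percentage assigned both the correct head and the correct dependency label. The sketch below illustrates these definitions on an invented four-token example. The function name `attachment_scores`, the tuple representation, and the toy data are assumptions made for illustration. This is not the evaluation code used by the authors, and it assumes gold and predicted tokens are already aligned one to one.

```python
from typing import List, Tuple

# Each token is (head_index, dependency_label); head 0 denotes the root.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned gold/predicted tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must be the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    # UAS counts tokens whose predicted head matches the gold head.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts tokens whose head AND label both match.
    full_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * full_hits / n


# Toy four-token sentence, invented for illustration only.
gold = [(2, "nsubj"), (0, "root"), (4, "det"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "det"), (2, "obl")]
uas, las = attachment_scores(gold, pred)
```

    In this toy case three of four heads are correct and two of four head-label pairs are correct, so LAS can never exceed UAS; the same relationship holds for the figures reported in the abstract.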
    Original language: English (Ireland)
    Title of host publication: Proceedings of the WILDRE5--5th Workshop on Indian Language Data: Resources and Evaluation
    Place of Publication: Online
    Publication status: Published - 1 Jan 2020

    Authors (Note for portal: view the doc link for the full list of authors)

    • Ojha, Atul Kr. and Zeman, Daniel
