An Assessment of Word Separation Practices in Old Irish Text Resources and a Universal Method for Tokenising Old Irish Text

Research output: Chapter in Book or Conference Publication/Proceeding > Conference Publication > peer-review

Abstract

The quantity of Old Irish text which survives in contemporary manuscripts is relatively small by comparison with what is available for well-resourced modern languages. Moreover, as Old Irish is a historical language, no more text will ever be generated by its native speakers. This makes the text which has survived particularly valuable, and ideally, all of it would be annotated using a single, common annotation standard, thereby ensuring compatibility between text resources. At present, Old Irish text repositories separate words or sub-word morphemes in accordance with different methodologies, and each uses a different style of lexical annotation. This makes it difficult to utilise content from more than one repository in NLP applications. This paper provides an assessment of distinctions between existing annotated corpora, showing that the primary point of divergence is at the token level. For this reason, this paper also describes a new method for tokenising Old Irish text. This method can be applied even to diplomatic editions, and has already been utilised in various text resources.
Original language: English (Ireland)
Title of host publication: Proceedings of the 5th Celtic Language Technology Workshop (CLTW 5)
Publisher: International Committee on Computational Linguistics
Pages: 1-11
Publication status: Published - 1 Jan 2025

Authors (Note for portal: view the doc link for the full list of authors)

  • Adrian Doyle
  • John P. McCrae
