Abstract
The quantity of Old Irish text which survives in contemporary manuscripts is relatively small in comparison with what is available for well-resourced modern languages. Moreover, as Old Irish is a historical language, no more text will ever be generated by its native speakers. This makes the text which has survived particularly valuable, and ideally all of it would be annotated using a single, common annotation standard, thereby ensuring compatibility between text resources. At present, Old Irish text repositories separate words or sub-word morphemes in accordance with different methodologies, and each uses a different style of lexical annotation. This makes it difficult to utilise content from more than one repository in NLP applications. This paper provides an assessment of the differences between existing annotated corpora, showing that the primary point of divergence is at the token level. For this reason, this paper also describes a new method for tokenising Old Irish text. This method can be applied even to diplomatic editions, and has already been utilised in various text resources.
| Field | Value |
|---|---|
| Original language | English (Ireland) |
| Title of host publication | Proceedings of the 5th Celtic Language Technology Workshop (CLTW 5) |
| Publisher | International Committee on Computational Linguistics |
| Pages | 1-11 |
| Publication status | Published - 1 Jan 2025 |
Authors
- Adrian Doyle, John P. McCrae