Attention-Based End-to-End Automatic Speech Recognition System for Vulnerable Individuals in Tamil

  • S. Suhasini
  • B. Bharathi
  • Bharathi Raja Chakravarthi

Research output: Chapter in Book/Conference Proceeding › Chapter › peer-review

2 Citations (Scopus)

Abstract

Automatic speech recognition (ASR) is the process of converting spoken language into written text, and it is used in many settings. As daily life becomes increasingly digitized, ASR becomes a crucial tool, and it is well known to considerably improve the lives of the elderly and of people with disabilities. Mild dysarthria, or slurred speech, is common among elderly people and those who are physically or mentally challenged, and it leads to erroneous transcription. In this study, we propose a Tamil-language automatic speech recognition system for the elderly. To improve its performance on elderly speech, the ASR system must be trained on speech utterances from elderly speakers, yet no Tamil speech corpus of elderly speakers exists. We therefore recorded elderly and transgender individuals speaking Tamil in the field; the utterances were collected in open spaces such as markets, hospitals, and vegetable shops, and the corpus contains speech from men, women, and transgender people. In this research, an attention-based end-to-end paradigm is used to construct the ASR system. The proposed system comprises two key components: an acoustic model and a language model. The language model is built with a recurrent neural network architecture. The acoustic model uses an attention-based encoder-decoder architecture: the encoder combines a convolutional network with a recurrent network, and the decoder uses an attention-based gated recurrent unit. Word error rate (WER) is used to assess how well the proposed ASR system performs on elderly speech utterances. The results are compared against several pre-trained transformer models. The pre-trained XLSR models learn cross-lingual speech representations by pretraining a single model on the raw waveform of speech in multiple languages.
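The attention-based decoder described above weights the encoder's hidden states at each output step so that the GRU can focus on the relevant audio frames. The chapter does not give its exact formulation, so the following is a minimal sketch of one plausible variant, additive (Bahdanau-style) attention, with illustrative weight matrices `W_e`, `W_d`, and `v` that are assumptions, not the authors' parameters:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(enc_states, dec_state, W_e, W_d, v):
    """One step of additive (Bahdanau-style) attention.

    enc_states: (T, H) encoder hidden states over T audio frames
    dec_state:  (H,)   current decoder (GRU) hidden state
    Returns the context vector fed to the decoder and the attention weights.
    """
    # Score each encoder frame against the current decoder state.
    scores = np.tanh(enc_states @ W_e + dec_state @ W_d) @ v   # (T,)
    weights = softmax(scores)                                   # (T,), sums to 1
    context = weights @ enc_states                              # (H,) weighted sum
    return context, weights

# Toy example: 5 encoder frames, hidden size 4 (random, for illustration only).
rng = np.random.default_rng(0)
T, H = 5, 4
enc = rng.standard_normal((T, H))
dec = rng.standard_normal(H)
W_e = rng.standard_normal((H, H))
W_d = rng.standard_normal((H, H))
v = rng.standard_normal(H)

context, weights = additive_attention(enc, dec, W_e, W_d, v)
```

At each decoding step the context vector is concatenated with the previous output embedding and fed into the GRU, letting the model align output characters with variable-length audio without an explicit frame-level alignment.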
The Common Voice Tamil speech corpus is used to fine-tune the pre-trained models. According to the experiments, the proposed attention-based end-to-end model performs noticeably better than the pre-trained transformer models.
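WER, the evaluation metric used above, is the word-level Levenshtein distance between the reference transcript and the system output, normalized by the reference length. A self-contained sketch of the standard dynamic-programming computation (the example sentences are illustrative, not from the corpus):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / #reference words,
    computed as a word-level edit distance via dynamic programming."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(r)][len(h)] / len(r)

# One substitution ("sat" -> "sad") and one deletion ("down") over 4 reference words.
print(wer("the cat sat down", "the cat sad"))  # 0.5
```

Lower is better; a WER of 0.5 means half as many word errors as reference words, and dysarthric or noisy field recordings like those in this corpus typically push WER up relative to read speech.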

Original language: English
Title of host publication: Automatic Speech Recognition and Translation for Low Resource Languages
Publisher: Wiley
Pages: 15-26
Number of pages: 12
ISBN (Electronic): 9781394214624
ISBN (Print): 9781394213580
DOIs
Publication status: Published - 1 Jan 2024
Externally published: Yes

Keywords

  • Automatic speech recognition (ASR)
  • cross-lingual speech representations (XLSR)
  • encoder-decoder model
  • hidden Markov model (HMM)
  • recurrent neural network (RNN)
  • transformer model
  • word error rate (WER)
