TY - GEN
T1 - Synthetic Speaking Children - Why We Need Them and How to Make Them
AU - Ali Farooq, Muhammad
AU - Bigioi, Dan
AU - Jain, Rishabh
AU - Yao, Wang
AU - Yiwere, Mariam
AU - Corcoran, Peter
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Contemporary Human-Computer Interaction (HCI) research relies primarily on neural network models for machine vision and speech understanding of a system user. Such models require extensively annotated training datasets for optimal performance, and when building interfaces for users from a vulnerable population such as young children, GDPR introduces significant complexities in data collection, management, and processing. Motivated by the training needs of an Edge-AI smart-toy platform, this research explores the latest advances in generative neural technologies and provides a working proof-of-concept of a controllable data-generation pipeline for speech-driven facial training data at scale. In this context, we demonstrate how StyleGAN-2 can be fine-tuned to create a gender-balanced dataset of children's faces. This dataset includes a variety of controllable factors such as facial expressions, age variations, facial poses, and even speech-driven animations with realistic lip synchronization. By combining generative text-to-speech models for child voice synthesis with a 3D landmark-based talking-heads pipeline, we can generate highly realistic, entirely synthetic, talking child video clips. These video clips can provide valuable, and controllable, synthetic training data for neural network models, bridging the gap when real data is scarce or restricted due to privacy regulations.
AB - Contemporary Human-Computer Interaction (HCI) research relies primarily on neural network models for machine vision and speech understanding of a system user. Such models require extensively annotated training datasets for optimal performance, and when building interfaces for users from a vulnerable population such as young children, GDPR introduces significant complexities in data collection, management, and processing. Motivated by the training needs of an Edge-AI smart-toy platform, this research explores the latest advances in generative neural technologies and provides a working proof-of-concept of a controllable data-generation pipeline for speech-driven facial training data at scale. In this context, we demonstrate how StyleGAN-2 can be fine-tuned to create a gender-balanced dataset of children's faces. This dataset includes a variety of controllable factors such as facial expressions, age variations, facial poses, and even speech-driven animations with realistic lip synchronization. By combining generative text-to-speech models for child voice synthesis with a 3D landmark-based talking-heads pipeline, we can generate highly realistic, entirely synthetic, talking child video clips. These video clips can provide valuable, and controllable, synthetic training data for neural network models, bridging the gap when real data is scarce or restricted due to privacy regulations.
KW - Facial Image Generation
KW - Low Resource Data
KW - Synthetic Data
KW - Talking Head Generation
KW - Text to Speech Synthesis
UR - https://www.scopus.com/pages/publications/85179515129
U2 - 10.1109/SpeD59241.2023.10314943
DO - 10.1109/SpeD59241.2023.10314943
M3 - Conference Publication
AN - SCOPUS:85179515129
T3 - 2023 International Conference on Speech Technology and Human-Computer Dialogue, SpeD 2023
SP - 36
EP - 41
BT - 2023 International Conference on Speech Technology and Human-Computer Dialogue, SpeD 2023
A2 - Burileanu, Dragos
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 12th International Conference on Speech Technology and Human-Computer Dialogue, SpeD 2023
Y2 - 25 October 2023 through 27 October 2023
ER -