Speaker-independent speech animation using perceptual loss functions and synthetic data

Websdale, Danny; Taylor, Sarah; Milner, Ben

doi:10.1109/TMM.2021.3087020

Websdale, Danny, Taylor, Sarah and Milner, Ben (2021) Speaker-independent speech animation using perceptual loss functions and synthetic data. IEEE Transactions on Multimedia, 24. pp. 2539-2552. ISSN 1520-9210

Abstract

We propose a real-time, speaker-independent speech-to-facial animation system that predicts lip and jaw movements on a reference face for audio speech taken from any speaker. Our approach is motivated by two key observations: 1) Speaker-independent facial animation can be generated from phoneme labels, but doing this automatically requires a speech recogniser which, due to contextual look-ahead, introduces too much time lag. 2) Audio-driven speech animation can be performed in real time but requires large, multi-speaker audio-visual speech datasets, of which there are few. We adopt a novel three-stage training procedure that leverages the advantages of each approach. First, we train a phoneme-to-visual speech model from a large single-speaker audio-visual dataset. Next, we use this model to generate the synthetic visual component of a large multi-speaker audio dataset for which video is not available. Finally, we learn an audio-to-visual speech mapping using the synthetic visual features as the target. Furthermore, we increase the realism of the predicted facial animation by introducing two perceptually based loss functions that aim to improve mouth closures and openings. The proposed method and loss functions are evaluated objectively using mean square error, global variance and a new metric that measures the extent of mouth opening. Subjective tests show that our approach produces facial animation comparable to that produced from phoneme sequences and that improved mouth closures, particularly bilabial closures, are achieved.
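Two of the objective measures named in the abstract, mean square error and global variance, have standard definitions that can be sketched briefly. The following is a minimal illustration only, not the authors' evaluation code: it assumes visual features are stored as NumPy arrays of shape (frames, dimensions), takes global variance to be the per-dimension variance over all frames, and uses hypothetical function names. The paper's new mouth-opening metric is not defined in the abstract and is not reproduced here.

```python
import numpy as np


def mean_square_error(predicted, target):
    """Mean square error over all frames and feature dimensions.

    Both inputs are assumed to have shape (frames, dimensions).
    """
    predicted = np.asarray(predicted, dtype=float)
    target = np.asarray(target, dtype=float)
    if predicted.shape != target.shape:
        raise ValueError("predicted and target must have the same shape")
    return float(np.mean((predicted - target) ** 2))


def global_variance(features):
    """Variance of each feature dimension across all frames.

    Returns one value per dimension (population variance, ddof=0).
    """
    features = np.asarray(features, dtype=float)
    return features.var(axis=0)


# Toy data (made up for illustration): 2 frames, 2 feature dimensions.
target = np.array([[0.0, 0.0], [0.0, 0.0]])
predicted = np.array([[0.0, 0.0], [2.0, 2.0]])
mse = mean_square_error(predicted, target)
gv = global_variance(np.array([[0.0, 0.0], [2.0, 4.0]]))
```

In this kind of evaluation, a predicted trajectory whose global variance falls well below that of the reference is commonly read as a sign of over-smoothed motion, which is one reason the measure is reported alongside mean square error.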

Item Type:	Article
Uncontrolled Keywords:	BLSTM, face recognition, facial animation, hidden Markov models, speech-to-facial animation, mouth, real-time systems, speech recognition, visualization, avatars, perceptual loss functions, speaker-independent, speech animation, talking heads, recurrent neural networks, audio-visual systems, signal processing, electrical and electronic engineering, media technology, computer science applications
Faculty \ School:	Faculty of Science > School of Computing Sciences
UEA Research Groups:	Faculty of Science > Research Groups > Visual Computing and Signal Processing (former - to 2025); Faculty of Science > Research Groups > Smart Emerging Technologies (former - to 2025); Faculty of Science > Research Groups > Data Science and AI; Faculty of Science > Research Groups > Cyber Intelligence and Networks
Related URLs:	http://www.scopus.com/inward/record.url?...
Depositing User:	LivePure Connector
Date Deposited:	07 Aug 2021 00:11
Last Modified:	07 Feb 2026 14:36
URI:	https://ueaeprints.uea.ac.uk/id/eprint/81010
DOI:	10.1109/TMM.2021.3087020
