Audio-to-Visual Speech Conversion using Deep Neural Networks

Taylor, Sarah, Kato, Akihiro, Milner, Ben and Matthews, Iain (2016) Audio-to-Visual Speech Conversion using Deep Neural Networks. In: Proceedings of the Interspeech Conference 2016. International Speech Communication Association, USA, pp. 1482-1486.


We study the problem of mapping from acoustic to visual speech with the goal of generating accurate, perceptually natural speech animation automatically from an audio speech signal. We present a sliding window deep neural network that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset. Overlapping visual predictions are averaged to generate continuous, smoothly varying speech animation. We outperform a baseline HMM inversion approach in both objective and subjective evaluations and perform a thorough analysis of our results.
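The abstract's key mechanism — averaging overlapping windowed predictions into a smooth, continuous output sequence — can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name, window length, hop size, and feature dimension are all illustrative assumptions.

```python
import numpy as np

def overlap_average(window_preds, win_len, hop, n_frames, feat_dim):
    """Average overlapping windowed predictions into one continuous sequence.

    window_preds: (n_windows, win_len, feat_dim) array, one predicted
    visual-feature window per input window (parameter names are
    illustrative; the paper's exact settings are not reproduced here).
    """
    out = np.zeros((n_frames, feat_dim))
    counts = np.zeros((n_frames, 1))
    for i, pred in enumerate(window_preds):
        start = i * hop
        end = min(start + win_len, n_frames)
        # Accumulate each window's contribution and count overlaps per frame.
        out[start:end] += pred[: end - start]
        counts[start:end] += 1
    # Divide by the overlap count to obtain the per-frame average.
    return out / np.maximum(counts, 1)
```

Because each output frame is the mean of several independent window predictions, frame-to-frame discontinuities at window boundaries are smoothed away, which is the motivation the abstract gives for the averaging step.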

Item Type: Book Section
Faculty \ School: Faculty of Science
Faculty of Science > School of Computing Sciences
UEA Research Groups: Faculty of Science > Research Groups > Interactive Graphics and Audio
Faculty of Science > Research Groups > Smart Emerging Technologies
Depositing User: Pure Connector
Date Deposited: 24 Sep 2016 01:06
Last Modified: 20 Apr 2023 01:11
DOI: 10.21437/Interspeech.2016-483
