Le Cornu, Thomas and Milner, Ben P. (2017) Generating intelligible audio speech from visual speech. IEEE Transactions on Audio, Speech, and Language Processing, 25 (9). pp. 1447-1457. ISSN 1558-7916
PDF (Accepted manuscript) - Accepted Version (989kB)
Abstract
This work is concerned with generating intelligible audio speech from a video of a person talking. Regression and classification methods are first proposed to estimate static spectral envelope features from active appearance model (AAM) visual features. Two further methods are then developed to incorporate temporal information into the prediction: a feature-level method using multiple frames and a model-level method based on recurrent neural networks. Speech excitation information is not available from the visual signal, so methods to artificially generate aperiodicity and fundamental frequency are developed. These are combined within the STRAIGHT vocoder to produce a speech signal. The various systems are optimised through objective tests before subjective intelligibility tests are applied, in which human listeners achieve a word accuracy of 85% on the GRID audio-visual speech database. This compares favourably with a previous regression-based baseline system, which achieved a word accuracy of 33%.
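As a rough illustration of the model-level temporal approach described in the abstract, the sketch below maps a sequence of AAM visual feature vectors to a sequence of static spectral envelope vectors with a recurrent network. This is not the authors' code; the feature dimensions, layer sizes, use of Keras, and synthetic data are assumptions for illustration only.

```python
# Minimal sketch (assumed setup, not the paper's implementation): a recurrent
# regression model from AAM visual features to spectral envelope features.
import numpy as np
from tensorflow import keras

AAM_DIM = 44    # assumed dimensionality of the AAM visual feature vector
SPEC_DIM = 25   # assumed dimensionality of the spectral envelope features
SEQ_LEN = 50    # assumed number of video frames per training sequence

# One spectral envelope frame is predicted for each input video frame.
model = keras.Sequential([
    keras.layers.Input(shape=(SEQ_LEN, AAM_DIM)),
    keras.layers.LSTM(128, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(SPEC_DIM)),
])
model.compile(optimizer="adam", loss="mse")

# Synthetic stand-in data; in practice the inputs would be AAM features
# extracted from video and the targets spectral envelopes from parallel audio.
x = np.random.randn(32, SEQ_LEN, AAM_DIM).astype("float32")
y = np.random.randn(32, SEQ_LEN, SPEC_DIM).astype("float32")
model.fit(x, y, epochs=1, batch_size=8, verbose=0)

# The predicted envelopes would then be combined with artificially generated
# fundamental frequency and aperiodicity inside the STRAIGHT vocoder to
# synthesise the audio waveform (that step is outside this sketch).
predicted_spectra = model.predict(x, verbose=0)
print(predicted_spectra.shape)  # (32, SEQ_LEN, SPEC_DIM)
```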
| Item Type: | Article |
|---|---|
| Faculty \ School: | Faculty of Science > School of Computing Sciences |
| UEA Research Groups: | Faculty of Science > Research Groups > Interactive Graphics and Audio; Faculty of Science > Research Groups > Smart Emerging Technologies; Faculty of Science > Research Groups > Data Science and AI |
| Depositing User: | Pure Connector |
| Date Deposited: | 07 Jul 2017 05:05 |
| Last Modified: | 10 Dec 2024 01:29 |
| URI: | https://ueaeprints.uea.ac.uk/id/eprint/64052 |
| DOI: | 10.1109/TASLP.2017.2716178 |