Khan, Faheem, Milner, Ben P. and Le Cornu, Thomas (2018) Using visual speech information in masking methods for audio speaker separation. IEEE Transactions on Audio, Speech, and Language Processing, 26 (10). pp. 1742-1754. ISSN 1558-7916
Abstract
This work examines whether visual speech information can be effective within audio masking-based speaker separation to improve the quality and intelligibility of the target speech. Two visual-only methods of generating an audio mask for speaker separation are first developed. These use a deep neural network to map visual speech features to an audio feature space, from which both visually-derived binary masks and visually-derived ratio masks are estimated before application to the speech mixture. Secondly, an audio ratio masking method forms a baseline approach for speaker separation, which is then extended to exploit visual speech information to form audio-visual ratio masks. Speech quality and intelligibility tests are carried out on the visual-only, audio-only and audio-visual masking methods of speaker separation at mixing levels from -10 dB to +10 dB. These reveal substantial improvements in the target speech when applying the visual-only and audio-only masks, with the highest performance occurring when audio and visual information are combined to create the audio-visual masks.
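The two mask types named in the abstract can be made concrete with a short sketch. The following is a minimal illustration, not the authors' implementation: it computes oracle binary and ratio masks from known target and interferer signals, whereas in the paper the masks are estimated, e.g. by a DNN mapping visual speech features into the audio feature space. The function name and STFT parameters here are assumptions for illustration.

```python
# Minimal sketch of binary- and ratio-mask speaker separation.
# Oracle masks only: the paper estimates masks rather than computing
# them from the known sources as done here.
import numpy as np
from scipy.signal import stft, istft

def separate_with_masks(target, interferer, fs=16000, nperseg=512):
    """Apply oracle binary and ratio masks to a two-speaker mixture.

    `target` and `interferer` are equal-length 1-D signals.
    """
    mixture = target + interferer
    _, _, T = stft(target, fs, nperseg=nperseg)      # target spectrogram
    _, _, I = stft(interferer, fs, nperseg=nperseg)  # interferer spectrogram
    _, _, M = stft(mixture, fs, nperseg=nperseg)     # mixture spectrogram

    # Binary mask: keep a time-frequency cell only where the target
    # dominates the interferer.
    ibm = (np.abs(T) > np.abs(I)).astype(float)

    # Ratio mask: soft weighting by the target's share of the energy
    # in each cell (small epsilon avoids division by zero).
    irm = np.abs(T) ** 2 / (np.abs(T) ** 2 + np.abs(I) ** 2 + 1e-12)

    # Apply each mask to the mixture spectrogram and resynthesise.
    _, est_ibm = istft(ibm * M, fs, nperseg=nperseg)
    _, est_irm = istft(irm * M, fs, nperseg=nperseg)
    return est_ibm, est_irm
```

The binary mask makes a hard keep/discard decision per time-frequency cell, whereas the ratio mask applies a soft weighting; the paper compares visually-derived, audio-only and audio-visual variants of such masks, estimated from the mixture rather than from oracle sources.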
Item Type: | Article
---|---
Faculty \ School: | Faculty of Science > School of Computing Sciences
UEA Research Groups: | Faculty of Science > Research Groups > Interactive Graphics and Audio; Faculty of Science > Research Groups > Smart Emerging Technologies
Depositing User: | LivePure Connector
Date Deposited: | 20 Jun 2018 11:30
Last Modified: | 20 Apr 2023 23:45
URI: | https://ueaeprints.uea.ac.uk/id/eprint/67404
DOI: | 10.1109/TASLP.2018.2835719