Using visual speech information and perceptually motivated loss functions for binary mask estimation

Websdale, Danny; Milner, Ben

doi:10.21437/AVSP.2017-9

Websdale, Danny and Milner, Ben (2017) Using visual speech information and perceptually motivated loss functions for binary mask estimation. In: The 14th International Conference on Auditory-Visual Speech Processing.

Abstract

This work is concerned with using deep neural networks (DNNs) for estimating binary masks within a speech enhancement framework. We first examine the effect of supplementing the audio features used in mask estimation with visual speech information. Visual speech is known to be robust to noise, although not necessarily as discriminative as audio features, particularly at higher signal-to-noise ratios (SNRs). Furthermore, most DNN approaches to mask estimation use the cross-entropy (CE) loss function, which aims to maximise classification accuracy. We instead propose a loss function that aims to maximise the hit minus false-alarm (HIT-FA) rate of the mask, which is known to correlate more closely with speech intelligibility than classification accuracy does. We then extend this to a hybrid loss function that combines the CE and HIT-FA loss functions to balance classification accuracy against the HIT-FA rate of the resulting masks. Evaluations of the perceptually motivated loss functions are carried out using the GRID and larger RM-3000 datasets and show improvements to HIT-FA rate and ESTOI across all noises and SNRs tested. Tests also found that combining audio and visual information in a single bimodal audio-visual system gave the best performance for all measures and conditions tested.
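
To make the quantities in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. `hit_fa_rate` follows the standard definition of HIT-FA for binary masks. The function names, the soft (probability-based) surrogate, the `alpha` weighting and the form of the hybrid combination are all assumptions made for illustration; the abstract does not specify how the paper formulates its losses.

```python
import numpy as np


def hit_fa_rate(estimated_mask, ideal_mask):
    """HIT minus false-alarm rate of a binary mask.

    HIT: fraction of target-dominant units (ideal == 1) estimated as 1.
    FA:  fraction of noise-dominant units (ideal == 0) estimated as 1.
    """
    est = np.asarray(estimated_mask).astype(bool)
    ideal = np.asarray(ideal_mask).astype(bool)
    n_target = ideal.sum()
    n_noise = (~ideal).sum()
    hit = (est & ideal).sum() / n_target if n_target else 0.0
    fa = (est & ~ideal).sum() / n_noise if n_noise else 0.0
    return float(hit - fa)


def soft_hit_fa(probabilities, ideal_mask, eps=1e-8):
    """Assumed smooth surrogate: replaces hard decisions with probabilities."""
    p = np.asarray(probabilities, dtype=float)
    ideal = np.asarray(ideal_mask, dtype=float)
    hit = (p * ideal).sum() / (ideal.sum() + eps)
    fa = (p * (1.0 - ideal)).sum() / ((1.0 - ideal).sum() + eps)
    return float(hit - fa)


def hybrid_loss(probabilities, ideal_mask, alpha=0.5):
    """Assumed hybrid: weighted sum of CE and (1 - soft HIT-FA).

    alpha is a hypothetical weighting parameter, not taken from the paper.
    """
    p = np.clip(np.asarray(probabilities, dtype=float), 1e-7, 1.0 - 1e-7)
    ideal = np.asarray(ideal_mask, dtype=float)
    ce = -np.mean(ideal * np.log(p) + (1.0 - ideal) * np.log(1.0 - p))
    return float(alpha * ce + (1.0 - alpha) * (1.0 - soft_hit_fa(p, ideal)))


# Toy time-frequency mask: two target-dominant and two noise-dominant units.
ideal = np.array([[1, 1, 0, 0]])
estimate = np.array([[1, 0, 1, 0]])
score = hit_fa_rate(estimate, ideal)  # HIT = 0.5, FA = 0.5
```

In this toy case the estimate labels half of the units correctly, yet its HIT-FA rate is zero, which illustrates how the measure can differ from plain classification accuracy.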

Item Type:	Conference or Workshop Item (Paper)
Additional Information:	Cite as: Websdale, D., Milner, B. (2017) Using visual speech information and perceptually motivated loss functions for binary mask estimation. Proc. The 14th International Conference on Auditory-Visual Speech Processing, 41-46, doi: 10.21437/AVSP.2017-9
Faculty \ School:	Faculty of Science > School of Computing Sciences
UEA Research Groups:	Faculty of Science > Research Groups > Visual Computing and Signal Processing (former - to 2025); Faculty of Science > Research Groups > Smart Emerging Technologies (former - to 2025); Faculty of Science > Research Groups > Data Science and AI; Faculty of Science > Research Groups > Cyber Intelligence and Networks
Depositing User:	Pure Connector
Date Deposited:	07 Jul 2017 05:09
Last Modified:	05 Feb 2026 06:31
URI:	https://ueaeprints.uea.ac.uk/id/eprint/64060
DOI:	10.21437/AVSP.2017-9
