Audio-visual speaker separation

Khan, Faheem (2016) Audio-visual speaker separation. Doctoral thesis, University of East Anglia.

Abstract

Communication using speech is often an audio-visual experience. Listeners hear what is
being uttered by speakers and also see the corresponding facial movements and other gestures.
This thesis is an attempt to exploit this bimodal (audio-visual) nature of speech for
speaker separation. In addition to the audio speech features, visual speech features are used
to achieve the task of speaker separation. An analysis of the correlation between audio and
visual speech features is carried out first. This correlation between audio and visual features
is then used in the estimation of clean audio features from visual features using Gaussian
Mixture Models (GMMs) and Maximum a Posteriori (MAP) estimation.
For speaker separation three methods are proposed that use the estimated clean audio features.
Firstly, the estimated clean audio features are used to construct a Wiener filter to separate
the mixed speech at various signal-to-noise ratios (SNRs) into target and competing
speakers. The Wiener filter gains are modified in several ways in search of improvements in
quality and intelligibility of the extracted speech. Secondly, the estimated clean audio features
are used to develop a visually-derived binary masking method for speaker separation.
The estimated audio features are used to compute time-frequency binary masks that identify
the regions where the target speaker dominates. These regions are retained and form the
estimate of the target speaker’s speech. Experimental results comparing the visually-derived
binary masks with ideal binary masks show a useful level of accuracy. The effectiveness
of the visually-derived binary mask for speaker separation is then evaluated through
estimates of speech quality and speech intelligibility, which show substantial gains over the
original mixture. Thirdly, the estimated clean audio features and the visually-derived Wiener
filtering are used to modify the operation of an effective audio-only method of speaker separation,
namely the soft mask method, to allow visual speech information to improve the
separation task. Experimental results are presented that compare the proposed audio-visual
speaker separation with the audio-only method using both speech quality and intelligibility
metrics. Finally, a detailed comparison is made of the proposed and existing methods of
speaker separation using objective and subjective measures.
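
The two mask types named in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes that per-bin power estimates for the target and competing speakers are already available as time-frequency grids; in the thesis the target estimate is derived from visual features, a step not shown here. The function names (wiener_gain, binary_mask, apply_mask) and the toy values are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: textbook Wiener gain and binary mask on a toy
# time-frequency grid. All names and values are assumptions, not taken
# from the thesis.

def wiener_gain(target_power, interferer_power, floor=1e-12):
    """Per-bin Wiener gain: target / (target + interferer)."""
    return [
        [t / max(t + i, floor) for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(target_power, interferer_power)
    ]

def binary_mask(target_power, interferer_power):
    """Per-bin binary mask: 1.0 where the target dominates, else 0.0."""
    return [
        [1.0 if t > i else 0.0 for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(target_power, interferer_power)
    ]

def apply_mask(mask, mixture):
    """Scale each time-frequency bin of the mixture by its mask value."""
    return [
        [m * x for m, x in zip(m_row, x_row)]
        for m_row, x_row in zip(mask, mixture)
    ]

# Toy 2x2 grids (rows = time frames, columns = frequency bins).
target = [[4.0, 1.0], [0.0, 9.0]]
interferer = [[1.0, 3.0], [2.0, 1.0]]
# Assumes the two sources' powers add in the mixture.
mixture = [[t + i for t, i in zip(tr, ir)] for tr, ir in zip(target, interferer)]

soft = wiener_gain(target, interferer)
hard = binary_mask(target, interferer)
separated = apply_mask(hard, mixture)
```

The Wiener gain attenuates each bin in proportion to the estimated share of target energy, whereas the binary mask keeps or discards bins outright. This is the general distinction between the filtering and masking approaches the abstract describes.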

Item Type: Thesis (Doctoral)
Faculty \ School: Faculty of Science > School of Computing Sciences
Depositing User: Jackie Webb
Date Deposited: 02 Sep 2016 10:10
Last Modified: 02 Sep 2016 10:10
URI: https://ueaeprints.uea.ac.uk/id/eprint/59679