Decoding visemes: improving machine lip-reading

Bear, Helen L. (2016) Decoding visemes: improving machine lip-reading. Doctoral thesis, University of East Anglia.


This thesis is about improving machine lip-reading, that is, the classification of
speech from only the visual cues of a speaker. Machine lip-reading is a niche research
problem in both speech processing and computer vision.
Current challenges for machine lip-reading fall into two groups: the content of the
video, such as the rate at which a person is speaking; and the parameters of the video
recording, for example, the video resolution. We begin our work with a literature
review to understand the restrictions current technology places on machine lip-reading
recognition, and conduct an experiment into the effects of resolution. We show that high
definition video is not needed to successfully lip-read with a computer.
The term "viseme" is used in machine lip-reading to represent a visual cue or
gesture which corresponds to a subgroup of phonemes where the phonemes are
indistinguishable in the visual speech signal. Whilst a viseme is yet to be formally
defined, we use the common working definition 'a viseme is a group of phonemes
with identical appearance on the lips'. A phoneme is the smallest acoustic unit a
human can utter. Because each viseme covers more than one phoneme, mapping between
the units creates a many-to-one relationship. Many mappings have been presented,
and we conduct an experiment to determine which mapping produces the most
accurate classification. Our results show that Lee's [82] is best. Lee's classification also
outperforms machine lip-reading systems which use the popular Fisher [48]
phoneme-to-viseme map.
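
The many-to-one relationship described above can be sketched as a simple lookup table. This is a minimal illustration only: the viseme class names and phoneme groupings below are hypothetical and are not Lee's or Fisher's map. It assumes the widely cited observation that the bilabials /p/, /b/ and /m/ look alike on the lips.

```python
# Hypothetical viseme classes for illustration; NOT Lee's [82] or Fisher's [48] map.
VISEME_TO_PHONEMES = {
    "V_bilabial": ["p", "b", "m"],
    "V_labiodental": ["f", "v"],
    "V_alveolar": ["t", "d"],
    "V_open_vowel": ["ae"],
}


def invert(viseme_to_phonemes):
    """Build the many-to-one phoneme -> viseme lookup."""
    phoneme_to_viseme = {}
    for viseme, phonemes in viseme_to_phonemes.items():
        for phoneme in phonemes:
            # Each phoneme may belong to exactly one viseme class.
            if phoneme in phoneme_to_viseme:
                raise ValueError(f"phoneme {phoneme!r} assigned to two visemes")
            phoneme_to_viseme[phoneme] = viseme
    return phoneme_to_viseme


PHONEME_TO_VISEME = invert(VISEME_TO_PHONEMES)


def to_visemes(phonemes):
    """Translate a phoneme sequence into its viseme sequence."""
    return [PHONEME_TO_VISEME[p] for p in phonemes]


# "pat", "bat" and "mat" collapse to the same viseme sequence under this toy map.
for word in (["p", "ae", "t"], ["b", "ae", "t"], ["m", "ae", "t"]):
    print(word, "->", to_visemes(word))
```

Because the lookup is many-to-one it cannot be inverted without extra information, which is why decoding from visemes back to phonemes and words needs further modelling.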
Further to this, we propose three methods of deriving speaker-dependent
phoneme-to-viseme maps and compare our new approaches to Lee's. Our results show the
sensitivity of phoneme clustering and we use our new knowledge for our first suggested
augmentation to the conventional lip-reading system.
Speaker independence in machine lip-reading classification is another unsolved
obstacle. It has been observed, in the visual domain, that classifiers need training
on the test subject to achieve the best classification. Thus machine lip-reading is
highly dependent upon the speaker. Speaker independence is the opposite of this;
in other words, it is the classification of a speaker not present in the classifier's
training data. We investigate the dependence of phoneme-to-viseme maps between
speakers. Our results show there is not a high variability of visual cues, but there is
high variability in trajectory between visual cues of an individual speaker with the
same ground truth. This implies a dependency upon the number of visemes within
each set for each individual.
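
The definition of speaker independence given above (testing on a speaker absent from the training data) corresponds to a leave-one-speaker-out split. The sketch below is a generic illustration of that idea, not the evaluation protocol of the thesis; the function name and the `(speaker_id, features)` sample layout are assumptions.

```python
def leave_one_speaker_out(samples):
    """Yield (held_out_speaker, train, test) folds.

    `samples` is assumed to be a list of (speaker_id, features) tuples.
    In each fold the held-out speaker appears only in the test set.
    """
    speakers = sorted({speaker for speaker, _ in samples})
    for held_out in speakers:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


# Toy data: the feature values are placeholders.
samples = [("A", 1), ("B", 2), ("A", 3)]
for held_out, train, test in leave_one_speaker_out(samples):
    print(held_out, "train:", train, "test:", test)
```

A speaker-dependent experiment would instead split each speaker's own samples between training and test sets.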
Finally, we investigate the optimum number of visemes within a set.
We show the phoneme-to-viseme maps in the literature rarely have enough visemes
and that the optimal number, which varies by speaker, ranges from 11 to 35. The last
difficulty we address is decoding from visemes back to phonemes and into words.
Traditionally this is completed using a language model. The language model unit is
either the same as the classifier unit, e.g. visemes or phonemes, or it is
words. In a novel approach we use these optimum range viseme sets within
hierarchical training of phoneme labelled classifiers. This new method of classifier
training demonstrates a significant increase in classification with a word language

Item Type: Thesis (Doctoral)
Faculty \ School: Faculty of Science > School of Computing Sciences
Depositing User: Vailele Chittock
Date Deposited: 17 Jun 2016 09:23
Last Modified: 17 Jun 2016 09:23
