Windle, Jonathan (2025) Digital Humans: Automatic Character Animation. Doctoral thesis, University of East Anglia.
Abstract
This thesis covers topics related to the automatic generation of co-speech gesture animation. This vast field has traditionally employed rule-based, statistical, and machine learning approaches. This thesis expands on the machine learning approaches, applying new methods to co-speech gesture generation. Initially, one of the most extensive co-speech gesture datasets is examined to provide insight into gesture production and lateral symmetry in gestures. The thesis then focuses on the application of four machine learning generative modelling approaches. Each proposed method answers a specific research question while also striving for the best performance in automatic gesture animation.
First, dataset analysis shows that the common data augmentation technique of lateral mirroring is problematic; the analysis also introduces new gesture analysis methods and statistically derived gesture spaces. The effect of using multiple body-part-specific decoders is compared with that of a single decoder that predicts the whole body. This experiment finds that the arms and hands benefit while leg motion is negatively impacted. A novel style-controlled diffusion model is introduced to study the impact of long-term historical knowledge. It sheds light on the importance of historical memory: performance improves when the history is extended, producing smooth, contextually correct animation with emotive style control. Conversational speech often occurs in a dyadic setting, where the other person’s responses, such as back-channel communication, influence the interaction. A model that includes the second speaker’s speech as a feature shows minor improvements, particularly in head nods and gesture turn-taking. An experiment using large language models (LLMs) as feature extractors is performed and evaluated to determine their effectiveness in isolation and in combination with audio features. Using LLAMA2 features enables well-timed, contextually rich gestures without an audio embedding, demonstrating that LLM features contribute more to the perceived quality of the results than audio features. These findings offer valuable insights for improving automatic co-speech gesture generation.
| Item Type: | Thesis (Doctoral) | 
|---|---|
| Faculty \ School: | Faculty of Science > School of Computing Sciences | 
| Depositing User: | Chris White | 
| Date Deposited: | 20 May 2025 07:31 | 
| Last Modified: | 20 May 2025 07:31 | 
| URI: | https://ueaeprints.uea.ac.uk/id/eprint/99304 | 