Improving speaker-independent visual language identification using deep neural networks with training batch augmentation

Newman, Jacob L. (2025) Improving speaker-independent visual language identification using deep neural networks with training batch augmentation. Intelligent Systems with Applications, 26. ISSN 2667-3053

PDF (Newman_2025_IntelligentSystemsWithApplications) - Published Version. Available under a Creative Commons Attribution License.

Abstract

Visual Language Identification (VLID) is concerned with using the appearance and movement of the mouth to determine the identity of the spoken language. VLID has applications where conventional audio-based approaches are ineffective due to acoustic noise, or where an audio signal is unavailable, such as in remote surveillance. The main challenge associated with VLID is the speaker dependency of image-based visual recognition features, which bear little meaningful correspondence between speakers. In this work, we examine a novel VLID task using video of 53 individuals reciting the Universal Declaration of Human Rights in their native languages of Arabic, English or Mandarin. We describe a speaker-independent, five-fold cross-validation experiment in which the task is to discriminate the language spoken in 10 s videos of the mouth. We use the YOLO object detection algorithm to track the mouth through time, and we employ an ensemble of 3D Convolutional and Recurrent Neural Networks for the classification task. We describe a novel approach to the construction of training batches, in which samples are duplicated and then reversed in time to form a distractor class. This method encourages the neural networks to learn the discriminative temporal features of language rather than the identities of individual speakers. The maximum accuracy obtained across all three language experiments was 84.64%, demonstrating that the system can distinguish languages to a good degree from just 10 s of visual speech. A 7.77% improvement in classification accuracy was obtained using our distractor class approach compared to normal batch selection. Ensemble classification consistently outperformed the results of individual networks, increasing accuracies by up to 7.27%. In a two-language experiment intended to provide a comparison with our previous work, we observed an absolute improvement in classification accuracy of 3.6% (90.01% compared to 83.57%).
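The full implementation is in the repository linked under Code and Data Availability below. As a rough illustration of the distractor-class batch construction described in the abstract, the following is a minimal PyTorch-style sketch; the tensor layout, function name, and the choice to append the distractor as an extra label index are assumptions for illustration only, not details taken from the paper.

```python
import torch

def add_reversed_distractors(clips, labels, num_languages=3):
    """Double a training batch with time-reversed copies labelled as a distractor class.

    clips:  float tensor of shape (batch, channels, time, height, width)
    labels: long tensor of shape (batch,), values in [0, num_languages)

    Assumed layout for illustration: the distractor class is appended as the
    extra label index `num_languages`.
    """
    reversed_clips = torch.flip(clips, dims=[2])                 # reverse the time axis only
    distractor_labels = torch.full_like(labels, num_languages)   # distractor class index
    batch_clips = torch.cat([clips, reversed_clips], dim=0)
    batch_labels = torch.cat([labels, distractor_labels], dim=0)
    return batch_clips, batch_labels
```

The intuition follows the abstract: the time-reversed copies retain each speaker's appearance but have reversed temporal dynamics, so appearance alone cannot separate them from the genuine clips, pushing the networks towards the temporal features that distinguish languages rather than the identity of individual speakers.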

Item Type: Article
Additional Information: Code and Data Availability: The data used in this article are available upon reasonable request. The code used and a description of how to use it can be found at https://github.com/JNewmanUEA/VLID. Funding information: This work was originally funded by the EPSRC under EP/E028047/1.
Uncontrolled Keywords: language identification, lip reading, neural networks, time series classification, computer science (miscellaneous), artificial intelligence, signal processing, computer vision and pattern recognition, computer science applications
Faculty \ School: Faculty of Science > School of Computing Sciences
Depositing User: LivePure Connector
Date Deposited: 25 Apr 2025 17:30
Last Modified: 13 May 2025 08:30
URI: https://ueaeprints.uea.ac.uk/id/eprint/99105
DOI: 10.1016/j.iswa.2025.200517
