Multimodal Dynamic Networks for Gesture Recognition

Wu, Di and Shao, Ling (2014) Multimodal Dynamic Networks for Gesture Recognition. In: Proceedings of the 22nd ACM International Conference on Multimedia. Association for Computing Machinery (ACM), pp. 945-948. ISBN 978-1-4503-3063-3.

Multimodal input arises naturally in real-world gesture recognition applications such as sign language recognition. In this paper, we propose a novel bi-modal (audio and skeleton joints) dynamic network for gesture recognition. First, state-of-the-art dynamic Deep Belief Networks are deployed to extract high-level audio and skeletal joint representations. Then, instead of traditional late fusion, we adopt an additional perceptron layer for cross-modality learning, which takes its input from the penultimate layer of each individual network. Finally, to account for temporal dynamics, the learned shared representations are used to estimate the emission probabilities from which action sequences are inferred. In particular, we demonstrate that multimodal feature learning extracts semantically meaningful shared representations that outperform the individual modalities, and we demonstrate the efficacy of the early fusion scheme relative to traditional late fusion.
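
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`fuse`, `viterbi`), the toy feature vectors, the random weights, the uniform transition model, and the use of the fused softmax posteriors directly as emission scores are all assumptions made for illustration. The per-modality Deep Belief Networks are not modelled here; randomly generated lists stand in for their penultimate-layer outputs.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(audio_feat, skel_feat, weights, bias):
    """Hypothetical shared perceptron layer.

    Concatenates the penultimate-layer features of the two modality
    networks and maps them to a posterior over hidden states.
    """
    x = audio_feat + skel_feat  # list concatenation, not addition
    logits = [sum(w * v for w, v in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def viterbi(log_emis, log_trans, log_prior):
    """Most likely state path given per-frame log emission scores."""
    n_states = len(log_prior)
    score = [log_prior[s] + log_emis[0][s] for s in range(n_states)]
    backpointers = []
    for t in range(1, len(log_emis)):
        ptr, new_score = [], []
        for s in range(n_states):
            # Best predecessor state for landing in state s at frame t.
            best = max(range(n_states),
                       key=lambda r: score[r] + log_trans[r][s])
            ptr.append(best)
            new_score.append(score[best] + log_trans[best][s] + log_emis[t][s])
        backpointers.append(ptr)
        score = new_score
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backpointers):
        state = ptr[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    rng = random.Random(0)
    n_states, n_audio, n_skel = 3, 4, 5  # assumed toy sizes
    weights = [[rng.uniform(-1, 1) for _ in range(n_audio + n_skel)]
               for _ in range(n_states)]
    bias = [0.0] * n_states
    # Stand-ins for per-frame penultimate-layer outputs of each network.
    frames = [([rng.random() for _ in range(n_audio)],
               [rng.random() for _ in range(n_skel)]) for _ in range(6)]
    log_emis = [[math.log(p) for p in fuse(a, s, weights, bias)]
                for a, s in frames]
    uniform = math.log(1.0 / n_states)
    log_trans = [[uniform] * n_states for _ in range(n_states)]
    log_prior = [uniform] * n_states
    print(viterbi(log_emis, log_trans, log_prior))
```

The point of the sketch is the ordering of the stages: the two feature vectors are merged before classification, so the shared layer can learn cross-modal interactions, whereas late fusion would combine two separately computed posteriors. The temporal model then operates only on the fused output.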

Item Type: Book Section
Faculty \ School: Faculty of Science > School of Computing Sciences
Depositing User: Pure Connector
Date Deposited: 10 Feb 2017 02:27
Last Modified: 22 Oct 2022 00:00
DOI: 10.1145/2647868.2654969
