Deep Binary Representation Learning for Single/Cross-Modal Data Retrieval

Shen, Yuming (2018) Deep Binary Representation Learning for Single/Cross-Modal Data Retrieval. Doctoral thesis, University of East Anglia.



Data similarity search is widely regarded as a classic topic in computer vision, machine learning and data mining. Given a query, a retrieval model ranks the candidates in the database by their similarity to it, typically by combining representation learning with nearest-neighbour search. Because matching data features in Hamming space is computationally cheaper than in Euclidean space, learning to hash and binary representations are widely favoured in modern retrieval models. Recent research has turned to deep learning to formulate the hash functions, showing great potential for retrieval performance. In this thesis, we progressively extend our research topics and contributions from unsupervised single-modal deep hashing to supervised cross-modal hashing and, finally, to zero-shot hashing, addressing the following challenges in deep hashing.
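To illustrate why Hamming-space matching is cheap, the sketch below packs binary codes into integers and ranks a database by Hamming distance, which reduces to one XOR followed by a bit count per comparison. This is a generic illustration under our own assumptions, not code from the thesis; the helper names `pack_bits`, `hamming` and `rank_by_hamming` are hypothetical.

```python
from typing import List, Tuple


def pack_bits(bits: List[int]) -> int:
    """Pack a list of 0/1 bits into one int (first bit = most significant). Hypothetical helper."""
    code = 0
    for b in bits:
        code = (code << 1) | (b & 1)
    return code


def hamming(a: int, b: int) -> int:
    """Hamming distance between two packed codes: XOR, then count the differing bits."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query: int, database: List[int]) -> List[Tuple[int, int]]:
    """Return (index, distance) pairs sorted by ascending Hamming distance (ties by index)."""
    dists = [(i, hamming(query, code)) for i, code in enumerate(database)]
    return sorted(dists, key=lambda t: (t[1], t[0]))


# Toy 8-bit codes; in a real system they would come from a learned hash function.
database = [
    pack_bits([1, 0, 1, 1, 0, 0, 1, 0]),
    pack_bits([0, 1, 0, 0, 1, 1, 0, 1]),
    pack_bits([1, 0, 1, 0, 0, 0, 1, 0]),
]
query = pack_bits([1, 0, 1, 1, 0, 0, 1, 1])
print(rank_by_hamming(query, database))  # → [(0, 1), (2, 2), (1, 7)]
```

By contrast, a Euclidean comparison between real-valued features needs one subtraction, multiplication and addition per dimension in floating point, which is why compact binary codes scale better to large databases.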

First, existing unsupervised deep hashing methods still do not attain leading retrieval performance compared with shallow ones. To address this, this thesis proposes a novel unsupervised single-modal hashing model named Deep Variational Binaries (DVB). We introduce the popular conditional variational auto-encoder to formulate the encoding function. By minimizing the reconstruction error of the latent variables, the proposed model produces compact binary codes without training supervision. Experiments on benchmark datasets show that our model outperforms existing unsupervised hashing methods.

The second problem is that current cross-modal hashing methods consider only holistic image representations and fail to model descriptive sentences, which makes them ill-suited to the rich semantics of informative cross-modal data in high-quality textual-visual search tasks. To handle this problem, we propose a supervised deep cross-modal hashing model called Textual-Visual Deep Binaries (TVDB). Region-based neural networks and recurrent neural networks are used in the image encoding network to make effective use of visual information, while the text encoder is built with a convolutional neural network. We additionally introduce an efficient in-batch optimization routine to train the network parameters. The proposed model outperforms state-of-the-art methods on large-scale datasets.
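Both DVB and TVDB ultimately emit compact binary codes from real-valued network outputs. The sketch below shows one common, generic way to quantize such outputs: threshold each latent dimension at its median over the training set, so every bit is on for roughly half of the training data. This median rule is our illustrative assumption, not the binarization procedure used by DVB or TVDB; `fit_thresholds` and `binarize` are hypothetical names.

```python
from statistics import median
from typing import List


def fit_thresholds(train_latents: List[List[float]]) -> List[float]:
    """Per-dimension median of training latent vectors (hypothetical helper).

    Thresholding at the median tends to give balanced bits, so each bit
    splits the training data roughly in half.
    """
    dims = len(train_latents[0])
    return [median(vec[d] for vec in train_latents) for d in range(dims)]


def binarize(latent: List[float], thresholds: List[float]) -> List[int]:
    """One bit per latent dimension: 1 if the value exceeds its threshold, else 0."""
    return [1 if z > t else 0 for z, t in zip(latent, thresholds)]


# Toy 3-dimensional latents standing in for an encoder's real-valued outputs.
train = [[0.2, -1.0, 0.5], [0.4, -0.2, -0.5], [-0.6, 0.8, 0.1]]
thr = fit_thresholds(train)
print(thr)                               # → [0.2, -0.2, 0.1]
print(binarize([0.3, -0.5, 0.0], thr))   # → [1, 0, 0]
```

A sign function (thresholding at zero) is the other common choice; it avoids storing thresholds but can yield unbalanced bits when latent values are not centred.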

Finally, existing hashing models fail when the categories of the query data have never been seen during training. This thesis extends this scenario into a novel zero-shot cross-modal hashing task and proposes a Zero-shot Sketch-Image Hashing (ZSIH) scheme built on graph convolution and stochastic neurons. Experiments show that ZSIH significantly outperforms existing hashing algorithms on the zero-shot retrieval task. Overall, the experimental results suggest that the proposed hashing methods outperform state-of-the-art approaches in both single-modal and cross-modal data retrieval.
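ZSIH is described as using stochastic neurons. The sketch below shows the generic idea of a stochastic binary neuron: during training, each bit is sampled from a Bernoulli distribution whose probability is the sigmoid of the unit's activation; at test time, the probability is thresholded at 0.5 so the codes are deterministic. This is a plain-Python illustration under our own assumptions, not the ZSIH layer itself; gradient estimation through the sampling step (for example, a straight-through estimator) is omitted because no autodiff library is used, and the function names are hypothetical.

```python
import math
import random
from typing import List, Optional


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def stochastic_binary(activations: List[float], rng: Optional[random.Random] = None) -> List[int]:
    """Training-time bits: sample each bit from Bernoulli(sigmoid(a))."""
    rng = rng or random.Random()
    return [1 if rng.random() < sigmoid(a) else 0 for a in activations]


def deterministic_binary(activations: List[float]) -> List[int]:
    """Test-time bits: 1 if the firing probability sigmoid(a) is at least 0.5."""
    return [1 if sigmoid(a) >= 0.5 else 0 for a in activations]


rng = random.Random(0)
print(stochastic_binary([-2.0, 0.0, 3.0], rng))  # random bits; the third is usually 1
print(deterministic_binary([-2.0, 0.0, 3.0]))     # → [0, 1, 1]
```

The sampling acts as noise during training, while the deterministic rule keeps the database codes reproducible at retrieval time.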

Item Type: Thesis (Doctoral)
Faculty \ School: Faculty of Science > School of Computing Sciences
Depositing User: Stacey Armes
Date Deposited: 17 Jul 2018 15:30
Last Modified: 17 Jul 2018 15:30
