A hierarchical and regional deep learning architecture for image description generation

Kinghorn, Philip, Zhang, Li and Shao, Ling (2019) A hierarchical and regional deep learning architecture for image description generation. Pattern Recognition Letters, 119. pp. 77-85. ISSN 0167-8655

PDF (Accepted manuscript) - Accepted Version
Available under License Creative Commons Attribution Non-commercial No Derivatives.



This research proposes a distinctive deep learning network architecture for image captioning and description generation. Specifically, we propose a hierarchically trained deep network that increases the fluidity and descriptive richness of the generated image captions. The proposed network consists of an initial region proposal stage and two key stages for image description generation. The initial region proposals are produced by the Region Proposal Network from Faster R-CNN; the resulting regions of interest are then used to annotate and classify human and object attributes. The first key stage generates a detailed label description for each region of interest. The second stage uses a Recurrent Neural Network (RNN)-based encoder-decoder to translate these regional descriptions into a full image description. Notably, the proposed model labels scenes, objects, and human and object attributes simultaneously, which is achieved through multiple individually trained RNNs. The empirical results indicate that our work is comparable to existing research and outperforms state-of-the-art methods considerably when evaluated with out-of-domain images from the IAPR TC-12 dataset, especially considering that our system is not trained on images from any of the image captioning datasets. Under several well-known evaluation metrics, the proposed system achieves an improvement of ∼60% at BLEU-1 over existing methods on the IAPR TC-12 dataset. Moreover, compared with related methods, the proposed network requires substantially fewer training samples, leading to a much-reduced computational cost.
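The overall data flow described in the abstract — region proposals, per-region attribute labelling, then an encoder-decoder that fuses regional descriptions into one caption — can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: all function names are hypothetical, and the region proposer, regional labeller, and encoder-decoder (which in the paper are a Faster R-CNN RPN and individually trained RNNs) are replaced by trivial stand-ins to show only the pipeline structure.

```python
# Hypothetical sketch of the three-part pipeline from the abstract.
# Stage 0: region proposals; Stage 1: per-region label descriptions;
# Stage 2: encoder-decoder fuses regional descriptions into one caption.

def propose_regions(image):
    """Stand-in for the Faster R-CNN Region Proposal Network:
    returns bounding boxes (x, y, w, h) of likely people/objects."""
    # Fixed boxes purely for illustration; a real RPN predicts these.
    return [(10, 10, 50, 80), (70, 20, 40, 40)]

def describe_region(image, box):
    """Stage 1 stand-in: the paper uses individually trained RNNs to
    label scene, object, and human attributes for each region."""
    x, y, w, h = box
    kind = "a person" if h > w else "an object"  # toy heuristic only
    return f"{kind} at ({x}, {y})"

def encode_decode(regional_descriptions):
    """Stage 2 stand-in for the RNN encoder-decoder that translates
    regional descriptions into a fluent full-image description."""
    return "An image showing " + " and ".join(regional_descriptions) + "."

def generate_caption(image):
    regions = propose_regions(image)
    parts = [describe_region(image, box) for box in regions]
    return encode_decode(parts)

print(generate_caption(image=None))
```

The point of the sketch is the separation of concerns the abstract emphasises: because each stage is trained (or here, stubbed) independently, the regional labellers can be trained on attribute data rather than on captioning datasets, which is what allows the system to generalise to out-of-domain images.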

Item Type: Article
Uncontrolled Keywords: deep neural networks, image captioning, recurrent neural networks, region annotation, software, signal processing, computer vision and pattern recognition, artificial intelligence
Faculty \ School: Faculty of Science > School of Computing Sciences
Related URLs:
Depositing User: Pure Connector
Date Deposited: 09 Sep 2017 05:06
Last Modified: 22 Oct 2022 03:09
URI: https://ueaeprints.uea.ac.uk/id/eprint/64800
DOI: 10.1016/j.patrec.2017.09.013


