Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (9): 9-22. DOI: 10.3778/j.issn.1002-8331.2012-0539

• Hot Topics and Reviews •

Review of Deep Neural Network-Based Image Caption

XU Hao (许昊), ZHANG Kai (张凯), TIAN Yingjie (田英杰), CHONG Faguang (种法广), WANG Zichao (王子超)

  1. College of Computer Science and Technology, Shanghai University of Electric Power, Shanghai 201300, China
  2. Shanghai Electrical Research Institute, State Grid Corporation of China, Shanghai 200437, China
  • Publication date: 2021-05-01 Release date: 2021-04-29

Abstract:

The rapid development of deep learning has significantly improved the quality of image captioning. This paper reviews in detail the image captioning methods based on deep neural networks and the current state of research on them. Image captioning algorithms combine knowledge from computer vision and natural language processing to automatically generate natural language descriptions of the content detected in an image, and they are an important part of scene understanding. Image captioning models generally adopt a basic architecture composed of an encoder and a decoder. Improving the encoder or the decoder, or applying methods such as Generative Adversarial Networks (GAN), Reinforcement Learning (RL), Unsupervised Learning (UL), and Graph Convolutional Neural Networks (GCN), can effectively improve the performance of image captioning algorithms. The results, advantages, and disadvantages of the representative models of each class of methods are analyzed, the applicable public datasets are introduced, and comparative experiments are carried out on this basis. Finally, the challenges facing image captioning and directions for future work are discussed.

Key words: deep neural network, computer vision, image caption, encoder-decoder architecture, attention mechanism
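
The encoder-decoder architecture named in the abstract can be illustrated with a toy sketch. This is not a model from the survey: the weights are random rather than trained, the vocabulary and layer sizes are hypothetical, a mean-pooled linear projection stands in for a CNN encoder, and a plain RNN with greedy decoding stands in for the recurrent decoders commonly used in practice.

```python
import math
import random

# Hypothetical toy vocabulary; index 0 must be the start token.
VOCAB = ["<start>", "<end>", "a", "dog", "on", "grass"]
HIDDEN = 8  # hypothetical hidden-state size


def rand_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def encode(image_features, w_enc):
    """Encoder: mean-pool region features, then project to the hidden size."""
    n = len(image_features)
    pooled = [sum(col) / n for col in zip(*image_features)]
    return [math.tanh(x) for x in matvec(w_enc, pooled)]


def decode(state, params, max_len=10):
    """Decoder: a plain RNN that emits one word per step (greedy decoding)."""
    w_emb, w_hh, w_out = params
    token = VOCAB.index("<start>")
    caption = []
    for _ in range(max_len):
        # New state from the previous state and the previous word's embedding.
        rec = matvec(w_hh, state)
        state = [math.tanh(r + e) for r, e in zip(rec, w_emb[token])]
        scores = matvec(w_out, state)
        # Pick the highest-scoring word, never re-emitting "<start>" (index 0).
        token = max(range(1, len(VOCAB)), key=scores.__getitem__)
        if VOCAB[token] == "<end>":
            break
        caption.append(VOCAB[token])
    return caption


def caption_image(image_features, seed=0):
    """Run the encoder, then the decoder, with seeded random weights."""
    rng = random.Random(seed)
    feat_dim = len(image_features[0])
    w_enc = rand_matrix(HIDDEN, feat_dim, rng)
    params = (
        rand_matrix(len(VOCAB), HIDDEN, rng),  # word embeddings
        rand_matrix(HIDDEN, HIDDEN, rng),      # recurrent weights
        rand_matrix(len(VOCAB), HIDDEN, rng),  # output projection
    )
    return decode(encode(image_features, w_enc), params)
```

Calling `caption_image([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])` returns a list of at most ten words from the toy vocabulary. Because the weights are untrained, the words are meaningless; the sketch only shows how the image representation produced by the encoder initializes the decoder, which then generates the caption one word at a time.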