Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (20): 53-63. DOI: 10.3778/j.issn.1002-8331.2106-0368


Survey of Deep Learning Applied in Code Representation

XIE Chunli, LIANG Yao, WANG Xia   

  1. School of Computer Science and Technology, Jiangsu Normal University, Xuzhou, Jiangsu 221116, China
  • Online: 2021-10-15   Published: 2021-10-21


Abstract:

Source code representation is a key technique for converting code into numerical form. It underlies software engineering applications such as code clone detection, code recommendation, and code plagiarism detection, and it helps programmers generate or analyze code. It has become a core technology and an active research topic in software engineering. Existing methods can be classified along three dimensions. By the kind of code information they use, they are divided into text-based, syntax-based, semantics-based, and functionality-based representations. By representation granularity, they are divided into word-based, statement-based, and function-based representations, among other levels. By representation method, they are divided into statistical models, natural language based models, and deep learning based models. This paper first surveys recent research on deep learning based code representation, which maps source code into a set of continuous real-valued vectors in order to extract the intrinsic properties hidden in the code. It then summarizes, compares, and analyzes existing work in terms of representation granularity, abstraction level, representation model, and application. Finally, it summarizes the future development trends of deep learning based code representation.

Key words: deep learning, code representation, representation model, representation granularity
