Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (16): 84-93. DOI: 10.3778/j.issn.1002-8331.2110-0289

• Theory, Research and Development •

Software Defect Prediction Based on Features Fusion of Program Structure and Semantics

DONG Yukun, LI Haojie, WEI Xinxin, TANG Daolong   

  1. College of Computer Science and Technology, China University of Petroleum (East China), Qingdao, Shandong 266580, China
  • Online: 2022-08-15 Published: 2022-08-15

Abstract: As software systems grow in scale, predicting program defects quickly and efficiently has become a research hotspot. Recent studies have introduced deep learning models that use neural networks to extract code features and build classifiers for defect prediction. Existing neural networks, however, extract code features at only a single level and a single granularity, so the features are not rich enough and prediction accuracy suffers. To address this problem, a software defect prediction framework based on feature fusion is proposed. Programs are parsed into two different representations, an abstract syntax tree (AST) and a Token sequence. A tree-based convolutional neural network and a text convolutional neural network then extract the structural and semantic features of the code, respectively, and the two sets of features are fused to obtain richer code features for defect prediction. The extraction methods for the AST and the Token sequence are also improved to reduce model complexity. Public data from the PROMISE repository serve as the experimental dataset, and a softmax classifier produces the final prediction. Experimental results show that the framework achieves a higher F1-score than existing methods on the experimental dataset.

Keywords: software defect prediction, feature fusion, tree-based convolutional neural network (TBCNN), text convolutional neural network (TextCNN)
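
The last stages described in the abstract (fusing the two feature vectors, classifying with softmax, and scoring with F1) can be sketched in plain Python. This is only an illustrative sketch, not the authors' implementation. The abstract does not say how the features are fused, so concatenation is an assumption. The function names, the example vectors standing in for TBCNN and TextCNN outputs, and the weights are all hypothetical.

```python
import math


def fuse(structural, semantic):
    """Fuse two feature vectors by concatenation (an assumed fusion operator)."""
    return list(structural) + list(semantic)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, bias):
    """Apply one linear layer and softmax; return (predicted label, probabilities)."""
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    return probs.index(max(probs)), probs


def f1_score(y_true, y_pred, positive=1):
    """F1-score for the positive (defective) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical stand-ins for the outputs of the two feature extractors.
structural_features = [0.2, 0.8]      # would come from the tree-based CNN
semantic_features = [0.5, 0.1, 0.9]   # would come from the text CNN

# Hypothetical classifier parameters: 2 classes (clean, defective) x 5 features.
weights = [
    [0.3, -0.2, 0.1, 0.4, -0.5],
    [-0.1, 0.6, 0.2, -0.3, 0.7],
]
bias = [0.0, 0.1]

fused = fuse(structural_features, semantic_features)
label, probabilities = classify(fused, weights, bias)
```

In a real model the two vectors would be produced by trained networks and the linear layer would be learned jointly with them. The sketch only shows how the fused vector feeds the classifier and how the reported metric is computed.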