Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (20): 279-286. DOI: 10.3778/j.issn.1002-8331.2006-0088

• Engineering and Applications •

Cross-Project Defect Prediction Method Based on Hierarchical Data Screening

ZHAO Yu, ZHU Yi, YU Qiao, CHEN Xiaoying

  1. School of Computer Science and Technology, Jiangsu Normal University, Xuzhou, Jiangsu 221116, China
  2. School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
  • Online: 2021-10-15 Published: 2021-10-21

Abstract:

Cross-project defect prediction aims to address practical problems of traditional within-project defect prediction, such as missing historical data and the lack of training data in the early stage of a new project. However, differences in data distribution between projects and between instances reduce its prediction performance. To address this problem, a cross-project defect prediction method based on hierarchical data screening is proposed. The method divides the screening of training data into project-level screening and instance-level screening. First, the candidate project set whose data distribution is closest to that of the target project is selected from the source datasets. Second, a training dataset made up of instances with high similarity to the instances in the target project is selected from the candidate project set. Finally, a Naive Bayes model is trained on the training dataset. Comparative experiments are conducted on the PROMISE dataset. The results show that the proposed hierarchical data screening method outperforms within-project defect prediction and effectively reduces the difference between the training data and the target project data.

Key words: cross-project defect prediction, hierarchical data screening, Naive Bayes model
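
The two-stage screening described in the abstract could be sketched roughly as follows. The abstract does not state how distribution closeness or instance similarity is measured, so this sketch assumes Euclidean distance between per-feature mean vectors at the project level and a nearest-neighbour filter at the instance level. All function names, the parameters `k` and `n_neighbors`, and the toy data are hypothetical and are not taken from the paper. The classifier is a plain Gaussian Naive Bayes written with the standard library only.

```python
import math
from statistics import mean, pvariance


def feature_means(rows):
    """Column-wise mean of a list of feature vectors."""
    return [mean(col) for col in zip(*rows)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_projects(sources, target_X, k):
    """Project-level screening (assumed criterion): keep the k source projects
    whose feature-mean vector is closest to the target project's."""
    t_mean = feature_means(target_X)
    ranked = sorted(
        sources, key=lambda name: euclidean(feature_means(sources[name][0]), t_mean)
    )
    return ranked[:k]


def select_instances(cand_X, cand_y, target_X, n_neighbors):
    """Instance-level screening (assumed criterion): for every target instance,
    keep its n_neighbors nearest candidate instances."""
    chosen = set()
    for t in target_X:
        order = sorted(range(len(cand_X)), key=lambda i: euclidean(cand_X[i], t))
        chosen.update(order[:n_neighbors])
    idx = sorted(chosen)
    return [cand_X[i] for i in idx], [cand_y[i] for i in idx]


def train_gaussian_nb(X, y, var_floor=1e-6):
    """Fit per-class prior, feature means and (floored) feature variances."""
    model = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        cols = list(zip(*rows))
        model[label] = (
            len(rows) / len(X),
            [mean(c) for c in cols],
            [pvariance(c) + var_floor for c in cols],
        )
    return model


def predict(model, x):
    """Return the label with the highest Gaussian log-posterior."""
    def log_post(prior, mus, variances):
        total = math.log(prior)
        for v, mu, var in zip(x, mus, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        return total
    return max(model, key=lambda lab: log_post(*model[lab]))


def hierarchical_screening(sources, target_X, k=1, n_neighbors=2):
    """Run both screening layers, then train Naive Bayes on what remains."""
    names = select_projects(sources, target_X, k)
    cand_X = [x for n in names for x in sources[n][0]]
    cand_y = [lab for n in names for lab in sources[n][1]]
    train_X, train_y = select_instances(cand_X, cand_y, target_X, n_neighbors)
    return names, train_gaussian_nb(train_X, train_y)


# Made-up toy data: project name -> (feature vectors, defect labels).
toy_sources = {
    "A": ([[1, 1], [2, 2], [8, 9], [9, 8]], [0, 0, 1, 1]),
    "B": ([[100, 100], [110, 90], [300, 310], [290, 300]], [0, 0, 1, 1]),
}
toy_target = [[1.5, 1.2], [8.5, 8.8]]  # unlabeled target-project instances
kept, nb_model = hierarchical_screening(toy_sources, toy_target)
predictions = [predict(nb_model, x) for x in toy_target]
```

In this sketch the target project contributes only unlabeled feature vectors, which matches the cross-project setting where the new project has no defect history. Other distribution measures or similarity functions could be substituted in `select_projects` and `select_instances` without changing the overall flow.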