Cross Project Defect Prediction Method Based on Hierarchical Data Screening

doi:10.3778/j.issn.1002-8331.2006-0088

Abstract

Cross-project defect prediction addresses two practical problems of traditional within-project defect prediction: missing historical data and the lack of training data in the early stage of a new project. However, differences in data distribution between projects and between instances reduce the performance of cross-project defect prediction. To address this problem, a cross-project defect prediction method based on hierarchical data screening is proposed. The method divides the screening of training data into project-level screening and instance-level screening. First, the candidate project set whose data distribution is closest to that of the target project is selected from the source dataset. Second, the instances most similar to those in the target project are selected from the candidate project set to form the training dataset. Finally, a Naive Bayes model is trained on this training dataset. Experiments are conducted on the PROMISE dataset. The results show that the proposed hierarchical data screening method outperforms within-project defect prediction and effectively reduces the difference between the training data and the target project data.

Key words: cross-project defect prediction, hierarchical data screening, Naive Bayes model
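
The two screening levels and the final classifier described in the abstract can be illustrated with a minimal sketch. The abstract does not state which similarity measures the method uses, so everything concrete here is an assumption made for illustration: projects are compared by the Euclidean distance between their per-feature mean and standard deviation vectors, instances are compared by Euclidean distance, and the names `project_profile`, `select_projects`, `select_instances`, `hierarchical_screening_predict`, and the parameters `k` and `n_neighbors` are hypothetical. The small Gaussian Naive Bayes class is a generic textbook version, not necessarily the variant used in the paper.

```python
import numpy as np


def project_profile(X):
    """Summarize a project's feature distribution as per-feature mean and std."""
    return np.concatenate([X.mean(axis=0), X.std(axis=0)])


def select_projects(sources, X_target, k=2):
    """Project-level screening: keep the k source projects whose profile is
    closest to the target's profile (Euclidean distance is an assumption)."""
    t = project_profile(X_target)
    dists = [np.linalg.norm(project_profile(X) - t) for X, _ in sources]
    keep = np.argsort(dists)[:k]
    return [sources[i] for i in keep]


def select_instances(candidates, X_target, n_neighbors=3):
    """Instance-level screening: for each target instance, keep its nearest
    candidate instances (Euclidean distance is an assumption)."""
    X = np.vstack([X for X, _ in candidates])
    y = np.concatenate([y for _, y in candidates])
    # Pairwise distances, shape (n_target, n_candidates).
    d = np.linalg.norm(X_target[:, None, :] - X[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :n_neighbors]
    idx = np.unique(nearest)  # union of all selected candidate indices
    return X[idx], y[idx]


class GaussianNB:
    """Minimal Gaussian Naive Bayes classifier."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mean_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # A small constant keeps every variance strictly positive.
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + 1e-9
        self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        return self

    def predict(self, X):
        # Log-likelihood of each row under each class, shape (n, n_classes).
        diff = X[:, None, :] - self.mean_[None, :, :]
        ll = -0.5 * (np.log(2 * np.pi * self.var_)[None] + diff**2 / self.var_[None]).sum(axis=2)
        return self.classes_[np.argmax(ll + self.log_prior_, axis=1)]


def hierarchical_screening_predict(sources, X_target, k=2, n_neighbors=3):
    """Project-level screening, then instance-level screening, then Naive Bayes."""
    candidates = select_projects(sources, X_target, k)
    X_train, y_train = select_instances(candidates, X_target, n_neighbors)
    return GaussianNB().fit(X_train, y_train).predict(X_target)
```

Here `sources` is assumed to be a list of `(X, y)` pairs, one per source project, where `X` is a metric matrix and `y` holds the defect labels; `X_target` is the unlabeled metric matrix of the target project. Screening at the project level first keeps the pairwise distance computation at the instance level restricted to projects that already resemble the target.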

ZHAO Yu, ZHU Yi, YU Qiao, CHEN Xiaoying. Cross Project Defect Prediction Method Based on Hierarchical Data Screening[J]. Computer Engineering and Applications, 2021, 57(20): 279-286.
