Configuration Type Identification Integrating Multi-Granularity Code Features and Isolation Forest Algorithm

doi:10.3778/j.issn.1002-8331.2408-0377

Abstract

The widespread adoption of the “high cohesion, low coupling” design principle means that source code usually contains special types dedicated to managing configuration options or configuration methods, called configuration types. Configuration types help researchers understand configuration mechanisms from both the attribute and the behavioral perspective, and they supply configuration error-handling techniques with the necessary option sets and option data-flow information. However, configuration types remain under-studied, and identifying them still relies on manual search. This paper proposes a configuration type identification method that integrates multi-granularity code features with the isolation forest algorithm. First, a configuration type dataset is manually constructed from ten representative open-source software projects; an empirical study of the distribution and classification of configuration types, and of the factors that influence their identification, yields nine findings that guide identification. Second, based on these findings, four type-level coarse-grained features and three method-level fine-grained features are selected, covering the lexical, structural, semantic, and syntactic information of code, and a quantification algorithm is designed for each feature. Finally, because configuration types suffer from class imbalance, identification is recast as an anomaly detection problem: the isolation forest algorithm recommends configuration types, and heuristic rules reduce the number of false positives. Experiments on five evaluation software projects show that the method identifies the configuration types of every project, with a mean average precision of 0.86 and an average time overhead of 21 minutes, indicating that it is preliminarily capable of replacing manual identification.

Keywords: software configuration, configuration type identification, empirical research, multi-granularity code features, isolation forest, configuration methods
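To make the anomaly-detection framing in the abstract concrete, the following is a minimal pure-Python sketch of isolation-forest ranking followed by an average-precision check. It is an illustration under stated assumptions, not the authors' implementation: the class names, the two feature columns (a share of configuration-like identifiers and a share of accessor methods), and all feature values are invented here, whereas the paper uses seven features whose quantification algorithms are not reproduced in the abstract. The heuristic false-positive rules are also omitted.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Expected path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random threshold."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # simplification: stop if the chosen feature cannot be split
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(tree, point, depth=0):
    """Depth at which the point lands, plus a correction for unsplit leaves."""
    if tree[0] == "leaf":
        return depth + c_factor(tree[1])
    _, dim, split, left, right = tree
    return path_length(left if point[dim] < split else right, point, depth + 1)


def anomaly_scores(points, n_trees=100, sample_size=256, seed=0):
    """One score in (0, 1] per point; higher means easier to isolate.

    Assumes at least two points so that the normalizer is non-zero.
    """
    rng = random.Random(seed)
    psi = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(points, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(psi)
    return [2.0 ** (-sum(path_length(t, p) for t in trees) / (n_trees * norm))
            for p in points]


def average_precision(ranked_names, relevant):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, name in enumerate(ranked_names, start=1):
        if name in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


# Hypothetical classes with made-up (config_term_ratio, accessor_ratio) features.
classes = {
    "AppConfig": (0.80, 0.90), "ServerSettings": (0.90, 0.75),
    "OrderService": (0.02, 0.10), "UserDao": (0.05, 0.15),
    "HttpClient": (0.03, 0.08), "CacheManager": (0.08, 0.12),
    "JsonParser": (0.04, 0.20), "RetryPolicy": (0.06, 0.05),
    "EventBus": (0.10, 0.18), "FileWatcher": (0.07, 0.09),
    "TaskQueue": (0.12, 0.14), "MetricsSink": (0.09, 0.16),
}
names = list(classes)
scores = anomaly_scores(list(classes.values()))

# Recommend types in descending order of anomaly score.
ranked = [name for _, name in sorted(zip(scores, names), reverse=True)]
print(ranked[:3])
print(average_precision(ranked, {"AppConfig", "ServerSettings"}))
```

Because the two invented configuration classes sit far from the dense cluster of ordinary classes, random splits tend to isolate them after very few cuts, so they are expected to head the recommendation list. Averaging `average_precision` over several projects gives a mean average precision of the kind reported in the abstract.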

LIU Yuan, LIU Dawei, ZHANG Yuxiu, WU Minglei. Configuration Type Identification Integrating Multi-Granularity Code Features and Isolation Forest Algorithm[J]. Computer Engineering and Applications, 2025, 61(13): 185-199.
