Computer Engineering and Applications, 2025, Vol. 61, Issue (13): 185-199. DOI: 10.3778/j.issn.1002-8331.2408-0377

• Theory, Research and Development •

Configuration Type Identification Integrating Multi-Granularity Code Features and Isolation Forest Algorithm

LIU Yuan, LIU Dawei, ZHANG Yuxiu, WU Minglei

  1. School of Computer Science and Technology, Shandong Technology and Business University, Yantai, Shandong 264005, China
  • Online: 2025-07-01 Published: 2025-06-30

Abstract: The widespread adoption of the “high cohesion, low coupling” design principle means that source code often contains special types dedicated to managing configuration options or configuration methods, called configuration types. Configuration types help researchers understand configuration mechanisms from both the attribute and the behavioral perspective, and they supply configuration error-handling techniques with the necessary option set and option data-flow information. However, configuration types remain insufficiently studied, and their identification still relies on manual search. To address this, a configuration type identification method that integrates multi-granularity code features and the isolation forest algorithm is proposed. First, a configuration type dataset is manually constructed from ten representative open-source software projects. An empirical study of the distribution and classification of configuration types and of the factors that influence their identification yields nine findings that guide the identification. Then, based on these findings, four type-level coarse-grained features and three method-level fine-grained features covering lexical, structural, semantic, and syntactic information of the code are selected, and a quantification algorithm is designed for each feature. Finally, because configuration types suffer from an imbalanced class distribution, the identification problem is recast as an anomaly detection problem: the isolation forest algorithm is used to recommend configuration types, and heuristic rules are designed to reduce the number of false positives. Experimental results on five evaluation software projects show that the proposed method identifies the configuration types in every project, with a mean average precision of 0.86 and an average time overhead of 21 minutes, indicating that it is preliminarily capable of replacing manual identification.

Key words: software configuration, configuration type identification, empirical research, multi-granularity code features, isolation forest, configuration methods
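
To make the anomaly-detection framing in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes each type has already been reduced to a numeric feature vector; the two features used here (share of configuration-like identifiers, number of getter/setter methods), the type names, and the toy data are all hypothetical. The paper's seven features and its heuristic rules are not modeled. A small isolation forest is written with the standard library only, types are ranked by anomaly score, and the ranking is evaluated with average precision (mean average precision is the mean of this value over projects).

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return ("leaf", len(rows))
    dim = rng.randrange(len(rows[0]))
    lo = min(r[dim] for r in rows)
    hi = max(r[dim] for r in rows)
    if lo == hi:  # simplification: stop if the chosen feature is constant here
        return ("leaf", len(rows))
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[dim] < split]
    right = [r for r in rows if r[dim] >= split]
    return ("node", dim, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(tree, x, depth=0):
    """Depth at which x lands, plus the expected extra depth of an unsplit leaf."""
    if tree[0] == "leaf":
        return depth + c(tree[1])
    _, dim, split, left, right = tree
    return path_length(left if x[dim] < split else right, x, depth + 1)

def anomaly_scores(rows, n_trees=100, seed=0):
    """Isolation forest score in (0, 1]; higher means easier to isolate.

    Simplification: every tree is built on the full data set (no sub-sampling).
    """
    rng = random.Random(seed)
    max_depth = math.ceil(math.log2(len(rows)))
    forest = [build_tree(rows, 0, max_depth, rng) for _ in range(n_trees)]
    norm = c(len(rows))
    return [2.0 ** (-sum(path_length(t, x) for t in forest) / (n_trees * norm))
            for x in rows]

def average_precision(ranked_labels):
    """Mean of precision@k over the ranks k that hold a true configuration type."""
    hits, total = 0, 0.0
    for k, is_relevant in enumerate(ranked_labels, start=1):
        if is_relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0

# Hypothetical project: 30 ordinary types and 2 configuration types.
names = [f"Type{i}" for i in range(30)] + ["AppConfig", "Settings"]
features = [(0.01 * (i % 6), float(i // 6)) for i in range(30)]
features += [(0.90, 20.0), (0.85, 18.0)]
labels = [False] * 30 + [True] * 2

scores = anomaly_scores(features)
order = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
ranked_names = [names[i] for i in order]
ap = average_precision([labels[i] for i in order])
print(ranked_names[:3], round(ap, 2))
```

Treating configuration types as anomalies suits the imbalance the abstract describes: the rare types whose feature vectors sit far from the bulk are isolated after few random splits, so they receive high scores and appear at the top of the recommendation list without any labeled training data.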