Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (22): 166-172. DOI: 10.3778/j.issn.1002-8331.1908-0425

Random Forest Optimization Method Based on Cluster Undersampling Strategy

LUO Jigen, DU Jianqiang, NIE Bin, LI Huan, NIE Jianhua, CHEN Yufeng   

  1. School of Computer Science, Jiangxi University of Traditional Chinese Medicine, Nanchang 330004, China
  2. School of Chinese Medicine, Jiangxi University of Traditional Chinese Medicine, Nanchang 330004, China
  • Online: 2020-11-15 Published: 2020-11-13

一种聚类欠采样策略的随机森林优化方法

罗计根,杜建强,聂斌,李欢,聂建华,陈裕凤   

  1. 江西中医药大学 计算机学院,南昌 330004
  2. 江西中医药大学 中医学院,南昌 330004

Abstract:

To address the degradation of random forest classification caused by between-class imbalance and within-class irregularity in the sample set, this paper proposes a random forest optimization method based on a cluster undersampling strategy. The method clusters the majority-class samples of the original data into as many sub-clusters as there are minority-class samples. One sample is drawn at random, with replacement, from each sub-cluster and merged with the minority-class samples to form a balanced sample set. This balanced set is then sampled randomly with replacement to form the training set of a single decision tree, and the tree is built. Samples selected in neither of the two sampling steps serve as out-of-bag data for model testing. Repeating this process multiple times yields a random forest. Experiments on 10 imbalanced data sets show that the method outperforms the traditional random forest in both classification ability and stability on these data sets.
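
The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the clustering algorithm, so k-means is assumed here; clustering is assumed to be done once and reused for every tree; the out-of-bag set is taken to be every original sample absent from a tree's training set; and the function names `fit_cluster_undersampled_forest` and `predict_majority_vote`, along with all parameters, are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier


def fit_cluster_undersampled_forest(X, y, minority_label, n_trees=25, seed=0):
    """Return a list of (tree, oob_indices) pairs (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)

    # Step 1: split the majority class into as many sub-clusters as there
    # are minority samples. k-means is an assumption; the abstract does not
    # say which clustering algorithm is used.
    km = KMeans(n_clusters=len(min_idx), n_init=10, random_state=seed)
    cluster_of = km.fit_predict(X[maj_idx])
    clusters = [maj_idx[cluster_of == c] for c in np.unique(cluster_of)]

    forest = []
    for _ in range(n_trees):
        # Step 2: draw one majority sample per sub-cluster and merge the
        # draws with all minority samples to get a balanced index set.
        picked = np.array([rng.choice(members) for members in clusters])
        balanced = np.concatenate([picked, min_idx])

        # Step 3: bootstrap (sample with replacement) from the balanced set
        # and grow a single decision tree on the result.
        boot = rng.choice(balanced, size=len(balanced), replace=True)
        tree = DecisionTreeClassifier(
            max_features="sqrt",
            random_state=int(rng.integers(2**31 - 1)),
        )
        tree.fit(X[boot], y[boot])

        # Step 4: samples chosen in neither sampling step are treated as
        # out-of-bag data (one interpretation of the abstract).
        oob = np.setdiff1d(np.arange(len(y)), boot)
        forest.append((tree, oob))
    return forest


def predict_majority_vote(forest, X):
    """Combine the trees by simple majority vote (hypothetical helper)."""
    votes = np.stack([tree.predict(X) for tree, _ in forest])
    preds = []
    for column in votes.T:
        labels, counts = np.unique(column, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)
```

Because each tree sees a different majority-class subset drawn across all sub-clusters, the ensemble as a whole still covers the majority class even though every individual tree is trained on balanced data. The out-of-bag indices returned with each tree could be used to estimate its error without a separate hold-out set.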

Key words: random forest, unbalanced data, cluster analysis, Chinese medicine informatics

摘要:

针对随机森林分类效果受样本集类间不平衡、类内不规则的影响,提出一种聚类欠采样策略的随机森林优化方法。该方法对原始数据大类样本聚类,得到与小类样本个数相同的子类簇;从每个子类簇中随机有放回抽取一个样本与小类样本合并,形成平衡样本集;对平衡样本集进行有放回随机抽样,形成单棵决策树的训练样本集并完成建树;将两次未被抽中的样本作为袋外数据,用于模型测试;重复上述过程多次,形成随机森林。使用10组非平衡数据集进行实验验证,结果表明,该方法在这10组数据集上的分类能力及稳定性均优于传统随机森林。

关键词: 随机森林, 非平衡数据, 聚类分析, 中医药信息学