Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (22): 166-172. DOI: 10.3778/j.issn.1002-8331.1908-0425


Random Forest Optimization Method Based on Cluster Undersampling Strategy

LUO Jigen, DU Jianqiang, NIE Bin, LI Huan, NIE Jianhua, CHEN Yufeng   

  1. School of Computer Science, Jiangxi University of Traditional Chinese Medicine, Nanchang 330004, China
  2. School of Chinese Medicine, Jiangxi University of Traditional Chinese Medicine, Nanchang 330004, China
  • Online: 2020-11-15 Published: 2020-11-13





Random forest classification degrades when the classes in a sample set are imbalanced and when samples within a class are irregularly distributed. To address this, the paper proposes a random forest optimization method based on a cluster undersampling strategy. The method clusters the majority-class samples of the original data into as many sub-clusters as there are minority-class samples. One sample is drawn at random from each sub-cluster and merged with the minority-class samples to form a balanced sample set. This set is then sampled randomly with replacement to form the training set of a single decision tree, and the tree is built on it. Samples that neither sampling step selects serve as out-of-bag data for model testing. Repeating this process multiple times yields a random forest. Experiments on 10 imbalanced data sets show that the method achieves better classification ability and stability than the traditional random forest on these data sets.
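The sampling steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name a clustering algorithm, so plain k-means is assumed here; the function names `kmeans` and `build_tree_sample` are hypothetical; out-of-bag data is read as every original sample that does not end up in the tree's training set; and the decision tree itself is left out.

```python
import random


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means (an assumed choice); returns one label per point."""
    centers = rng.sample(points, k)  # requires k <= len(points)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each center to the mean of its members; empty clusters keep theirs.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def build_tree_sample(majority, minority, rng):
    """Return (balanced, train, oob) as lists of ("maj"/"min", index) references."""
    # Step 1: cluster the majority class into as many sub-clusters as there
    # are minority samples.
    k = len(minority)
    labels = kmeans(majority, k, rng)

    # Step 2: draw one majority sample at random from each non-empty sub-cluster.
    picked = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            picked.append(rng.choice(members))

    # Step 3: merge with all minority samples to form the balanced set.
    balanced = [("maj", i) for i in picked]
    balanced += [("min", j) for j in range(len(minority))]

    # Step 4: bootstrap (sample with replacement) the training set of one tree.
    train = [rng.choice(balanced) for _ in balanced]

    # Step 5 (assumed reading): out-of-bag = original samples absent from train.
    in_bag = set(train)
    everything = [("maj", i) for i in range(len(majority))]
    everything += [("min", j) for j in range(len(minority))]
    oob = [ref for ref in everything if ref not in in_bag]
    return balanced, train, oob
```

Calling `build_tree_sample` once per tree, each time with fresh randomness, and fitting a decision tree on each returned `train` set would give the forest; each tree's `oob` list would then be its test data. If k-means leaves a sub-cluster empty, this sketch simply draws fewer majority samples, which is a simplification rather than something the abstract specifies.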

Key words: random forest, unbalanced data, cluster analysis, Chinese medicine informatics


