Computer Engineering and Applications, 2022, Vol. 58, Issue (7): 87-96. DOI: 10.3778/j.issn.1002-8331.2103-0520

• Theory, Research and Development •

Multi-Instance Ensemble Algorithm Combined with Fuzzy Clustering

HAN Haiyun, YANG Youlong, SUN Liqin   

  1. School of Mathematics and Statistics, Xidian University, Xi’an 710126, China
  • Online: 2022-04-01 Published: 2022-04-01

结合模糊聚类的多示例集成算法

韩海韵,杨有龙,孙丽芹   

  1. 西安电子科技大学 数学与统计学院,西安 710126

Abstract: Many multi-instance learning algorithms make assumptions about the proportion of positive instances in positive bags. To avoid such assumptions, a multi-instance ensemble algorithm combined with fuzzy clustering (ISFC) is proposed. First, by combining fuzzy clustering with the characteristics of negative bags in multi-instance learning, the concept of a positive score is introduced to measure how likely an instance's label is to be positive, which reduces the ambiguity of instance labels. Then, because misclassifying negative instances is more costly in multi-instance learning, a strategy for selecting representative instances from each bag is designed, and the selected representative instances are used as the training subsets of the base classifiers. Finally, the results of the base classifiers are combined to determine the final label of each bag. ISFC makes no assumption about the proportion of positive instances in positive bags, and it can handle the class imbalance that arises when positive bags are many and negative bags are few. Experimental results show that ISFC achieves good classification performance on drug molecular activity prediction, image classification, and text classification tasks.

Key words: multi-instance learning, fuzzy clustering, random subspace, instance selection, ensemble learning
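
The first two steps described in the abstract (scoring instances with fuzzy clustering, then picking a representative instance per bag) can be illustrated with a minimal sketch. The abstract does not give the paper's formula for the positive score, so the definition used here is an assumption made only for illustration: each cluster is given the share of its membership mass that comes from negative-bag instances, and an instance's score is one minus its membership-weighted average of those shares. The function names, the toy bags, the initial centers, and the rule "take the highest-scoring instance in each bag" are likewise hypothetical, and the ensemble of base classifiers is omitted. Only the fuzzy c-means membership and center updates follow the standard formulation.

```python
import math

def fuzzy_memberships(x, centers, m=2.0):
    """Standard fuzzy c-means membership of point x in each cluster."""
    dists = [math.dist(x, c) for c in centers]
    # A point lying exactly on a center belongs fully to that center.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    inv = [d ** -power for d in dists]
    total = sum(inv)
    return [v / total for v in inv]

def fuzzy_c_means(points, init_centers, m=2.0, n_iter=50):
    """Alternate membership and center updates for a fixed number of rounds."""
    centers = [list(c) for c in init_centers]
    dim = len(points[0])
    for _ in range(n_iter):
        U = [fuzzy_memberships(x, centers, m) for x in points]
        for k in range(len(centers)):
            w = [u[k] ** m for u in U]
            total = sum(w)
            centers[k] = [
                sum(wi * x[d] for wi, x in zip(w, points)) / total
                for d in range(dim)
            ]
    U = [fuzzy_memberships(x, centers, m) for x in points]
    return centers, U

def positive_scores(U, from_negative_bag):
    """ASSUMED score, not the paper's definition: 1 minus the
    membership-weighted share of negative-bag mass in each cluster."""
    n_clusters = len(U[0])
    neg_share = []
    for k in range(n_clusters):
        neg_mass = sum(u[k] for u, neg in zip(U, from_negative_bag) if neg)
        all_mass = sum(u[k] for u in U)
        neg_share.append(neg_mass / all_mass)
    return [1.0 - sum(u[k] * neg_share[k] for k in range(n_clusters)) for u in U]

# Hypothetical toy data: bag name -> (bag label, list of 2-D instances).
bags = {
    "neg1": (0, [(0.0, 0.1), (0.2, 0.0)]),
    "neg2": (0, [(0.1, 0.2), (-0.1, 0.0)]),
    "pos1": (1, [(0.1, 0.1), (5.0, 5.1)]),
    "pos2": (1, [(4.9, 5.0), (0.0, -0.1)]),
}

# Flatten the bags while remembering which bag each instance came from.
points, owners = [], []
for name, (label, instances) in bags.items():
    for inst in instances:
        points.append(inst)
        owners.append(name)

centers, U = fuzzy_c_means(points, init_centers=[(0.0, 0.0), (1.0, 1.0)])
scores = positive_scores(U, [bags[o][0] == 0 for o in owners])

# ASSUMED selection rule: keep the highest-scoring instance of each bag.
representatives = {}
for point, owner, score in zip(points, owners, scores):
    best = representatives.get(owner)
    if best is None or score > best[1]:
        representatives[owner] = (point, score)

for name, (point, score) in representatives.items():
    print(name, point, round(score, 3))
```

In this toy setting, instances that cluster together with negative-bag instances receive low scores, while the instances that sit apart from every negative bag receive high scores and are chosen as the representatives of the positive bags. No threshold on the proportion of positive instances per bag is needed, which is the property the abstract emphasizes.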

摘要: 针对许多多示例算法都对正包中的示例情况做出假设的问题,提出了结合模糊聚类的多示例集成算法(ISFC)。结合模糊聚类和多示例学习中负包的特点,提出了“正得分”的概念,用于衡量示例标签为正的可能性,降低了多示例学习中示例标签的歧义性;考虑到多示例学习中将负示例分类错误的代价更大,设计了一种包的代表示例选择策略,选出的代表示例作为基分类器的训练子集;结合各基分类器的结果,确定包的最终标签。ISFC算法对正包中正示例的比例未做任何假设,同时能够解决正包数量多、负包数量少情况下的类别不平衡问题。实验结果表明,ISFC在药物分子活性预测、图像分类、文本分类任务上都取得了较好的分类效果。

关键词: 多示例学习, 模糊聚类, 随机子空间, 示例选择, 集成学习