Computer Engineering and Applications (计算机工程与应用) ›› 2016, Vol. 52 ›› Issue (16): 149-155.

• Pattern Recognition and Artificial Intelligence •

Cluster-center-distance maximization clustering with knowledge transfer

SUN Shouwei (孙寿伟), QIAN Pengjiang (钱鹏江), CHEN Aiguo (陈爱国), JIANG Yizhang (蒋亦樟)

  1. School of Digital Media, Jiangnan University, Wuxi, Jiangsu 214122, China
  • Online: 2016-08-15  Published: 2016-08-12

Abstract: Traditional clustering algorithms are liable to fail outright in two cases: (1) the data are scarce or contaminated by a large amount of noise or outliers; (2) the dataset is scaled proportionally in order to adjust the differences among the data. To address both problems at once, this paper proposes a historical knowledge transfer criterion and a cluster-center-distance maximization criterion, and incorporates them into the classical Maximum Entropy Clustering (MEC) algorithm. The resulting method is called center distance maximization clustering with historical knowledge transfer (HKT-CDMC for short). HKT-CDMC has three main merits. First, when the current data are scarce or contaminated, it exploits historical knowledge through transfer learning and thereby demonstrates good clustering effectiveness. Second, once the data are scaled by a certain factor, the cluster centers obtained by traditional clustering algorithms tend to coincide, whereas HKT-CDMC avoids this by means of the cluster-center-distance maximization criterion. Third, the historical knowledge it uses cannot be mapped back to the historical source data, so it offers good privacy protection for the source domain. Experiments on both synthetic and real-world datasets verify these merits.

Key words: transfer learning, historical knowledge, maximum cluster-center-distance, privacy protection, fuzzy clustering
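
The second failure case described in the abstract can be reproduced with plain MEC. The sketch below is not HKT-CDMC, whose objective function is not given on this page; it implements only the standard MEC updates (softmax memberships with entropy weight gamma, followed by membership-weighted centers). The toy data, the value gamma = 1.0, the scale factor 0.01, and the names `mec` and `center_gap` are all assumptions chosen for illustration.

```python
import math


def mec(points, init_centers, gamma, iters=100):
    """Plain maximum entropy clustering (MEC); NOT the paper's HKT-CDMC."""
    centers = [list(c) for c in init_centers]
    dim = len(points[0])
    for _ in range(iters):
        # Membership update: softmax of negative squared distances over gamma.
        memberships = []
        for x in points:
            d = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
            d_min = min(d)  # shift by the minimum for numerical stability
            w = [math.exp(-(di - d_min) / gamma) for di in d]
            total = sum(w)
            memberships.append([wi / total for wi in w])
        # Center update: membership-weighted mean of all points.
        for i in range(len(centers)):
            weight = sum(u[i] for u in memberships)
            centers[i] = [
                sum(u[i] * x[k] for u, x in zip(memberships, points)) / weight
                for k in range(dim)
            ]
    return centers


def center_gap(centers):
    """Euclidean distance between the first two cluster centers."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(centers[0], centers[1])))


# Hypothetical toy data: two well-separated groups of four points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
scale = 0.01  # assumed scale factor; gamma is held fixed across both runs
scaled = [tuple(scale * v for v in p) for p in points]

gap_original = center_gap(mec(points, [points[0], points[-1]], gamma=1.0))
gap_scaled = center_gap(mec(scaled, [scaled[0], scaled[-1]], gamma=1.0))
print("center gap, original data:", gap_original)
print("center gap, scaled data:  ", gap_scaled)
```

With gamma held fixed, shrinking the data makes the entropy term dominate the distance term, so the memberships become nearly uniform and both centers drift to the global mean. This is the kind of collapse that a cluster-center-distance maximization criterion is meant to counteract.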