Computer Engineering and Applications ›› 2010, Vol. 46 ›› Issue (8): 123-126. DOI: 10.3778/j.issn.1002-8331.2010.08.035

• Database, Signal and Information Processing •

Semi-supervised clustering algorithm based on seeds set and frequent itemset mining

ZHAO Qian, SHANG Xue-qun, WANG Miao

  1. School of Computer, Northwestern Polytechnical University, Xi’an 710072, China
  • Received: 2008-09-18 Revised: 2008-12-04 Online: 2010-03-11 Published: 2010-03-11
  • Contact: ZHAO Qian

Semi-supervised clustering algorithm based on seeds set and frequent itemset mining

赵 倩,尚学群,王 淼   

  1. School of Computer, Northwestern Polytechnical University, Xi’an 710072
  • Corresponding author: 赵 倩

Abstract: Semi-supervised clustering uses a small amount of supervised information in unsupervised clustering to improve clustering performance. This paper proposes a semi-supervised clustering algorithm based on a seeds set and frequent itemset mining. It mines frequent itemsets in both the initial seeds set and the enlarged seeds set to eliminate noisy data and correct mislabeled data, which improves the quality of the seeds set and the performance of clustering. A weighted χ² measure, used as a classification rule evaluation measure, labels the unlabeled data, which are then added to the initial seeds set to enlarge it. The experimental results show that the proposed approach effectively reduces noisy data, makes the results more accurate, and improves clustering performance.
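The abstract does not define the weighted χ² measure. As a rough illustration only, the sketch below assumes the formulation popularized by the CMAR associative classifier, in which each rule P → c contributes χ²·χ²/maxχ² and the label with the largest sum wins. The function names, the `(sup_p, sup_c, sup_pc)` tuple layout, and the example counts are hypothetical and are not taken from the paper.

```python
def chi2(n, sup_p, sup_c, sup_pc):
    """Pearson chi-square of the 2x2 contingency table for a rule P -> c.

    n      : total number of records
    sup_p  : records matching the rule body P
    sup_c  : records carrying class label c
    sup_pc : records matching both P and c
    """
    a = sup_pc
    b = sup_p - sup_pc
    c = sup_c - sup_pc
    d = n - sup_p - sup_c + sup_pc
    denom = sup_p * (n - sup_p) * sup_c * (n - sup_c)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denom


def max_chi2(n, sup_p, sup_c):
    """Upper bound on chi2 for fixed margins (assumed CMAR-style bound):
    the value reached when the joint support equals min(sup_p, sup_c)."""
    return chi2(n, sup_p, sup_c, min(sup_p, sup_c))


def weighted_chi2(n, rules):
    """Sum of chi2^2 / max_chi2 over a group of rules for one class label."""
    total = 0.0
    for sup_p, sup_c, sup_pc in rules:
        upper = max_chi2(n, sup_p, sup_c)
        if upper > 0:
            value = chi2(n, sup_p, sup_c, sup_pc)
            total += value * value / upper
    return total


def predict_label(n, rules_by_label):
    """Pick the label whose matching rules have the largest weighted chi2."""
    return max(rules_by_label, key=lambda lab: weighted_chi2(n, rules_by_label[lab]))


# Hypothetical rule statistics for one unlabeled record (100 records total).
rules_by_label = {
    "A": [(20, 50, 18)],  # strongly associated with class A
    "B": [(20, 50, 10)],  # independent of class B
}
label = predict_label(100, rules_by_label)
```

Normalizing by the upper bound keeps rules with a minority-class head from being drowned out by rules whose raw χ² is large only because their class is frequent. Whether the paper uses exactly this normalization is an assumption.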

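Abstract (Chinese version): Semi-supervised clustering improves clustering performance by making effective use of a small amount of supervised information in unsupervised learning. This paper proposes a seeds-set-based semi-supervised clustering algorithm that uses the Apriori algorithm to mine frequent itemsets in the initial seeds set and in the enlarged seeds set, so that noisy and mislabeled data are cleaned and corrected, which improves the quality of the seeds set and the clustering performance. The algorithm uses the weighted χ² test as the classification rule measure to predict class labels for unlabeled data. Experimental results show that the proposed algorithm, which combines frequent itemset mining with the weighted χ² test, improves the quality of the seeds set, raises the precision of the predictions, and improves clustering performance.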
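The frequent itemset mining step can be sketched with a minimal, textbook-style Apriori in Python. This is an illustration of the general technique, not the authors' implementation: the function name `apriori`, the absolute `min_support` threshold, and the toy `seeds` records are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset.

    transactions : iterable of item collections (e.g. attribute=value tokens)
    min_support  : minimum absolute number of supporting transactions
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count support of the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: n for s, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical seed records, each reduced to a set of discrete items.
seeds = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
frequent_itemsets = apriori(seeds, min_support=3)
```

In a seeds-set setting, one plausible use is to treat each labeled seed as a transaction of attribute values plus its class label, so that seeds which match no frequent pattern of their own class become candidates for removal or relabeling. How the paper applies the mined itemsets in detail is not stated in the abstract.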

CLC Number: