Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (14): 135-138.

• Database, Data Mining, Machine Learning •

改进的K-means算法在维文连体段聚类中的应用

张建周,哈力木拉提·买买提,陈晓娇   

  1. 新疆大学 信息科学与工程学院 多语种信息技术重点实验室,乌鲁木齐 830046
  • 出版日期:2014-07-15 发布日期:2014-08-04

Application of improved K-means algorithm in Uyghur word-part clustering

ZHANG Jianzhou, Halmurat·Mamat, CHEN Xiaojiao   

  1. Key Lab of Multilanguage Information Technology, School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
  • Online: 2014-07-15 Published: 2014-08-04

摘要: 在维吾尔文文字识别中,能否有效地聚类将直接影响识别结果的好坏。为改善聚类效果,针对维吾尔文连体段聚类,提出了一种改进的K-means聚类算法。该算法首先采用等间距法多次选择类中心,然后选择最佳码本和利用有效相似比来动态调整聚类个数K,最后完成了连体段聚类。实验结果表明:与传统K-means算法相比,改进的K-means算法得到了较好聚类效果,聚类正确率达90%以上。

关键词: 维吾尔文文字识别, 连体段, 聚类算法, 等间距法, 有效相似比, 正确率

Abstract: In Uyghur character recognition, the quality of clustering directly affects the recognition results. To improve clustering, an improved K-means clustering algorithm for Uyghur word-part clustering is presented. The algorithm first selects the cluster centers several times using the equal interval method, then selects the best codebook and dynamically adjusts the number of clusters K using an effective similarity ratio, and finally completes the word-part clustering. Experimental results show that, compared with the traditional K-means algorithm, the improved K-means algorithm achieves better clustering, with a clustering accuracy above 90%.
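
The first two steps named in the abstract (repeated equal-interval selection of initial centers, then keeping the best codebook) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the `trials` and `offset` parameters, and the use of lowest total squared distortion as the criterion for the "best" codebook are all hypothetical. The abstract does not define the effective similarity ratio, so K is held fixed here and the dynamic adjustment of K is not shown.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def equal_interval_centers(samples, k, offset):
    """Pick k initial centers at equally spaced indices, shifted by `offset`.

    Assumption: "equal interval" means a fixed index stride over the sample list.
    """
    step = len(samples) // k
    return [list(samples[(offset + i * step) % len(samples)]) for i in range(k)]


def assign(samples, centers):
    """Label each sample with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(s, centers[j]))
            for s in samples]


def kmeans(samples, centers, iters=20):
    """Standard K-means iterations; returns (centers, labels, distortion)."""
    for _ in range(iters):
        labels = assign(samples, centers)
        for j in range(len(centers)):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    labels = assign(samples, centers)
    distortion = sum(sq_dist(s, centers[lab]) for s, lab in zip(samples, labels))
    return centers, labels, distortion


def best_codebook(samples, k, trials=5):
    """Run K-means from several equal-interval starts; keep the lowest distortion.

    Assumption: the "best codebook" is the one with the smallest total distortion.
    Requires 1 <= k <= len(samples).
    """
    step = len(samples) // k
    best = None
    for t in range(trials):
        offset = (t * step) // trials  # a different shift within one stride per trial
        result = kmeans(samples, equal_interval_centers(samples, k, offset))
        if best is None or result[2] < best[2]:
            best = result
    return best
```

Calling `best_codebook(features, k)` on a list of numeric feature tuples returns the chosen centers, the label of each sample, and the total distortion. How word-part features are extracted, and how K would then be adjusted, are left open because the abstract does not specify them.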

Key words: Uyghur character recognition, word-part, clustering algorithm, equal interval method, effective similarity ratio, accuracy