Research on text clustering for selecting initial cluster center based on Cosine distance

doi:10.3778/j.issn.1002-8331.1802-0108

Computer Engineering and Applications, 2018, Vol. 54, Issue 10: 11-18.

WANG Binyu (王彬宇)1, LIU Wenfen (刘文芬)2, HU Xuexian (胡学先)1, WEI Jianghong (魏江宏)1

1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450000, China
2. Guangxi Key Laboratory of Cryptography and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541000, China

Online: 2018-05-15 Published: 2018-05-28

Abstract

Text clustering is an important means of organizing, summarizing, and navigating text information, and the K-means algorithm based on cosine similarity is one of the most important and most widely used text clustering algorithms. Improvements to cosine-similarity K-means are difficult to design, and many effective improvements developed for Euclidean-distance K-means cannot be applied to it directly. To address this, the relationship between cosine similarity and Euclidean distance is examined, and a conversion formula between the two is obtained for normalized vectors. On this basis, a cosine distance is defined that is close in meaning and closely related to the Euclidean distance, so that improvements originally designed for Euclidean-distance K-means can be transferred to cosine-similarity K-means through the cosine distance. The method for computing cluster centers in cosine K-means and its extensions is then derived theoretically, and the scheme for selecting initial cluster centers is further improved, yielding a new text clustering algorithm, MCSKM++. Experiments show that the algorithm improves clustering accuracy while reducing the number of iterations and shortening the running time.

Key words: text clustering, K-means algorithm, cosine similarity, cosine distance, initial point selection
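
As an illustration of the relationship the abstract describes, the following Python sketch assumes unit-length document vectors, for which the squared Euclidean distance satisfies ||x - y||^2 = 2(1 - cos(x, y)). Under that assumption, a k-means++-style seeding that weights candidates by 1 - cos(x, c) is proportional to the usual squared-distance weighting. The abstract does not state the paper's exact cosine distance definition or the details of MCSKM++, so the definition `1 - cos` and the names `normalize_rows`, `cosine_distance`, and `seed_centers` are assumptions made for illustration, not the authors' method.

```python
import numpy as np


def normalize_rows(X):
    """Scale each row to unit L2 norm (rows are assumed to be non-zero)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / norms


def cosine_distance(X, c):
    """Assumed definition: 1 - cosine similarity.

    X holds unit-norm rows and c is a unit-norm vector, so the dot
    product X @ c is already the cosine similarity.
    """
    return 1.0 - X @ c


def seed_centers(X, k, rng):
    """k-means++-style seeding using cosine distance (illustrative only).

    For unit vectors, 1 - cos equals half the squared Euclidean distance,
    so sampling proportionally to it matches k-means++ weighting.
    """
    n = X.shape[0]
    first = rng.integers(n)                      # first center: uniform at random
    centers = [X[first]]
    d = cosine_distance(X, X[first])             # distance to nearest chosen center
    for _ in range(1, k):
        d = np.clip(d, 0.0, None)                # guard against tiny negative round-off
        probs = d / d.sum()                      # farther points are more likely
        idx = rng.choice(n, p=probs)
        centers.append(X[idx])
        d = np.minimum(d, cosine_distance(X, X[idx]))
    return np.vstack(centers)
```

A call such as `seed_centers(normalize_rows(tfidf), k, np.random.default_rng(0))`, where `tfidf` is a hypothetical dense document-term matrix, would return `k` unit-norm rows to use as starting centers for a cosine-similarity K-means run.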

WANG Binyu, LIU Wenfen, HU Xuexian, WEI Jianghong. Research on text clustering for selecting initial cluster center based on Cosine distance[J]. Computer Engineering and Applications, 2018, 54(10): 11-18.

