Improved K-means clustering algorithm combined semantic similarity of short text

Computer Engineering and Applications, 2016, Vol. 52, Issue 19: 78-83.

QIU Yunfei (邱云飞), ZHAO Bin (赵彬), LIN Mingming (林明明), WANG Wei (王伟)

School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China

Online: 2016-10-01 Published: 2016-11-18

Abstract

Abstract: Short-text clustering faces three major challenges: the sparsity of feature keywords, the complexity of processing in high-dimensional space, and the comprehensibility of clusters. To address them, a K-means short-text clustering algorithm improved with semantic information is proposed. The algorithm represents each short text as a set of words, which alleviates the sparsity of feature keywords. It obtains the initial cluster centers by mining the maximum frequent word sets of the short-text collection, which overcomes the sensitivity of K-means to its initial cluster centers and makes the resulting clusters easier to interpret. It computes the similarity between documents with a semantic similarity measure combined with TF-IDF values, which avoids computation in a high-dimensional space. Experimental results show that the short-text clustering algorithm built on semantics outperforms traditional short-text clustering algorithms.

Key words: text mining, clustering of short text, K-means algorithm, maximum frequent word set, HowNet, semantic similarity
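The pipeline summarized in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract gives no formulas, so the support threshold, the TF-IDF variant, the way word similarities are aggregated into a document similarity, and every function name below are assumptions made for illustration. In particular, `toy_word_sim` is a hypothetical stand-in for a HowNet-based word similarity, and "maximum frequent word set" is taken to mean a frequent word set with no frequent superset. Only the seeding and the first assignment step are shown.

```python
from itertools import combinations
import math


def maximal_frequent_word_sets(docs, min_support):
    """Word sets found in >= min_support docs that have no frequent superset."""
    doc_sets = [frozenset(d) for d in docs]

    def support(candidate):
        return sum(1 for d in doc_sets if candidate <= d)

    # Level 1: frequent single words.
    vocab = set().union(*doc_sets)
    current = {frozenset([w]) for w in vocab
               if support(frozenset([w])) >= min_support}
    frequent = set(current)

    # Apriori-style growth: join frequent k-sets into (k+1)-candidates.
    while current:
        size = len(next(iter(current))) + 1
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == size}
        current = {c for c in candidates if support(c) >= min_support}
        frequent |= current

    maximal = [s for s in frequent if not any(s < other for other in frequent)]
    return sorted(maximal, key=sorted)  # deterministic order


def tfidf_weights(docs):
    """One {word: tf * idf} dict per document (assumed smoothed-IDF variant)."""
    n = len(docs)
    df = {}
    for d in docs:
        for w in set(d):
            df[w] = df.get(w, 0) + 1
    return [{w: (d.count(w) / len(d)) * math.log(1 + n / df[w]) for w in set(d)}
            for d in docs]


def toy_word_sim(a, b):
    # Hypothetical stand-in for a HowNet-based similarity: exact match only.
    return 1.0 if a == b else 0.0


def directed_similarity(weights_a, words_b, word_sim):
    """Weighted average, over words of A, of the best match found in B."""
    total = sum(weights_a.values())
    if total == 0 or not words_b:
        return 0.0
    score = sum(wt * max(word_sim(w, v) for v in words_b)
                for w, wt in weights_a.items())
    return score / total


def semantic_similarity(weights_a, weights_b, word_sim):
    """Symmetric document similarity (assumed form, not the paper's formula)."""
    return 0.5 * (directed_similarity(weights_a, set(weights_b), word_sim)
                  + directed_similarity(weights_b, set(weights_a), word_sim))


def assign_to_seeds(docs, seeds, word_sim):
    """Initial assignment: each document goes to its most similar seed set."""
    seed_weights = [{w: 1.0 for w in s} for s in seeds]
    return [max(range(len(seeds)),
                key=lambda k: semantic_similarity(wd, seed_weights[k], word_sim))
            for wd in tfidf_weights(docs)]


docs = [
    ["phone", "screen", "battery"],
    ["phone", "battery", "price"],
    ["phone", "screen", "battery"],
    ["movie", "actor", "plot"],
    ["movie", "plot", "ticket"],
]
seeds = maximal_frequent_word_sets(docs, min_support=2)
labels = assign_to_seeds(docs, seeds, toy_word_sim)
print(seeds, labels)
```

On this toy data the two seeds are {battery, phone, screen} and {movie, plot}, and the documents are assigned to them as [0, 0, 0, 1, 1]. Because similarity is computed word by word rather than on vectors over the whole vocabulary, no high-dimensional document vector is ever built, and each seed doubles as a readable label for its cluster.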

QIU Yunfei, ZHAO Bin, LIN Mingming, WANG Wei. Improved K-means clustering algorithm combined semantic similarity of short text[J]. Computer Engineering and Applications, 2016, 52(19): 78-83.
