Research on Feature Extension Strategies in Microblog Text Clustering (微博文本聚类中特征扩展策略研究)

doi:10.3778/j.issn.1002-8331.1606-0438

Computer Engineering and Applications, 2017, Vol. 53, Issue 13: 90-94. DOI: 10.3778/j.issn.1002-8331.1606-0438

Feature extension of cluster analysis based on Microblog

DUAN Xulei (段旭磊), ZHANG Yangsen (张仰森), GUO Zhengbin (郭正斌)

Institute of Intelligence Information Processing, Beijing Information Science and Technology University, Beijing 100192, China

Online: 2017-07-01 Published: 2017-07-12

Abstract

Abstract: Microblogs have become fertile ground for generating and spreading information, yet microblog posts differ from news web pages and blog articles. Microblog texts are short, and their feature representations are high-dimensional and sparse, which makes microblog text processing difficult. To address these characteristics, this paper compares short-text expansion strategies based on external knowledge bases such as HowNet and Tongyici Cilin (同义词词林), and proposes training Word2vec on a microblog corpus to build a vocabulary of context-related words for microblogs. A seed word list and microblog tag (label) information are then used to expand the keywords in the microblog text stream. The paper also proposes methods for extracting keywords from microblog text and for distinguishing similar words from related words among the word vectors. Experimental results show that the clustering of microblog short texts improves markedly after expansion with Word2vec related words and microblog tags.

Key words: microblog text, high-dimensional and sparse, keyword extraction, similar words, related words, feature expansion, clustering
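The expansion step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the related-word table `RELATED_WORDS`, the helper names `expand` and `cosine`, the `top_n` cutoff, and the sample posts are all assumptions made for illustration. In the paper the related words come from Word2vec trained on a microblog corpus; here they are hard-coded so the example stays self-contained.

```python
from collections import Counter
import math

# Hypothetical related-word table (an assumption for this sketch).
# In the paper this vocabulary is derived from Word2vec vectors trained
# on a microblog corpus; it is hard-coded here.
RELATED_WORDS = {
    "earthquake": ["rescue", "disaster"],
    "rescue": ["earthquake", "relief"],
    "match": ["football", "goal"],
    "goal": ["football", "match"],
}


def expand(tokens, tags, related=RELATED_WORDS, top_n=2):
    """Return the tokens, then the tag words, then up to top_n related words per token."""
    expanded = list(tokens) + list(tags)
    for tok in tokens:
        expanded.extend(related.get(tok, [])[:top_n])
    return expanded


def cosine(a, b):
    """Cosine similarity between two token lists treated as bags of words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two toy posts about the same event that share no surface words.
post_a = ["earthquake", "news"]
post_b = ["rescue", "team"]

before = cosine(post_a, post_b)
after = cosine(expand(post_a, ["disaster"]), expand(post_b, ["disaster"]))
print(f"similarity before expansion: {before:.3f}")
print(f"similarity after expansion:  {after:.3f}")
```

With these toy inputs the two posts have no words in common before expansion, so their similarity is zero. After expansion they overlap on several terms, which is the kind of effect that would make sparse short texts easier to group in a later clustering step.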

DUAN Xulei, ZHANG Yangsen, GUO Zhengbin. Feature extension of cluster analysis based on Microblog[J]. Computer Engineering and Applications, 2017, 53(13): 90-94.

