Micro-blog Topic Detection Method Based on Latent Semantic Analysis

Computer Engineering and Applications, 2014, Vol. 50, Issue (1): 96-100.

• Database, Data Mining, Machine Learning •

MA Wenwen1, WEI Wenhan1, DENG Yigui1,2

1. School of Computer Science, Chongqing University, Chongqing 400044, China
2. Center of Information and Network, Chongqing University, Chongqing 400044, China

Online: 2014-01-01 Published: 2013-12-30

Abstract

Abstract: With the wide spread of micro-blogging and the growing attention it receives, hot topic detection in micro-blogs has become a current research focus. For short texts, the Vector Space Model (VSM) representation suffers from high dimensionality, sparsity, and synonymy and polysemy, which makes it difficult to measure text similarity accurately. To address this, the paper proposes a two-stage clustering topic detection method based on Latent Semantic Analysis (LSA). First, the concept of topic heat is introduced to select micro-blog texts that have attracted a certain level of attention, and LSA is used to model the dataset. Next, the hierarchical clustering algorithm CURE is used to determine the initial cluster centers. Finally, K-means clustering produces the hot topic clusters. Experimental results on a real micro-blog dataset verify the effectiveness of the method.

Key words: Latent Semantic Analysis (LSA), Vector Space Model (VSM), topic detection, micro-blog, two-stage clustering
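The abstract gives no implementation details, so the following is only a minimal sketch of the two-stage clustering idea, not the authors' code. It assumes the micro-blog texts have already been filtered by topic heat and projected into a low-dimensional LSA space, so each text is a short dense vector. Stage one is a plain centroid-linkage agglomerative merge used as a simplified stand-in for CURE (real CURE also keeps several representative points per cluster and shrinks them toward the centroid). The function names `agglomerative_seeds`, `kmeans`, and `two_stage_cluster` and the toy vectors are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def agglomerative_seeds(points: Sequence[Vector], k: int) -> List[Vector]:
    """Stage 1 (simplified stand-in for CURE): repeatedly merge the two
    clusters with the closest centroids until k clusters remain, then
    return their centroids as initial centers."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sq_dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]


def kmeans(points: Sequence[Vector], centers: Sequence[Vector],
           max_iter: int = 100) -> Tuple[List[int], List[Vector]]:
    """Stage 2: standard Lloyd iterations starting from the given centers."""
    centers = [list(c) for c in centers]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Recompute each center; an empty cluster keeps its old center.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels, centers


def two_stage_cluster(points: Sequence[Vector],
                      k: int) -> Tuple[List[int], List[Vector]]:
    """Seed K-means with the centers produced by the hierarchical stage."""
    return kmeans(points, agglomerative_seeds(points, k))


if __name__ == "__main__":
    # Toy vectors standing in for LSA-reduced micro-blog texts.
    docs = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
            [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
    labels, centers = two_stage_cluster(docs, k=2)
    print(labels, centers)
```

Seeding K-means from a hierarchical pass removes its dependence on random initial centers. The trade-off is that the pairwise merge above is quadratic or worse in the number of texts, so in practice it would typically run on a sample.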

MA Wenwen, WEI Wenhan, DENG Yigui. Micro-blog topic detection method based on Latent Semantic Analysis[J]. Computer Engineering and Applications, 2014, 50(1): 96-100.

