Twitter repeat messages analysis and processing

Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (21): 111-115.

XU Kai1,2, SHA Ying2, LI Yang3, SHAN Jixi2, WANG Xiaoyan2

1. School of Computer and Information Engineering, Jiangxi Agricultural University, Nanchang 330045, China
2. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China
3. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China

Online: 2014-11-01 Published: 2014-10-28

Abstract

Twitter has become a representative microblogging application. Analysis shows, however, that many messages (tweets) on Twitter are identical or similar to one another, which causes considerable trouble for the later analysis and storage of tweets. To handle these identical or similar tweets, this paper exploits the short-text nature of tweets: identical tweets are processed with rules based on their specific format, and similar tweets are processed with simhash. The experiments use 2.4 million tweets crawled from the Internet and analyze duplication among Chinese and English tweets separately. The results show that duplicate tweets account for about 10 percent of all the Chinese and English tweets.

Key words: Twitter, microblog, Simhash, short text duplicate removal
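
The two-stage idea in the abstract (rules for identical tweets, simhash for similar ones) can be illustrated with a minimal Python sketch. The abstract does not state the paper's actual rules, features, hash function, or distance threshold, so everything below is an assumption made for illustration: the `normalize` rules (dropping a leading `RT @user:` marker and URLs), character trigram features, an MD5-derived 64-bit hash, and a Hamming distance threshold of 3.

```python
import hashlib
import re

BITS = 64  # fingerprint width (assumed, not taken from the paper)


def normalize(tweet):
    """Hypothetical rule-based cleanup so that exact copies compare equal:
    drop a leading 'RT @user:' marker, URLs, and redundant whitespace."""
    text = re.sub(r"^RT\s+@\w+:\s*", "", tweet.strip())
    text = re.sub(r"https?://\S+", "", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def features(text, n=3):
    """Character n-grams; usable for both Chinese and English text."""
    if len(text) <= n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def simhash(text):
    """Build a 64-bit simhash fingerprint with unit feature weights."""
    weights = [0] * BITS
    for feat in features(text):
        digest = hashlib.md5(feat.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the digest
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(tweets, max_distance=3):
    """Keep the first tweet of every group of identical or similar tweets."""
    seen_exact = set()
    kept = []  # (fingerprint, original tweet) pairs
    for tweet in tweets:
        norm = normalize(tweet)
        if norm in seen_exact:  # stage 1: rule-based exact match
            continue
        seen_exact.add(norm)
        fp = simhash(norm)      # stage 2: simhash near-duplicate check
        if any(hamming(fp, other) <= max_distance for other, _ in kept):
            continue
        kept.append((fp, tweet))
    return [tweet for _, tweet in kept]
```

Calling `deduplicate(["hello world", "RT @bob: hello world"])` keeps only the first tweet, because both normalize to the same string. The linear scan over `kept` is quadratic in the worst case; at the scale of millions of tweets, an index over fingerprint segments would likely be needed, and the threshold may have to be tuned for short texts.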

XU Kai, SHA Ying, LI Yang, SHAN Jixi, WANG Xiaoyan. Twitter repeat messages analysis and processing[J]. Computer Engineering and Applications, 2014, 50(21): 111-115.
