Research on text extraction algorithm based on structure similarity page clustering

Computer Engineering and Applications, 2018, Vol. 54, Issue (11): 122-127. DOI: 10.3778/j.issn.1002-8331.1701-0161

WANG Haiyong (王海涌), FENG Zhaoxu (冯兆旭), YANG Haibo (杨海波), ZHANG Jindong (张津栋)

School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China

Online: 2018-06-01 Published: 2018-06-14

Abstract

Abstract: Web pages on the Internet are becoming increasingly diverse and complex, which makes information extraction more difficult. This paper proposes a Web page text extraction algorithm based on clustering structurally similar pages. First, each “block” that makes up a page's front-end template is assigned a weight according to its contribution to the template. Second, the similarity of the corresponding blocks in two pages is computed, and the sum of the products of each block's similarity and its weight is taken as the similarity of the two pages. The algorithm takes full account of the effect that pages with widely differing structures have on text extraction: pages are clustered by their pairwise similarity, so that the extraction results for pages within the same cluster are more accurate. Experimental results show that the method achieves higher accuracy and that all evaluation metrics improve.

Key words: text extraction, similarity, Document Object Model (DOM) tree, hierarchical clustering
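
The weighted page similarity and the clustering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the block partition, the weight values, the block-level similarity measure, or the clustering parameters, so the block names, the weights in `BLOCK_WEIGHTS`, the Jaccard measure over tag sets, the average-linkage rule, and the `threshold` value are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical block names and weights; the abstract does not state the real values.
BLOCK_WEIGHTS = {"header": 0.2, "nav": 0.2, "content": 0.5, "footer": 0.1}


def block_similarity(tags_a, tags_b):
    """Jaccard similarity of two blocks' tag sets (an assumed stand-in measure)."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 1.0  # two empty blocks are treated as identical
    return len(a & b) / len(a | b)


def page_similarity(page_a, page_b, weights=BLOCK_WEIGHTS):
    """Sum over blocks of (block weight * similarity of the corresponding blocks)."""
    return sum(
        w * block_similarity(page_a.get(name, []), page_b.get(name, []))
        for name, w in weights.items()
    )


def cluster_pages(pages, threshold=0.8, weights=BLOCK_WEIGHTS):
    """Agglomerative clustering with average linkage (assumed linkage rule).

    Repeatedly merges the two most similar clusters and stops once the best
    available similarity falls below `threshold`. Returns lists of page indices.
    """
    clusters = [[i] for i in range(len(pages))]

    def linkage(c1, c2):
        sims = [page_similarity(pages[i], pages[j], weights) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        (x, y), best = max(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return clusters


# Toy pages: each block is described by the HTML tags it contains (made-up data).
pages = [
    {"header": ["div", "h1"], "nav": ["ul", "li", "a"],
     "content": ["div", "p", "img"], "footer": ["div", "span"]},
    {"header": ["div", "h1"], "nav": ["ul", "li", "a"],
     "content": ["div", "p"], "footer": ["div", "span"]},
    {"header": ["table", "tr"], "nav": ["table", "td"],
     "content": ["table", "td", "font"], "footer": ["center"]},
]
print(cluster_pages(pages))  # [[0, 1], [2]]
```

In this toy data the first two pages share a template and differ only in the content block, so their weighted similarity (about 0.83) clears the assumed threshold and they are grouped together. The third page shares no tags with the others and stays in its own cluster. A text extractor would then be applied per cluster, which is the step the abstract credits for the accuracy gain.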

WANG Haiyong, FENG Zhaoxu, YANG Haibo, ZHANG Jindong. Research on text extraction algorithm based on structure similarity page clustering[J]. Computer Engineering and Applications, 2018, 54(11): 122-127.
