Computer Engineering and Applications ›› 2018, Vol. 54 ›› Issue (11): 122-127.DOI: 10.3778/j.issn.1002-8331.1701-0161

Research on text extraction algorithm based on structure similarity page clustering

WANG Haiyong, FENG Zhaoxu, YANG Haibo, ZHANG Jindong   

  1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
  • Online: 2018-06-01 Published: 2018-06-14

基于结构相似网页聚类的正文提取算法研究

王海涌,冯兆旭,杨海波,张津栋   

  1. 兰州交通大学 电子与信息工程学院,兰州 730070

Abstract: Web pages are becoming increasingly diverse and complex, which makes information extraction more difficult. This paper proposes a text extraction algorithm based on clustering Web pages with similar structure. First, each "block" of a page's front-end template is assigned a weight according to its contribution to the template. Second, the similarity between corresponding blocks of two Web pages is calculated, and the sum over all blocks of the product of block similarity and block weight is taken as the similarity of the two pages. The algorithm accounts for the effect that large structural differences between pages have on text extraction: pages are clustered according to their pairwise similarity, so that text extraction within a single cluster is more accurate. Experimental results show that the method achieves higher accuracy and that all evaluation indexes improve.
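
The similarity measure and clustering step described in the abstract might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the block names, the weights in `WEIGHTS`, the use of Jaccard similarity over tag paths as the per-block measure, the single-linkage merge rule, and the `threshold` value are all hypothetical choices made for the example.

```python
from itertools import combinations

# Hypothetical block weights; the paper's actual blocks and values are not stated here.
WEIGHTS = {"header": 0.2, "body": 0.6, "footer": 0.2}

def block_similarity(a, b):
    """Jaccard similarity of two sets of tag paths (an assumed stand-in measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def page_similarity(p, q, weights=WEIGHTS):
    """Sum over blocks of (block weight * similarity of the corresponding blocks)."""
    return sum(w * block_similarity(p.get(name, set()), q.get(name, set()))
               for name, w in weights.items())

def cluster_pages(pages, threshold=0.7):
    """Single-linkage agglomerative clustering: keep merging two clusters
    while some cross-cluster pair of pages reaches the similarity threshold."""
    clusters = [[i] for i in range(len(pages))]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(clusters)), 2):
            if any(page_similarity(pages[i], pages[j]) >= threshold
                   for i in clusters[a] for j in clusters[b]):
                clusters[a].extend(clusters.pop(b))  # a < b, so index a stays valid
                merged = True
                break
    return clusters

# Toy pages: each block is represented by the set of tag paths it contains.
news_1 = {"header": {"div/nav", "div/logo"}, "body": {"div/h1", "div/p"}, "footer": {"div/a"}}
news_2 = {"header": {"div/nav", "div/logo"}, "body": {"div/h1", "div/p", "div/img"}, "footer": {"div/a"}}
forum = {"header": {"table/tr"}, "body": {"ul/li"}, "footer": {"span"}}

clusters = cluster_pages([news_1, news_2, forum])
```

In this toy input the two news pages share a template and fall into one cluster, while the forum page stays on its own; a text extractor could then be tuned per cluster rather than per site.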

Key words: information extraction, similarity, Document Object Model (DOM) tree, hierarchical clustering

摘要: 针对当前互联网网页越来越多样化、复杂化的特点,提出一种基于结构相似网页聚类的网页正文提取算法,首先,根据组成网页前端模板各“块”对模板的贡献赋以不同的权重,其次计算两个网页中对应块的相似度,将各块的相似度与权重乘积的总和作为两个网页的相似度。该算法充分考虑结构差别较大的网页对网页正文提取的影响,通过计算网页间相似度将网页聚类,使得同一簇中的网页正文提取结果更加准确。实验结果表明,该方法具有更高的准确率,各项评价指标均有所提高。

关键词: 正文提取, 相似性, 文档对象模型(DOM)树, 层次聚类