Research on text extraction algorithm based on structure similarity page clustering

doi:10.3778/j.issn.1002-8331.1701-0161

Abstract

Abstract: Web pages are becoming increasingly diverse and complex, which makes information extraction more difficult. This paper proposes a text extraction algorithm based on clustering Web pages by structural similarity. First, each "block" of a page's front-end template is assigned a weight according to its contribution to the template. Second, the similarity of the corresponding blocks in two Web pages is calculated, and the sum of the products of each block's similarity and its weight is taken as the similarity of the two pages. The algorithm takes into account the effect that large structural differences between Web pages have on text extraction: pages are clustered according to their pairwise similarity, so that text extraction is more accurate for the pages within the same cluster. Experimental results show that the method achieves higher accuracy and that all evaluation metrics improve.

Key words: information extraction, similarity, Document Object Model (DOM) tree, hierarchical clustering
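
The weighted page similarity and the clustering step described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the block names, the weight values, the 0.8 threshold, the use of `difflib.SequenceMatcher` as the block-level similarity, and the average-linkage merge rule are all hypothetical choices that the abstract does not specify. Each page is represented as a mapping from block name to the sequence of DOM tag names found in that block.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical block weights (assumed to sum to 1); the abstract gives no values.
WEIGHTS = {"header": 0.2, "nav": 0.2, "main": 0.5, "footer": 0.1}


def block_similarity(a, b):
    """Similarity in [0, 1] of two tag sequences (a stand-in measure)."""
    if not a and not b:
        return 1.0
    return SequenceMatcher(None, a, b).ratio()


def page_similarity(p, q, weights=WEIGHTS):
    """Sum over blocks of (block weight * block similarity)."""
    return sum(
        w * block_similarity(p.get(name, []), q.get(name, []))
        for name, w in weights.items()
    )


def cluster_pages(pages, threshold=0.8, weights=WEIGHTS):
    """Agglomerative clustering with average linkage (assumed merge rule).

    Repeatedly merges the two most similar clusters until the best
    average similarity falls below `threshold`. Returns lists of page indices.
    """
    clusters = [[i] for i in range(len(pages))]

    def linkage(c1, c2):
        sims = [page_similarity(pages[i], pages[j], weights) for i in c1 for j in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        best_sim, a, b = max(
            (linkage(clusters[a], clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if best_sim < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


# Toy pages: two structurally similar news pages and one unrelated forum page.
news1 = {"header": ["div", "h1"], "nav": ["ul", "li", "li"],
         "main": ["div", "p", "p"], "footer": ["div"]}
news2 = {"header": ["div", "h1"], "nav": ["ul", "li", "li"],
         "main": ["div", "p", "p", "p"], "footer": ["div"]}
forum = {"header": ["table"], "nav": ["span"],
         "main": ["table", "tr", "td"], "footer": ["span"]}

clusters = cluster_pages([news1, news2, forum])
```

In this toy input the two news pages differ only in the `main` block, so they fall into one cluster, while the forum page shares no tags with them and stays on its own. A text extraction rule could then be derived per cluster rather than per site.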


WANG Haiyong, FENG Zhaoxu, YANG Haibo, ZHANG Jindong. Research on text extraction algorithm based on structure similarity page clustering[J]. Computer Engineering and Applications, 2018, 54(11): 122-127.


