Discovery method of webpage subject area based on structural analysis

Abstract

With the growth of the Internet, Web data mining (DM) has become increasingly important for acquiring thematic information. This paper parses a webpage into a tag tree. Building on a tree matching algorithm, it proposes algorithms for data region mining and semantic link block recognition, which together serve as a preprocessing step that removes link blocks. It then introduces the concept of text structure weight, uses the computed weights to discover the subject area, and obtains the thematic information after denoising. Experiments show that the method identifies the subject area of news and blog webpages well.

Key words: information extraction, subject area, text structure weight, denoising
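
The pipeline outlined in the abstract (tag tree, link block removal, weight-based selection of the subject area) can be illustrated with a minimal sketch. The abstract does not define text structure weight, so the score below is a simple stand-in chosen for illustration: non-link text characters in a subtree divided by the number of tags in that subtree. All names here (Node, TagTreeBuilder, score, find_subject_area, the CANDIDATES set, the sample page) are hypothetical, and the tree matching step of the paper is not reproduced.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "input", "meta", "link"}       # tags with no closing tag
CANDIDATES = {"div", "td", "article", "section", "body"}  # assumed block containers


class Node:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []
        self.text = 0       # characters of text directly inside this tag
        self.link_text = 0  # the part of `text` that lies inside an <a>


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree and records text and link-text lengths per node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open tag with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        node = self.stack[-1]
        n = len(data.strip())
        if n == 0 or node.tag in ("script", "style"):
            return
        node.text += n
        if any(s.tag == "a" for s in self.stack):
            node.link_text += n


def subtree_stats(node):
    """Returns (text chars, link-text chars, tag count) for the subtree."""
    text, link, tags = node.text, node.link_text, 1
    for child in node.children:
        t, l, k = subtree_stats(child)
        text, link, tags = text + t, link + l, tags + k
    return text, link, tags


def score(node):
    # Stand-in for the paper's text structure weight (an assumption):
    # non-link text per tag, so link-heavy blocks score near zero.
    text, link, tags = subtree_stats(node)
    return (text - link) / tags


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def find_subject_area(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    candidates = [n for n in walk(builder.root) if n.tag in CANDIDATES]
    return max(candidates, key=score, default=None)


SAMPLE = """<html><body>
<div id="nav"><a href="/">Home</a><a href="/news">News</a><a href="/blog">Blog</a></div>
<div id="main"><h1>Title of the story</h1>
<p>First paragraph with a fair amount of running text.</p>
<p>Second paragraph, also mostly plain text with one <a href="#">link</a>.</p></div>
<div id="footer"><a href="/about">About</a> <a href="/contact">Contact</a></div>
</body></html>"""

best = find_subject_area(SAMPLE)
print(best.tag, best.attrs)
```

On the sample page, the navigation and footer blocks contain only link text and therefore score zero, while the enclosing body is penalized for the many tags it contains. A real implementation would replace score with the weight defined in the paper and would detect link blocks structurally rather than by a character ratio.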

YI Zheng, XU Wuping, XU Aiping. Discovery method of webpage subject area based on structural analysis[J]. Computer Engineering and Applications, 2015, 51(6): 227-230.
