Discovery method of webpage subject area based on structural analysis

Computer Engineering and Applications, 2015, Vol. 51, Issue (6): 227-230.

YI Zheng, XU Wuping, XU Aiping

Computer School of Wuhan University, Wuhan 430072, China

Online: 2015-03-15 Published: 2015-03-13

Abstract

As the Internet develops, Web data mining plays an increasingly important role in helping people obtain thematic information. This paper parses a webpage into a tag tree. Building on a tree matching algorithm, it proposes data region mining and semantic link block recognition algorithms, which together implement a link-removal preprocessing step. It then introduces the concept of text structure weight, uses the computed weights to discover the subject area, and obtains the thematic information after denoising. Experiments show that the method recognizes the subject area well on news and blog webpages.

Key words: information extraction, subject area, text structure weight, denoising
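
The abstract does not define text structure weight or give the details of the tree matching step, so the following is only a rough sketch of the general idea: parse a page into a tag tree, discount link text, and pick the block with the strongest non-link text signal. The names `Node`, `TagTreeBuilder`, `toy_weight`, and `find_topic_block`, as well as the scoring formula itself, are assumptions made for illustration and are not the paper's algorithm.

```python
from html.parser import HTMLParser

# Tags that never have content, tags whose text is ignored, and candidate blocks.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input", "area", "base",
             "col", "embed", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}
BLOCK_TAGS = {"div", "td", "article", "section", "body"}


class Node:
    """One element of the tag tree, with subtree text statistics."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text_len = 0  # characters of text outside <a> in this subtree
        self.link_len = 0  # characters of anchor text in this subtree


class TagTreeBuilder(HTMLParser):
    """Builds a simple tag tree and accumulates text lengths bottom-up."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close up to the nearest matching open tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0:
            return
        in_link = False
        node = self.current
        while node is not None:
            if node.tag in SKIP_TAGS:
                return
            if node.tag == "a":
                in_link = True
            node = node.parent
        # Credit the text to every ancestor so each node describes its subtree.
        node = self.current
        while node is not None:
            if in_link:
                node.link_len += n
            else:
                node.text_len += n
            node = node.parent


def toy_weight(node):
    """Hypothetical score: non-link text length scaled by its share of all text."""
    total = node.text_len + node.link_len
    if total == 0:
        return 0.0
    return node.text_len * (node.text_len / total)


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def find_topic_block(html):
    """Return the block-level node with the highest toy weight, or None."""
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    best, best_score = None, 0.0
    for node in iter_nodes(builder.root):
        if node.tag not in BLOCK_TAGS:
            continue
        score = toy_weight(node)
        # Pre-order traversal plus ">=" prefers the deeper block on ties.
        if score > 0 and score >= best_score:
            best, best_score = node, score
    return best
```

Because link-heavy wrappers are penalized by the ratio term, a navigation bar or footer scores near zero, while a block of mostly plain text scores highest. A real system along the lines of the abstract would replace this heuristic with the paper's link block recognition and weight definition.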

YI Zheng, XU Wuping, XU Aiping. Discovery method of webpage subject area based on structural analysis[J]. Computer Engineering and Applications, 2015, 51(6): 227-230.
