Extracting Topic Information of Web Page based on Entropy

Computer Engineering and Applications, 2007, Vol. 43, Issue 4: 164-166.

• Database and Information Processing •

Received: 2005-09-21 Revised: 1900-01-01 Online: 2007-02-01 Published: 2007-02-01

一种基于信息熵的Web页面主题信息抽取方法

HE Zhiping (贺智平), XU Xuezhou (徐学洲), LI Ailing (李爱玲)

Institute of Software Engineering, Xidian University (西安电子科技大学软件工程研究所)

Corresponding author: HE Zhiping (贺智平)

Abstract

Abstract: This paper presents an information extraction method that prunes nodes whose information entropy increase exceeds a threshold. First, a DOM tree is constructed by parsing the HTML document. Next, content that does not need to be processed is filtered out according to a configuration, and an STU tree (a semantic model tree) is built. Finally, the nodes whose entropy increase exceeds the threshold are pruned, and the topic information of the Web page is output. Preliminary experimental results confirm the effectiveness of the method for extracting Web page information. The mathematical model of the method is simple and reliable, so topic extraction can run with essentially no human intervention. The method can be applied to Web data mining systems and to information acquisition on mobile devices such as PDAs.

Key words: Web, extraction, STU-DOM Tree, information entropy

摘要： 提出了一种剪枝信息熵增较大结点的信息抽取方法。通过对HTML文档解析来构造DOM树。根据配置过滤掉不需处理的相关内容并建立语义模型树，最后对熵增超过阈值的结点进行剪枝并输出抽取的主题信息页面。初步实验结果验证了用这种方法进行Web页面信息抽取的有效性。方法的数学模型简单可靠，基本不需要人工干预即可完成主题信息抽取。可应用于Web数据挖掘系统以及PDA等移动设备的信息获取方面。

关键词: Web, 抽取, STU-DOM树, 信息熵

HE Zhiping, XU Xuezhou, LI Ailing. Extracting Topic Information of Web Page based on Entropy [J]. Computer Engineering and Applications, 2007, 43(4): 164-166.
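
The three steps named in the abstract (parse the HTML into a tree, filter out content that needs no processing, prune nodes whose entropy increase exceeds a threshold) can be illustrated with a minimal Python sketch. The abstract gives neither the entropy definition nor the STU tree structure, so everything below is an assumption for illustration only: entropy is taken as Shannon entropy over word frequencies, a node's "entropy increase" is taken as how much its words raise the entropy of the rest of its parent's text, and the names `SKIP_TAGS`, `Node`, `TreeBuilder`, `prune`, and `extract_topic` are hypothetical. It is not the authors' implementation.

```python
import math
from collections import Counter
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}   # assumed filter configuration
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # tags with no end tag


class Node:
    """A simplified tree node: tag name, child nodes, and direct text chunks."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], []

    def own_words(self):
        return [w for chunk in self.text for w in chunk.split()]

    def words(self):
        """All whitespace-separated tokens in this subtree, in document order
        per node (own text first, then children)."""
        out = self.own_words()
        for child in self.children:
            out.extend(child.words())
        return out


class TreeBuilder(HTMLParser):
    """Builds the simplified tree while dropping everything inside SKIP_TAGS.
    Mismatched end tags are ignored, so badly nested HTML may be mis-nested."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth and tag not in VOID_TAGS:
            node = Node(tag, self.cur)
            self.cur.children.append(node)
            self.cur = node

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag == self.cur.tag and self.cur.parent:
            self.cur = self.cur.parent

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.cur.text.append(data)


def entropy(words):
    """Shannon entropy (bits) of the word-frequency distribution."""
    total = len(words)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in Counter(words).values())


def prune(node, threshold):
    """Drop children whose words raise the entropy of the remaining text of
    `node` by more than `threshold` (assumed criterion), then recurse."""
    keep = []
    for child in node.children:
        rest = node.own_words() + [
            w for sib in node.children if sib is not child for w in sib.words()
        ]
        gain = entropy(rest + child.words()) - entropy(rest)
        if not rest or gain <= threshold:
            keep.append(child)
    node.children = keep
    for child in keep:
        prune(child, threshold)


def extract_topic(html, threshold=1.0):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root, threshold)
    return " ".join(builder.root.words())


page = (
    "<body><div><p>entropy entropy entropy entropy web web web web</p></div>"
    "<ul><li>home</li><li>login</li><li>news</li><li>sports</li></ul>"
    "<script>var x = 1;</script></body>"
)
print(extract_topic(page, threshold=1.0))
```

In this toy page the navigation list contributes four unrelated words and raises the entropy of the remaining text by more than one bit, so it is pruned, while the repetitive main paragraph is kept. The threshold of 1.0 is an arbitrary illustrative value, not one taken from the paper.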

