Combing node frequency and semantic feature for webpage informative content extraction

doi:10.3778/j.issn.1002-8331.2009.01.044

Computer Engineering and Applications, 2009, Vol. 45, Issue (1): 140-143.

• Database, Signal and Information Processing •

MENG Jun, LIU Qiu-shui, WANG Xiu-kun

Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116023, China

Received: 2008-07-24 Revised: 2008-10-16 Online: 2009-01-01 Published: 2009-01-01
Contact: MENG Jun

Chinese title: 节点频度和语义距离相结合的网页正文信息抽取 (Webpage main-text extraction combining node frequency and semantic distance)

Abstract

This paper proposes a new model, the BF-DOM tree, which extends the Document Object Model (DOM) tree by adding two properties, block node frequency and relevance, to some nodes. Combining this model with semantic distance, the method extracts the primary content of pages from the same source, based on three observations: noise nodes tend to have a high node frequency within a given website; primary content blocks usually contain few link words and many text words; and useful links sit inside useful content blocks and are semantically close to the page title. Experiments on eight websites show that the method identifies primary content blocks with precision and recall both above 96%, outperforming the entropy-based method. The method can also reduce the storage requirements of search engines, resulting in smaller indexes, faster searches, and better user satisfaction.

Key words: information extraction, Block node Frequency-Document Object Model (BF-DOM) tree, node frequency, semantic distance

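Abstract (translated from the Chinese): An extended DOM tree model with node frequency, the BF-DOM tree (Block node Frequency-Document Object Model), is proposed, and the main text of webpages is extracted on the basis of this model. The method builds the new model by adding frequency and relevance attributes to certain nodes of the DOM tree, then combines it with semantic distance to extract the main text. It rests on three considerations: within a set of pages from the same source, noise nodes have very high frequency values; main text generally consists of non-link text; and links related to the main text are semantically close to the article title. Experiments on 8 websites show that the method extracts main text effectively, with recall and precision both above 96%, better than the extraction method based on information entropy.

Key words (translated from the Chinese): information extraction, document object model tree with node frequency, node frequency, semantic distance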
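
The three observations in the abstract can be illustrated with a small sketch. The abstract does not give the paper's formulas, so everything below is an assumption made for illustration: the `Block` structure, the thresholds, and in particular `title_overlap`, which uses plain Jaccard word overlap as a crude stand-in for the paper's semantic distance. It is not the authors' BF-DOM implementation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One candidate content block of a page (hypothetical structure)."""
    text: str       # all visible text in the block
    link_text: str  # the part of that text that sits inside links


def block_frequency(pages):
    """Fraction of same-site pages in which each block text occurs."""
    counts = Counter()
    for blocks in pages:
        # Count a block at most once per page.
        counts.update({b.text for b in blocks})
    return {text: n / len(pages) for text, n in counts.items()}


def link_ratio(block):
    """Share of the block's words that are link words."""
    total = len(block.text.split())
    return len(block.link_text.split()) / total if total else 1.0


def title_overlap(text, title):
    """Jaccard word overlap: an assumed stand-in for semantic closeness."""
    a, b = set(text.lower().split()), set(title.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def primary_blocks(blocks, freq, title,
                   max_freq=0.5, max_link_ratio=0.5, min_overlap=0.2):
    """Keep blocks that pass the three heuristics (thresholds are assumed)."""
    kept = []
    for b in blocks:
        if freq.get(b.text, 0.0) > max_freq:
            continue  # repeated across the site: treated as noise
        if (link_ratio(b) > max_link_ratio
                and title_overlap(b.link_text, title) < min_overlap):
            continue  # link-heavy and unrelated to the page title
        kept.append(b)
    return kept


# Two toy pages from the same site sharing a navigation bar.
nav = Block("Home News Sports Contact", "Home News Sports Contact")
page1 = [nav, Block("The city council approved the new budget on Monday", "")]
page2 = [nav, Block("Local team wins the regional final in extra time", "")]

freq = block_frequency([page1, page2])
result = primary_blocks(page1, freq, title="City council approves budget")
```

In this toy input the navigation bar appears on every page, so its frequency exceeds the assumed threshold and it is dropped, while the article block is kept. A real system would derive blocks from the DOM tree and attach the frequency to tree nodes, as the abstract describes.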

MENG Jun, LIU Qiu-shui, WANG Xiu-kun. Combing node frequency and semantic feature for webpage informative content extraction[J]. Computer Engineering and Applications, 2009, 45(1): 140-143.
