Computer Engineering and Applications, 2014, Vol. 50, Issue 19: 199-204.

Semantics-based extraction algorithm for forest products trade Web information

LI Jia, XU Qian, WANG Zi, CHEN Zhao   

  1. School of Information Science and Technology, Beijing Forestry University, Beijing 100083, China
  • Online: 2014-10-01 Published: 2014-09-29


李  嘉,徐  前,王  梓,陈  钊   

  1. 北京林业大学 信息学院,北京 100083

Abstract: Existing Web information extraction techniques suffer from low accuracy, a low degree of automation, and weak generality. To address these shortcomings, and to meet the need for structured storage of information sources in forest products trade Web information push, a new semantics-based algorithm for extracting forest products trade Web information is proposed. The paper analyzes and exploits the characteristics of forest products trade Web information and, drawing on the basic principles of semantic recognition, constructs a forest products trade semantic dictionary. Using the layout features of the target information within Web pages together with information entropy theory, it proposes a method based on semantic information entropy that automatically locates and extracts the target information, which is then stored in a database in structured form. Experiments on real forest products trade Web pages show that the algorithm reduces manual intervention and has good practical value for processing information sources in forest products trade information push.

Key words: Web information extraction, forest products trade semantic dictionary, semantic information entropy, template, target information location
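
The abstract names the locating step but does not give a formula for semantic information entropy. The following is a minimal Python sketch of one plausible reading, not the authors' method. It assumes that each candidate page block is scored by the Shannon entropy of its semantic-dictionary hits across categories, and that the block covering the most categories most evenly is taken as the target. The dictionary entries, the category names, and the functions `category_counts`, `semantic_entropy` and `locate_target_block` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical mini semantic dictionary: term -> semantic category.
# The real dictionary built in the paper is not given in the abstract.
SEMANTIC_DICT = {
    "plywood": "product",
    "timber": "product",
    "price": "price",
    "usd": "price",
    "tons": "quantity",
    "m3": "quantity",
    "contact": "contact",
    "tel": "contact",
}


def category_counts(text):
    """Count dictionary hits per semantic category in one text block."""
    tokens = text.lower().replace(":", " ").replace(",", " ").split()
    return Counter(SEMANTIC_DICT[t] for t in tokens if t in SEMANTIC_DICT)


def semantic_entropy(text):
    """Shannon entropy (in bits) of the category distribution of a block.

    Assumption: a trade message mentions several kinds of fields
    (product, price, quantity, contact), so its hits spread over many
    categories and the entropy is high; navigation bars and footers
    hit one category or none and score zero.
    """
    counts = category_counts(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def locate_target_block(blocks):
    """Return the index of the highest-entropy block, or None if no
    block has a positive score."""
    scores = [semantic_entropy(b) for b in blocks]
    best = max(range(len(blocks)), key=lambda i: scores[i], default=None)
    if best is None or scores[best] == 0.0:
        return None
    return best


if __name__ == "__main__":
    page_blocks = [
        "Home News About Contact",
        "Supply plywood, price: 320 USD, 50 tons, contact tel 12345",
        "Copyright 2014",
    ]
    print(locate_target_block(page_blocks))
```

In this sketch the page is already split into text blocks. A fuller version would derive the blocks from the page layout and write the extracted fields to a database, as the abstract describes.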
