Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (31): 115-119.

• Database, Signal and Information Processing •

Research on the DON Model for Automatic Keyword Extraction from Web Text

PENG Hao1, CAI Meiling1,2, WANG Ruilong3, YU Bingrui1

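  1. College of Computer Science and Technology, Hunan International Economics University, Changsha 410205
  2. School of Information Science and Engineering, Central South University, Changsha 410083
  3. Henan Xinyang Power Supply Company, Xinyang, Henan 464000
  • Publication date: 2012-11-01  Release date: 2012-10-30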

Document object network model for extracting keywords from Web pages

PENG Hao1, CAI Meiling1,2, WANG Ruilong3, YU Bingrui1   

  1. College of Computer Science and Technology, Hunan International Economics University, Changsha 410205, China
  2. School of Information Science and Engineering, Central South University, Changsha 410083, China
  3. Henan Province Xinyang Electric Power, Xinyang, Henan 464000, China
  • Online: 2012-11-01  Published: 2012-10-30

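Abstract: Web pages often contain a great deal of topic noise, which makes accurate automatic keyword extraction technically difficult. This paper proposes a document object network model, DON. It defines the centrality of an object node and a centrality-based propagation rule for impact factors, and uses them to automatically cluster the topic societies in DON, which improves the model's resistance to noise. A DON-based algorithm for automatic keyword extraction from Web pages, KEYDON (Keywords Extraction Algorithm Based on DON), is also proposed. Experimental results show that, compared with the corresponding algorithm based on the DocView model, KEYDON improves accuracy by nearly 20%, which indicates that the DON model is effective at suppressing topic noise.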

Keywords: document object network, DON, centrality, impact factor, automatic keyword extraction, Web page

Abstract: Extracting keywords accurately from hub Web pages is difficult because of their topic noise. A Document Object Network (DON) model and a keyword extraction algorithm based on it (KEYDON) are proposed. The DON model clusters topic societies using the betweenness centrality and impact fraction of its nodes. Experiments show that the DON-based keyword extraction algorithm improves accuracy by nearly 20% compared with the algorithm based on the DocView model.

Key words: document object network, DON, betweenness centrality, impact fraction, keyword extraction, Web page
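
The abstract names betweenness centrality as one ingredient of DON but does not give the paper's own centrality definition or its impact-factor propagation rule. As a rough illustration of the general idea only, the following sketch computes standard (Brandes) betweenness centrality on a toy, undirected graph of page objects. The node names (`title`, `para1`, `para2`, `nav`, `ad`) and the edges between them are hypothetical, and this is not KEYDON itself.

```python
from collections import deque


def betweenness_centrality(graph):
    """Brandes' algorithm for an unweighted, undirected graph.

    graph: dict mapping each node to a list of its neighbours.
    Returns a dict of unnormalized betweenness scores.
    """
    score = {v: 0.0 for v in graph}
    for s in graph:
        # Breadth-first search from s, counting shortest paths (sigma).
        order = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}
        dist = {v: -1 for v in graph}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:  # first visit to w
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate pair dependencies in reverse BFS order.
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair was counted once from each endpoint.
    return {v: c / 2.0 for v, c in score.items()}


# Hypothetical page: a title linked to two body paragraphs and a navigation
# block, with an advertisement hanging off the navigation block.
page = {
    "title": ["para1", "para2", "nav"],
    "para1": ["title", "para2"],
    "para2": ["title", "para1"],
    "nav": ["title", "ad"],
    "ad": ["nav"],
}

scores = betweenness_centrality(page)
for node, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, value)
```

In this toy graph the `title` node lies on the most shortest paths and so receives the highest score. How DON turns such scores into impact factors and topic societies is described in the full paper, not in this abstract.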