Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (19): 222-226.

Keyword extraction from Chinese news Web pages based on multi-features

YUAN Jinsheng, MAO Xinwu   

  1. School of Information, Beijing Forestry University, Beijing 100083, China
  • Online: 2014-10-01 Published: 2014-09-29

Keyword extraction method for Chinese news Web pages based on combined features

袁津生,毛新武   

  1. School of Information, Beijing Forestry University, Beijing 100083

Abstract: Considering the characteristics of Chinese news Web pages, this paper combines several features, including statistical, position, and part-of-speech (POS) features, to evaluate the weight of candidate keywords. To address the problem that some word segmentation results do not reflect the theme well, it proposes a compound-word generation method based on a directed graph, which aims to find high-frequency adjacent words and treat them as compound words. Experimental results show that the method is considerably more efficient than the conventional TF-IDF method and can effectively extract keywords from news Web pages.

Key words: keyword extraction, multi-features, compound words, directed graph, news Web page

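Abstract (translated from Chinese): In view of the characteristics of Chinese news Web pages, multiple features, including statistical, position, and part-of-speech features, are used together to determine the weight of candidate keywords. For the problem that some word segmentation results do not reflect the theme well, a compound-word generation method based on a directed graph is proposed, which aims to find high-frequency adjacent words to serve as compound words. Experimental results show that the method considerably improves efficiency over the traditional TF-IDF method and can effectively extract keywords from news Web pages.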

Key words (translated from Chinese): keyword extraction, combined features, compound words, directed graph, news Web pages
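
The abstract does not give the details of the compound-word step, so the following is only a minimal sketch of one way it might work, not the authors' implementation. It assumes the text has already been segmented into tokens, models each adjacent pair of words as a directed edge weighted by how often the pair occurs, and keeps pairs whose count reaches a threshold. The function name `compound_candidates`, the `min_count` threshold, and the sample sentences are all hypothetical.

```python
from collections import Counter


def compound_candidates(sentences, min_count=2):
    """Return adjacent word pairs seen at least min_count times.

    sentences: list of token lists (assumed to be segmented already).
    The directed graph is stored as a Counter keyed by (left, right)
    edges; the value is the number of times left directly precedes right.
    """
    edges = Counter()
    for tokens in sentences:
        for left, right in zip(tokens, tokens[1:]):
            edges[(left, right)] += 1  # edge left -> right
    # Join the two words of each frequent edge into one candidate string.
    return {
        left + right: count
        for (left, right), count in edges.items()
        if count >= min_count
    }


# Hypothetical segmented sentences, for illustration only.
sentences = [
    ["北京", "林业", "大学", "发布", "报告"],
    ["林业", "大学", "招生"],
    ["大学", "招生", "报告"],
]
print(compound_candidates(sentences))
```

This sketch only joins pairs of words; longer compound words would need paths of more than one edge, and the resulting candidates would still have to be weighted by the statistical, position, and part-of-speech features mentioned in the abstract.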