Computer Engineering and Applications, 2020, Vol. 56, Issue (5): 179-185. DOI: 10.3778/j.issn.1002-8331.1811-0304


Keyphrase Extraction Algorithm Integrating Word Embeddings and Position Information

FAN Wei, LIU Huan, ZHANG Yuxiang   

  1. School of Computer Science and Technology, Civil Aviation University of China, Tianjin 300300, China
  • Online: 2020-03-01 Published: 2020-03-06

融合词向量与位置信息的关键词提取算法

樊玮,刘欢,张宇翔   

  1. 中国民航大学 计算机科学与技术学院,天津 300300

Abstract:

To address the problem that existing graph-based keyphrase extraction methods fail to effectively integrate the latent semantic relationships among words in a text sequence, this paper proposes EPRank, a graph-based keyphrase extraction algorithm that integrates word embeddings and position information. First, a word embedding model learns a representation vector for each word in the target document. Second, these word embeddings, which reflect the latent semantic relationships among words, are combined with position features and incorporated into the PageRank scoring model. Finally, a few top-ranked words or phrases are selected as keyphrases for the target document. Experimental results show that EPRank scores higher on every evaluation metric than five existing keyphrase extraction methods on the KDD and SIGIR datasets.

Key words: keyphrase extraction, word embedding, position information, PageRank algorithm
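
The three steps in the abstract can be illustrated with a small sketch. The abstract does not give EPRank's formulas, so everything specific here is an assumption made for illustration: edge weights are taken as the non-negative cosine similarity of word embeddings summed over co-occurrences in a sliding window, and the position information enters as a restart distribution proportional to the sum of inverse token positions. The names `cosine`, `build_graph`, `position_prior`, and `biased_pagerank` are hypothetical, and the embeddings are assumed to be supplied by some pre-trained model. This is one plausible reading of "embeddings plus position in PageRank", not the authors' method.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_graph(tokens, embeddings, window=3):
    """Assumed edge weighting: for every pair of distinct words that fall
    inside the same span of `window` consecutive tokens, add their
    non-negative embedding similarity to a symmetric edge weight."""
    weights = defaultdict(float)
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            v = tokens[j]
            if v == w:
                continue
            sim = max(cosine(embeddings[w], embeddings[v]), 0.0)
            weights[(w, v)] += sim
            weights[(v, w)] += sim
    return weights


def position_prior(tokens):
    """Assumed position feature: each word gets the sum of 1/position over
    its occurrences (1-based), normalised to a probability distribution."""
    raw = defaultdict(float)
    for pos, w in enumerate(tokens, start=1):
        raw[w] += 1.0 / pos
    total = sum(raw.values())
    return {w: s / total for w, s in raw.items()}


def biased_pagerank(tokens, embeddings, damping=0.85, window=3,
                    iterations=100, tol=1e-8):
    """PageRank with a position-biased restart vector on the
    embedding-weighted word graph. Returns (word, score) pairs sorted by
    descending score. Words with no outgoing weight simply pass on no
    score, so the scores need not sum to exactly 1."""
    weights = build_graph(tokens, embeddings, window)
    prior = position_prior(tokens)
    words = list(prior)

    out_sum = defaultdict(float)
    incoming = defaultdict(list)
    for (u, v), w in weights.items():
        out_sum[u] += w
        incoming[v].append((u, w))

    scores = dict(prior)
    for _ in range(iterations):
        new_scores = {}
        for v in words:
            rank = sum(scores[u] * w / out_sum[u]
                       for u, w in incoming[v] if out_sum[u] > 0)
            new_scores[v] = (1 - damping) * prior[v] + damping * rank
        delta = sum(abs(new_scores[w] - scores[w]) for w in words)
        scores = new_scores
        if delta < tol:
            break
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `biased_pagerank(tokens, embeddings)` with a list of candidate tokens and a dict mapping each token to its vector returns the ranked words; taking the first few entries corresponds to the final selection step. Forming multi-word phrases from adjacent high-scoring words is left out of this sketch.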
