Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (12): 152-157.DOI: 10.3778/j.issn.1002-8331.1606-0088

Combining lexical features and LDA for semantic relatedness measure

XIAO Bao1, LI Pu2,3, JIANG Yuncheng2   

  1. School of Electronics and Information Engineering, Qinzhou University, Qinzhou, Guangxi 535011, China
    2.School of Computer Science, South China Normal University, Guangzhou 510631, China
    3.Software Engineering College, Zhengzhou University of Light Industry, Zhengzhou 450000, China
  • Online: 2017-06-15   Published: 2017-07-04

Abstract: Computing semantic relatedness between text documents is a key problem in many domains, such as Natural Language Processing (NLP) and Semantic Information Retrieval (SIR). Wikipedia-based ESA (Explicit Semantic Analysis) has received wide attention and has been widely applied in these areas, mainly because of its simplicity and effectiveness. However, computing semantic relatedness with ESA is inefficient owing to its redundant concepts and high dimensionality, and it also ignores the topics to which the documents belong. This paper presents a new technique based on LDA (Latent Dirichlet Allocation) and JSD (Jensen-Shannon Divergence) to compute semantic relatedness between text documents. LDA is employed to reduce dimensionality and improve efficiency: the high-dimensional document matrix is converted into topic-model probability vectors. JSD, instead of the cosine distance, is then used to measure semantic relatedness between documents. The proposed technique is evaluated against other methods on several benchmark datasets. The experimental results show that it improves the Pearson correlation coefficient by more than 3% and 9% over ESA and LDA, respectively.
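The core comparison step described above can be sketched in a few lines: once LDA has mapped each document to a topic-probability vector, relatedness is scored with Jensen-Shannon divergence rather than cosine distance. The following is a minimal illustration, not the paper's implementation; the topic vectors are hypothetical stand-ins for LDA output.

```python
import numpy as np

def jensen_shannon_divergence(p, q):
    """JSD between two discrete distributions (base 2, so the value lies in [0, 1])."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # the mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability entries of a
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def relatedness(p, q):
    """Turn the divergence into a similarity score: 1 - JSD."""
    return 1.0 - jensen_shannon_divergence(p, q)

# Hypothetical LDA topic-probability vectors for three documents
doc_a = [0.70, 0.20, 0.10]
doc_b = [0.65, 0.25, 0.10]  # topically close to doc_a
doc_c = [0.05, 0.15, 0.80]  # topically distant from doc_a

print(relatedness(doc_a, doc_b) > relatedness(doc_a, doc_c))
```

Because JSD is symmetric and bounded (unlike raw KL divergence), `1 - JSD` gives a well-behaved similarity in [0, 1], which is one reason it is a reasonable substitute for cosine similarity on probability vectors.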

Key words: topic model, lexical features, Explicit Semantic Analysis (ESA), Latent Dirichlet Allocation (LDA), semantic relatedness measure
