Computer Engineering and Applications ›› 2017, Vol. 53 ›› Issue (12): 152-157.DOI: 10.3778/j.issn.1002-8331.1606-0088

Combining lexical features and LDA for semantic relatedness measure

XIAO Bao1, LI Pu2,3, JIANG Yuncheng2   

  1. School of Electronics and Information Engineering, Qinzhou University, Qinzhou, Guangxi 535011, China
    2.School of Computer Science, South China Normal University, Guangzhou 510631, China
    3.Software Engineering College, Zhengzhou University of Light Industry, Zhengzhou 450000, China
  • Online: 2017-06-15   Published: 2017-07-04

Abstract: Computing semantic relatedness between text documents is a key problem in many domains, such as Natural Language Processing (NLP) and Semantic Information Retrieval (SIR). Wikipedia-based ESA (Explicit Semantic Analysis) has received wide attention and has been widely applied in these areas, mainly because of its simplicity and effectiveness. However, computing semantic relatedness with ESA is inefficient owing to its redundant concepts and high dimensionality, and it also ignores the topics to which the documents belong. This paper presents a new technique based on LDA (Latent Dirichlet Allocation) and JSD (Jensen-Shannon Divergence) to compute semantic relatedness between text documents. LDA is employed to reduce dimensionality and improve efficiency: the high-dimensional document matrix is converted into topic-model probability vectors. JSD, instead of the cosine distance, is then used to measure semantic relatedness between documents. The proposed technique is evaluated against other methods on several benchmark datasets. The experimental results show that it improves the Pearson correlation coefficient by more than 3% and 9% over ESA and LDA, respectively.
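The core comparison step described above can be sketched in a few lines: once LDA has mapped each document to a topic-probability vector, relatedness is scored with Jensen-Shannon divergence rather than cosine distance. The following is a minimal illustration, not the paper's implementation; the topic vectors are hypothetical stand-ins for LDA output.

```python
import numpy as np

def jensen_shannon_divergence(p, q):
    """JSD between two discrete distributions (base 2, so the value lies in [0, 1])."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    m = 0.5 * (p + q)  # the mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability entries of a
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def relatedness(p, q):
    """Turn the divergence into a similarity score: 1 - JSD."""
    return 1.0 - jensen_shannon_divergence(p, q)

# Hypothetical LDA topic-probability vectors for three documents
doc_a = [0.70, 0.20, 0.10]
doc_b = [0.65, 0.25, 0.10]  # topically close to doc_a
doc_c = [0.05, 0.15, 0.80]  # topically distant from doc_a

print(relatedness(doc_a, doc_b) > relatedness(doc_a, doc_c))
```

Because JSD is symmetric and bounded (unlike raw KL divergence), `1 - JSD` gives a well-behaved similarity in [0, 1], which is one reason it is a reasonable substitute for cosine similarity on probability vectors.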

Key words: topic model, lexical features, Explicit Semantic Analysis (ESA), Latent Dirichlet Allocation (LDA), semantic relatedness measure
