Text Semantic Enhancement Method Combining LDA and Word2vec

Computer Engineering and Applications, 2022, Vol. 58, Issue (13): 135-145. DOI: 10.3778/j.issn.1002-8331.2112-0491

TANG Huanling, WEI Hongmin, WANG Yulin, ZHU Hui, DOU Quansheng

1. School of Computer Science and Technology, Shandong Technology and Business University, Yantai, Shandong 264005, China
2. Co-innovation Center of Shandong Colleges and Universities: Future Intelligent Computing, Yantai, Shandong 264005, China
3. Key Laboratory of Intelligent Information Processing in Universities of Shandong (Shandong Technology and Business University), Yantai, Shandong 264005, China
4. School of Information and Electronic Engineering, Shandong Technology and Business University, Yantai, Shandong 264005, China
5. Shanghai Conversation Intelligence Co. Ltd., Shanghai 200120, China

Online: 2022-07-01 Published: 2022-07-01

Abstract

Abstract: Semantic representation of text is one of the most difficult problems in natural language processing and machine learning. To address the semantic loss in current text representations, this paper proposes Sem2vec (semantic to vector), a new text semantic enhancement model based on the LDA topic model and the Word2vec model. The model uses LDA to obtain the topic distribution of each word, computes the topic similarity between a word and its context words, and incorporates this similarity into the word vectors as topic semantic information. The resulting topic semantic word vectors are fed into the Sem2vec model in place of one-hot vectors. The parameters of the Sem2vec model are then optimized by maximizing a log-likelihood objective function. Finally, the model outputs enhanced semantic word vectors, from which the semantic representation of a text is derived. Experimental results on different datasets show that, compared with other classic models, Sem2vec computes the semantic similarity between words more accurately. Moreover, across several text classification algorithms, the text semantic vectors produced by Sem2vec improve classification results by 0.58% to 3.5% over other classic models, while also improving time performance.

Key words: LDA topic model, Word2vec model, semantic word vector, semantic similarity, text categorization
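
The input construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the formulas, so the cosine similarity measure, the uniform topic prior, the CBOW-style hidden layer, and all names below (`word_topic_distributions`, `topic_similarity`, `topic_weighted_input`, the toy matrix `phi`) are assumptions made for illustration only.

```python
import numpy as np

def word_topic_distributions(phi, topic_prior=None):
    """Turn an LDA topic-word matrix phi (K x V, rows sum to 1) into
    per-word topic distributions p(z | w) (V x K) via Bayes' rule."""
    K, V = phi.shape
    if topic_prior is None:
        topic_prior = np.full(K, 1.0 / K)  # assumption: uniform p(z)
    joint = phi * topic_prior[:, None]     # p(w | z) * p(z), shape K x V
    return (joint / joint.sum(axis=0, keepdims=True)).T

def topic_similarity(p, q):
    """Cosine similarity between two topic distributions (assumed measure)."""
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

def topic_weighted_input(center, context, word_topics):
    """Vocabulary-sized input vector: each context word position holds its
    topic similarity to the center word instead of the 1 of a one-hot vector."""
    x = np.zeros(word_topics.shape[0])
    for c in context:
        x[c] = topic_similarity(word_topics[center], word_topics[c])
    return x

# Toy LDA output: 2 topics over a 4-word vocabulary (made-up numbers).
phi = np.array([[0.50, 0.40, 0.05, 0.05],
                [0.05, 0.05, 0.40, 0.50]])
word_topics = word_topic_distributions(phi)

# Word 0 is the center word; words 1 and 3 are its context words.
x = topic_weighted_input(center=0, context=[1, 3], word_topics=word_topics)

# Hidden layer of a CBOW-style network (an assumption about the architecture):
# a similarity-weighted sum of the context words' embedding rows.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))  # hypothetical V x d embedding matrix
hidden = x @ W
```

In this toy setup, word 1 shares the dominant topic of word 0 and therefore receives a larger input weight than word 3, so it contributes more to the hidden-layer vector. In practice `phi` would come from a trained LDA model, and `W` would be learned by maximizing the log-likelihood objective.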

TANG Huanling, WEI Hongmin, WANG Yulin, ZHU Hui, DOU Quansheng. Text Semantic Enhancement Method Combining LDA and Word2vec[J]. Computer Engineering and Applications, 2022, 58(13): 135-145.

