Review of Keyphrase Generation Based on Deep Learning

doi:10.3778/j.issn.1002-8331.2111-0580

Abstract

Abstract: Keyphrase generation is a classic yet challenging task in natural language processing: automatically producing a set of representative and distinctive phrases for a document. Deep-learning-based sequence-to-sequence (Seq2Seq) models have achieved remarkable results on this task and remedy a serious shortcoming of earlier keyphrase extraction methods, namely that they cannot produce keyphrases that do not appear in the source text. Because its results are closer to what is needed in practice, keyphrase generation has gradually overtaken the earlier extraction methods and become the mainstream approach to the keyphrase extraction task. This article first reviews the development of keyphrase extraction and the main datasets for keyphrase generation, then classifies the keyphrase generation methods whose basic design is a sequence-to-sequence model and analyzes their principles, advantages, and disadvantages. Finally, it summarizes the evaluation methods for keyphrase generation and discusses future research priorities.

Key words: keyphrase generation, deep neural network, Seq2Seq, attention mechanism

YU Qiang, LIN Min, LI Yanling. Review of Keyphrase Generation Based on Deep Learning[J]. Computer Engineering and Applications, 2022, 58(14): 27-39.
