Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (2): 12-21. DOI: 10.3778/j.issn.1002-8331.2209-0025
• Research Hotspots and Reviews •
LIN Lingde, LIU Na, WANG Zheng'an
Online: 2023-01-15
Published: 2023-01-15
LIN Lingde, LIU Na, WANG Zheng'an. Review of Research on Adapter and Prompt Tuning[J]. Computer Engineering and Applications, 2023, 59(2): 12-21.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2209-0025
[1] KALCHBRENNER N, GREFENSTETTE E, BLUNSOM P. A convolutional neural network for modelling sentences[C]//Annual Meeting of the Association for Computational Linguistics, 2014: 1-11.
[2] KIM Y. Convolutional neural networks for sentence classification[C]//Conference on Empirical Methods in Natural Language Processing, 2014: 1746-1751.
[3] GEHRING J, AULI M, GRANGIER D, et al. Convolutional sequence to sequence learning[C]//International Conference on Machine Learning, 2017: 1243-1252.
[4] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]//Conference and Workshop on Neural Information Processing Systems, 2014: 3104-3112.
[5] LIU P F, QIU X P, HUANG X J. Recurrent neural network for text classification with multi-task learning[C]//International Joint Conferences on Artificial Intelligence, 2016: 1-7.
[6] SOCHER R, PERELYGIN A, WU J Y, et al. Recursive deep models for semantic compositionality over a sentiment treebank[C]//2013 Conference on Empirical Methods in Natural Language Processing, 2013: 1-12.
[7] TAI K S, SOCHER R, MANNING C D. Improved semantic representations from tree-structured long short-term memory networks[C]//International Joint Conference on Natural Language Processing, 2015: 1-11.
[8] MARCHEGGIANI D, BASTINGS J, TITOV I. Exploiting semantics in neural machine translation with graph convolutional networks[C]//Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018: 486-492.
[9] QIU X, SUN T, XU Y, et al. Pre-trained models for natural language processing: a survey[J]. Science China Technological Sciences, 2020, 63(10): 1872-1897.
[10] PETERS M, NEUMANN M, IYYER M, et al. Deep contextualized word representations[C]//Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018: 1-15.
[11] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training[EB/OL]. [2020-09-26]. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
[12] DEVLIN J, CHANG M, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[J]. arXiv:1810.04805, 2018.
[13] YANG Z, DAI Z, YANG Y, et al. XLNet: generalized autoregressive pretraining for language understanding[C]//Proceedings of the 32nd Annual Conference on Neural Information Processing Systems, Vancouver, Dec 8-14, 2019: 5754-5764.
[14] CLARK K, LUONG M, LE Q V, et al. ELECTRA: pre-training text encoders as discriminators rather than generators[J]. arXiv:2003.10555, 2020.
[15] LAN Z, CHEN M, GOODMAN S, et al. ALBERT: a lite BERT for self-supervised learning of language representations[J]. arXiv:1909.11942, 2019.
[16] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems, 2017: 5998-6008.
[17] SMITH S, PATWARY M, NORICK B, et al. Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model[J]. arXiv:2201.11990, 2022.
[18] SUN Y, WANG S, FENG S, et al. ERNIE 3.0: large-scale knowledge enhanced pre-training for language understanding and generation[J]. arXiv:2107.02137, 2021.
[19] WU S, ZHAO X, YU T, et al. Yuan 1.0: large-scale pre-trained language model in zero-shot and few-shot learning[J]. arXiv:2110.04725, 2021.
[20] YUE Z, ZHANG H, SUN Q, et al. Interventional few-shot learning[C]//Advances in Neural Information Processing Systems, 2020: 2734-2746.
[21] HAN X, ZHANG Z, DING N, et al. Pre-trained models: past, present and future[J]. AI Open, 2021: 225-250.
[22] LIU P, YUAN W, FU J, et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing[J]. arXiv:2107.13586, 2021.
[23] HOULSBY N, GIURGIU A, JASTRZEBSKI S, et al. Parameter-efficient transfer learning for NLP[C]//International Conference on Machine Learning, 2019: 2790-2799.
[24] MCCLOSKEY M, COHEN N J. Catastrophic interference in connectionist networks: the sequential learning problem[M]//Psychology of learning and motivation. [S.l.]: Academic Press, 1989: 109-165.
[25] FRENCH R M. Catastrophic forgetting in connectionist networks[J]. Trends in Cognitive Sciences, 1999, 3(4): 128-135.
[26] PFEIFFER J, KAMATH A, RÜCKLÉ A, et al. AdapterFusion: non-destructive task composition for transfer learning[J]. arXiv:2005.00247, 2020.
[27] WANG A, SINGH A, MICHAEL J, et al. GLUE: a multi-task benchmark and analysis platform for natural language understanding[J]. arXiv:1804.07461, 2018.
[28] SOCHER R, PERELYGIN A, WU J, et al. Recursive deep models for semantic compositionality over a sentiment treebank[C]//Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, 2013: 1631-1642.
[29] HUANG J, TANG D, SHOU L, et al. CoSQA: 20,000+ web queries for code search and question answering[J]. arXiv:2105.13239, 2021.
[30] RÜCKLÉ A, GEIGLE G, GLOCKNER M, et al. AdapterDrop: on the efficiency of adapters in transformers[J]. arXiv:2010.11918, 2020.
[31] BAPNA A, ARIVAZHAGAN N, FIRAT O. Simple, scalable adaptation for neural machine translation[J]. arXiv:1909.08478, 2019.
[32] WANG R, TANG D, DUAN N, et al. K-Adapter: infusing knowledge into pre-trained models with adapters[J]. arXiv:2002.01808, 2020.
[33] PFEIFFER J, VULIĆ I, GUREVYCH I, et al. MAD-X: an adapter-based framework for multi-task cross-lingual transfer[J]. arXiv:2005.00052, 2020.
[34] PHILIP J, BERARD A, GALLÉ M, et al. Monolingual adapters for zero-shot neural machine translation[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020: 4465-4470.
[35] STICKLAND A C, MURRAY I. BERT and PALs: projected attention layers for efficient adaptation in multi-task learning[C]//International Conference on Machine Learning, 2019: 5986-5995.
[36] LAUSCHER A, MAJEWSKA O, RIBEIRO L F R, et al. Common sense or world knowledge? Investigating adapter-based knowledge injection into pretrained transformers[J]. arXiv:2005.11787, 2020.
[37] ÜSTÜN A, BISAZZA A, BOUMA G, et al. UDapter: language adaptation for truly universal dependency parsing[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020: 2302-2315.
[38] VIDONI M, VULIĆ I, GLAVAŠ G. Orthogonal language and task adapters in zero-shot cross-lingual transfer[J]. arXiv:2012.06460, 2020.
[39] MAHABADI R K, RUDER S, DEHGHANI M, et al. Parameter-efficient multi-task fine-tuning for transformers via shared hypernetworks[J]. arXiv:2106.04489, 2021.
[40] PETRONI F, ROCKTÄSCHEL T, RIEDEL S, et al. Language models as knowledge bases?[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, 2019: 2463-2473.
[41] BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners[C]//Advances in Neural Information Processing Systems, 2020: 1877-1901.
[42] SCHICK T, SCHÜTZE H. Exploiting cloze questions for few shot text classification and natural language inference[J]. arXiv:2001.07676, 2020.
[43] GAGE P. A new algorithm for data compression[J]. C Users Journal, 1994, 12(2): 23-38.
[44] SCHICK T, SCHÜTZE H. It's not just size that matters: small language models are also few-shot learners[J]. arXiv:2009.07118, 2020.
[45] SHIN T, RAZEGHI Y, LOGAN IV R L, et al. AutoPrompt: eliciting knowledge from language models with automatically generated prompts[J]. arXiv:2010.15980, 2020.
[46] GAO T, FISCH A, CHEN D. Making pre-trained language models better few-shot learners[J]. arXiv:2012.15723, 2020.
[47] RAFFEL C, SHAZEER N, ROBERTS A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. Journal of Machine Learning Research, 2020: 1-140.
[48] SCHICK T, SCHMID H, SCHÜTZE H. Automatically identifying words that can serve as labels for few-shot text classification[J]. arXiv:2010.13641, 2020.
[49] SUN Y, ZHENG Y, HAO C, et al. NSP-BERT: a prompt-based zero-shot learner through an original pre-training task--next sentence prediction[J]. arXiv:2109.03564, 2021.
[50] LIU Y, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[J]. arXiv:1907.11692, 2019.
[51] YUAN W, NEUBIG G, LIU P. BARTScore: evaluating generated text as text generation[C]//Advances in Neural Information Processing Systems, 2021: 27263-27277.
[52] HAVIV A, BERANT J, GLOBERSON A. BERTese: learning to speak to BERT[J]. arXiv:2103.05327, 2021.
[53] LI X L, LIANG P. Prefix-tuning: optimizing continuous prompts for generation[J]. arXiv:2101.00190, 2021.
[54] LIU X, ZHENG Y, DU Z, et al. GPT understands, too[J]. arXiv:2103.10385, 2021.
[55] ALLEN-ZHU Z, LI Y, SONG Z. A convergence theory for deep learning via over-parameterization[C]//International Conference on Machine Learning, 2019: 242-252.
[56] ZHANG N, LI L, CHEN X, et al. Differentiable prompt makes pre-trained language models better few-shot learners[J]. arXiv:2108.13161, 2021.
[57] LESTER B, AL-RFOU R, CONSTANT N. The power of scale for parameter-efficient prompt tuning[C]//2021 Conference on Empirical Methods in Natural Language Processing, 2021: 3045-3059.
[58] HAMBARDZUMYAN K, KHACHATRIAN H, MAY J. WARP: word-level adversarial reprogramming[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021: 4921-4933.
[59] ELSAYED G F, GOODFELLOW I, SOHL-DICKSTEIN J. Adversarial reprogramming of neural networks[J]. arXiv:1806.11146, 2018.
[60] LIU X, JI K, FU Y, et al. P-Tuning v2: prompt tuning can be comparable to fine-tuning universally across scales and tasks[J]. arXiv:2110.07602, 2021.