Computer Engineering and Applications, 2024, Vol. 60, Issue (17): 17-33. DOI: 10.3778/j.issn.1002-8331.2312-0035

• Research Hotspots and Reviews •

Comprehensive Review of Large Language Model Fine-Tuning

ZHANG Qintong, WANG Yuchao, WANG Hexi, WANG Junxin, CHEN Hai   

  1. School of Arts and Sciences, Beijing Normal University at Zhuhai, Zhuhai, Guangdong 519087, China
  • Online: 2024-09-01   Published: 2024-08-30


Abstract: The rise of large language models marks a new milestone in deep learning, and fine-tuning techniques play a crucial role in optimizing model performance. This paper provides a comprehensive review of fine-tuning techniques for large language models. It traces the development of language models through four stages: statistical language models, neural network language models, pre-trained language models, and large language models. It then introduces the basic concepts of fine-tuning and surveys four major categories of methods: classic fine-tuning, parameter-efficient fine-tuning, prompt tuning, and reinforcement learning fine-tuning. The principles and evolution of each technique are examined, together with a comparative analysis across the four categories. Finally, the paper summarizes the current state of research on fine-tuning techniques, underscores the potential research value of this domain, and outlines future directions of development.

Key words: large language model, fine-tuning methods, pre-trained models, natural language processing
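
To make the surveyed categories concrete, the following is a minimal sketch of one representative parameter-efficient fine-tuning method, LoRA (low-rank adaptation), in plain PyTorch. It is illustrative only and not taken from the reviewed paper: the layer size, rank, and scaling factor are assumptions. The idea it demonstrates is the one the abstract names: the pre-trained weights are frozen, and only a small low-rank update is trained.

```python
# Minimal LoRA (low-rank adaptation) sketch in plain PyTorch.
# Illustrative assumptions: hidden size 768, rank 8, alpha 16;
# none of these values come from the reviewed paper.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze pre-trained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank factors: A projects down to `rank`, B projects back up.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = base(x) + scale * x A^T B^T; only A and B receive gradients.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)


if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(768, 768), rank=8)
    x = torch.randn(4, 768)
    y = layer(x)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    total = sum(p.numel() for p in layer.parameters())
    print(f"output shape: {tuple(y.shape)}, trainable params: {trainable}/{total}")
```

With these assumed sizes, the trainable update adds only 2 × 8 × 768 parameters per layer versus 768 × 768 for full fine-tuning, which is the efficiency argument behind the parameter-efficient methods the review compares.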
