Computer Engineering and Applications, 2024, Vol. 60, Issue (7): 1-12. DOI: 10.3778/j.issn.1002-8331.2307-0370
• Research Hotspots and Reviews •
Review of Development of Deep Learning Optimizer
CHANG Xilong, LIANG Kun, LI Wentao
Online: 2024-04-01
Published: 2024-04-01
CHANG Xilong, LIANG Kun, LI Wentao. Review of Development of Deep Learning Optimizer[J]. Computer Engineering and Applications, 2024, 60(7): 1-12.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2307-0370
Related Articles
[1] ZHOU Dingwei, HU Jing, ZHANG Liangrui, DUAN Feiya. Collaborative Correction Technology of Label Omission in Dataset for Object Detection[J]. Computer Engineering and Applications, 2024, 60(8): 267-273.
[2] ZHOU Bojun, CHEN Zhiyu. Survey of Few-Shot Image Classification Based on Deep Meta-Learning[J]. Computer Engineering and Applications, 2024, 60(8): 1-15.
[3] SUN Shilei, LI Ming, LIU Jing, MA Jingang, CHEN Tianzhen. Research Progress on Deep Learning in Field of Diabetic Retinopathy Classification[J]. Computer Engineering and Applications, 2024, 60(8): 16-30.
[4] WANG Weitai, WANG Xiaoqiang, LI Leixiao, TAO Yihao, LIN Hao. Review of Construction and Applications of Spatio-Temporal Graph Neural Network in Traffic Flow Prediction[J]. Computer Engineering and Applications, 2024, 60(8): 31-45.
[5] XIE Weiyu, ZHANG Qiang. Review on Detection of Drones and Birds in Photoelectric Images Based on Deep Learning Convolutional Neural Network[J]. Computer Engineering and Applications, 2024, 60(8): 46-55.
[6] GUO Jin, SONG Tingqiang, SUN Yuanyuan, GONG Chuanjiang, LIU Yalin, MA Xinglu, FAN Haisheng. Improved Deeplabv3+ Crop Classification Method Based on Double Attention Fusion[J]. Computer Engineering and Applications, 2024, 60(8): 110-120.
[7] ZOU Zhentao, LI Zeping. Improved YOLOv7 for UAV Image Object Detection[J]. Computer Engineering and Applications, 2024, 60(8): 173-181.
[8] ZHOU Yutong, MA Zhiqiang, XU Biqi, JIA Wenchao, LYU Kai, LIU Jia. Survey of Deep Learning-Based Emotion Generation in Conversation[J]. Computer Engineering and Applications, 2024, 60(7): 13-25.
[9] JIANG Liang, ZHANG Cheng, WEI Dejian, CAO Hui, DU Yuzheng. Deep Learning in Aided Diagnosis of Osteoporosis[J]. Computer Engineering and Applications, 2024, 60(7): 26-40.
[10] LIU Jianhua, WANG Nan, BAI Mingchen. Progress of Instantiated Reality Augmentation Method for Smart Phone Indoor Scene Elements[J]. Computer Engineering and Applications, 2024, 60(7): 58-69.
[11] HAO Zhifeng, LIU Jun, WEN Wen, CAI Ruichu. Temporal Event Prediction Based on Implicit Relationship of Multiple Sequences[J]. Computer Engineering and Applications, 2024, 60(7): 119-127.
[12] PIAN Xinyang, WANG Yu, ZHANG Jie. Applying Attention Transformer Module to 3D Lip Sequence Identification[J]. Computer Engineering and Applications, 2024, 60(7): 141-146.
[13] YUAN Jing, PAN Su, XIE Hao, XU Wenpeng. Stock Price Prediction Integrating Investor Sentiment Based on S_AM_BiLSTM Model[J]. Computer Engineering and Applications, 2024, 60(7): 274-281.
[14] TAN Zhenlin, LIU Ziliang, HUANG Aiquan, CHEN Huihui, ZHONG Yong. Review of Deep Learning Methods for Palm Vein Recognition[J]. Computer Engineering and Applications, 2024, 60(6): 55-67.
[15] LAI Jing’an, CHEN Ziqiang, SUN Zongwei, PEI Qingqi. Lightweight Foggy Weather Object Detection Method Based on YOLOv5[J]. Computer Engineering and Applications, 2024, 60(6): 78-88.