Computer Engineering and Applications, 2024, Vol. 60, Issue (7): 1-12. DOI: 10.3778/j.issn.1002-8331.2307-0370
• Research Hotspots and Reviews •
Review of Development of Deep Learning Optimizer
CHANG Xilong, LIANG Kun, LI Wentao
Online: 2024-04-01
Published: 2024-04-01
CHANG Xilong, LIANG Kun, LI Wentao. Review of Development of Deep Learning Optimizer[J]. Computer Engineering and Applications, 2024, 60(7): 1-12.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2307-0370
Related Articles
[1] ZHOU Dingwei, HU Jing, ZHANG Liangrui, DUAN Feiya. Collaborative Correction Technology of Label Omission in Dataset for Object Detection[J]. Computer Engineering and Applications, 2024, 60(8): 267-273.
[2] ZHOU Bojun, CHEN Zhiyu. Survey of Few-Shot Image Classification Based on Deep Meta-Learning[J]. Computer Engineering and Applications, 2024, 60(8): 1-15.
[3] SUN Shilei, LI Ming, LIU Jing, MA Jingang, CHEN Tianzhen. Research Progress on Deep Learning in Field of Diabetic Retinopathy Classification[J]. Computer Engineering and Applications, 2024, 60(8): 16-30.
[4] WANG Weitai, WANG Xiaoqiang, LI Leixiao, TAO Yihao, LIN Hao. Review of Construction and Applications of Spatio-Temporal Graph Neural Network in Traffic Flow Prediction[J]. Computer Engineering and Applications, 2024, 60(8): 31-45.
[5] XIE Weiyu, ZHANG Qiang. Review on Detection of Drones and Birds in Photoelectric Images Based on Deep Learning Convolutional Neural Network[J]. Computer Engineering and Applications, 2024, 60(8): 46-55.
[6] GUO Jin, SONG Tingqiang, SUN Yuanyuan, GONG Chuanjiang, LIU Yalin, MA Xinglu, FAN Haisheng. Improved Deeplabv3+ Crop Classification Method Based on Double Attention Fusion[J]. Computer Engineering and Applications, 2024, 60(8): 110-120.
[7] ZOU Zhentao, LI Zeping. Improved YOLOv7 for UAV Image Object Detection[J]. Computer Engineering and Applications, 2024, 60(8): 173-181.
[8] ZHOU Yutong, MA Zhiqiang, XU Biqi, JIA Wenchao, LYU Kai, LIU Jia. Survey of Deep Learning-Based Emotion Generation in Conversation[J]. Computer Engineering and Applications, 2024, 60(7): 13-25.
[9] JIANG Liang, ZHANG Cheng, WEI Dejian, CAO Hui, DU Yuzheng. Deep Learning in Aided Diagnosis of Osteoporosis[J]. Computer Engineering and Applications, 2024, 60(7): 26-40.
[10] LIU Jianhua, WANG Nan, BAI Mingchen. Progress of Instantiated Reality Augmentation Method for Smart Phone Indoor Scene Elements[J]. Computer Engineering and Applications, 2024, 60(7): 58-69.
[11] HAO Zhifeng, LIU Jun, WEN Wen, CAI Ruichu. Temporal Event Prediction Based on Implicit Relationship of Multiple Sequences[J]. Computer Engineering and Applications, 2024, 60(7): 119-127.
[12] PIAN Xinyang, WANG Yu, ZHANG Jie. Applying Attention Transformer Module to 3D Lip Sequence Identification[J]. Computer Engineering and Applications, 2024, 60(7): 141-146.
[13] YUAN Jing, PAN Su, XIE Hao, XU Wenpeng. Stock Price Prediction Integrating Investor Sentiment Based on S_AM_BiLSTM Model[J]. Computer Engineering and Applications, 2024, 60(7): 274-281.
[14] TAN Zhenlin, LIU Ziliang, HUANG Aiquan, CHEN Huihui, ZHONG Yong. Review of Deep Learning Methods for Palm Vein Recognition[J]. Computer Engineering and Applications, 2024, 60(6): 55-67.
[15] LAI Jing’an, CHEN Ziqiang, SUN Zongwei, PEI Qingqi. Lightweight Foggy Weather Object Detection Method Based on YOLOv5[J]. Computer Engineering and Applications, 2024, 60(6): 78-88.