[1] DIEBOLD F X. What’s the big idea? “Big Data” and its origins[J]. Significance, 2021, 18(1): 36-37.
[2] SZE V, CHEN Y H, EMER J, et al. Hardware for machine learning: challenges and opportunities[C]//2017 IEEE Custom Integrated Circuits Conference (CICC), 2017.
[3] SCHMIDHUBER J. Deep learning in neural networks: an overview[J]. Neural Networks, 2015, 61: 85-117.
[4] TORFI A, SHIRVANI R A, KENESHLOO Y, et al. Natural language processing advancements by deep learning: a survey[J]. arXiv:2003.01200, 2020.
[5] KOROTEEV M. BERT: a review of applications in natural language processing and understanding[J]. arXiv:2103.11943, 2021.
[6] SHEN Y C, HSIA T C, HSU C H. Analysis of electronic health records based on deep learning with natural language processing[J]. Arabian Journal for Science and Engineering, 2021: 1-11.
[7] SUTSKEVER I, MARTENS J, HINTON G E. Generating text with recurrent neural networks[C]//International Conference on Machine Learning, 2011.
[8] YANG Z, YANG D, DYER C, et al. Hierarchical attention networks for document classification[C]//Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016.
[9] TAN Z X, SU J S, WANG B L, et al. Lattice-to-sequence attentional neural machine translation models[J]. Neurocomputing, 2018, 284: 138-147.
[10] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems, 2017.
[11] HAO S, LEE D H, ZHAO D. Sequence to sequence learning with attention mechanism for short-term passenger flow prediction in large-scale metro system[J]. Transportation Research Part C: Emerging Technologies, 2019, 107: 287-300.
[12] BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners[C]//Advances in Neural Information Processing Systems, 2020: 1877-1901.
[13] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training[Z]. 2018.
[14] RADFORD A, WU J, CHILD R, et al. Language models are unsupervised multitask learners[J]. OpenAI Blog, 2019, 1(8): 9.
[15] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[J]. arXiv:1810.04805, 2018.
[16] SHI J R, WANG D, SHANG F H, et al. Research advances on stochastic gradient descent algorithms[J]. Acta Automatica Sinica, 2021, 47(9): 2103-2119. (in Chinese)
[17] ZHANG H. Research and improvement of optimization algorithms in deep learning[D]. Beijing: Beijing University of Posts and Telecommunications, 2018. (in Chinese)
[18] KINGMA D P, BA J. Adam: a method for stochastic optimization[J]. arXiv:1412.6980, 2014.
[19] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[C]//Advances in Neural Information Processing Systems, 2014.
[20] PETERS M, NEUMANN M, ZETTLEMOYER L, et al. Dissecting contextual word embeddings: architecture and representation[J]. arXiv:1808.08949, 2018.
[21] DUCHI J, HAZAN E, SINGER Y. Adaptive subgradient methods for online learning and stochastic optimization[J]. Journal of Machine Learning Research, 2011, 12(7): 2121-2159.
[22] ZEILER M D. ADADELTA: an adaptive learning rate method[J]. arXiv:1212.5701, 2012.
[23] LOSHCHILOV I, HUTTER F. Decoupled weight decay regularization[J]. arXiv:1711.05101, 2017.
[24] LIU L, JIANG H, HE P, et al. On the variance of the adaptive learning rate and beyond[J]. arXiv:1908.03265, 2019.
[25] ZHANG M, LUCAS J, BA J, et al. Lookahead optimizer: k steps forward, 1 step back[C]//Advances in Neural Information Processing Systems, 2019.
[26] WRIGHT L, DEMEURE N. Ranger21: a synergistic deep learning optimizer[J]. arXiv:2106.13731, 2021.
[27] BOTTOU L, CURTIS F E, NOCEDAL J. Optimization methods for large-scale machine learning[J]. SIAM Review, 2018, 60(2): 223-311.
[28] DEKEL O, GILAD-BACHRACH R, SHAMIR O, et al. Optimal distributed online prediction using mini-batches[J]. arXiv:1012.1367, 2010.
[29] SMITH L N. No more pesky learning rate guessing games[J]. arXiv:1506.01186, 2015.
[30] SMITH L N. Cyclical learning rates for training neural networks[J]. arXiv:1506.01186, 2015.
[31] O’DONOGHUE B, CANDES E. Adaptive restart for accelerated gradient schemes[J]. Foundations of Computational Mathematics, 2015, 15: 715-732.
[32] LOSHCHILOV I, HUTTER F. SGDR: stochastic gradient descent with restarts[J]. arXiv:1608.03983, 2016.
[33] DINH L, PASCANU R, BENGIO S, et al. Sharp minima can generalize for deep nets[J]. arXiv:1703.04933, 2017.
[34] KESKAR N S, MUDIGERE D, NOCEDAL J, et al. On large-batch training for deep learning: generalization gap and sharp minima[J]. arXiv:1609.04836, 2016.
[35] ZHANG J, HE T, SRA S, et al. Why gradient clipping accelerates training: a theoretical justification for adaptivity[C]//International Conference on Learning Representations, 2020.
[36] IOFFE S, SZEGEDY C. Batch normalization: accelerating deep network training by reducing internal covariate shift[C]//International Conference on Machine Learning, 2015: 448-456.
[37] XIE Z, YUAN L, ZHU Z, et al. Positive-negative momentum: manipulating stochastic gradient noise to improve generalization[C]//International Conference on Machine Learning, 2021: 11448-11458.
[38] GEORGIOU T, SCHMITT S, BÄCK T, et al. Norm loss: an efficient yet effective regularization method for deep neural networks[C]//2020 25th International Conference on Pattern Recognition, 2021: 8812-8818.
[39] XIE Z, SATO I, SUGIYAMA M. Stable weight decay regularization[J]. arXiv:2011.11152, 2020.
[40] MA J, YARATS D. On the adequacy of untuned warmup for adaptive optimization[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2021: 8828-8836.
[41] ARTETXE M, SCHWENK H. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond[J]. Transactions of the Association for Computational Linguistics, 2019, 7: 597-610.
[42] IYER N, THEJAS V, KWATRA N, et al. Wide-minima density hypothesis and the explore-exploit learning rate schedule[J]. Journal of Machine Learning Research, 2023, 24(65): 1-37.
[43] PENNINGTON J, SOCHER R, MANNING C. GloVe: global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[44] DEAN J, CORRADO G, MONGA R, et al. Large scale distributed deep networks[C]//Advances in Neural Information Processing Systems, 2012.
[45] TIELEMAN T, HINTON G. Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude[Z]. Coursera: Neural Networks for Machine Learning, 2012.
[46] ANDRYCHOWICZ M, DENIL M, GOMEZ S, et al. Learning to learn by gradient descent by gradient descent[C]//Advances in Neural Information Processing Systems, 2016.
[47] BABICHEV D, BACH F. Constant step size stochastic gradient descent for probabilistic modeling[J]. arXiv:1804.05567, 2018.
[48] YOU Y, GITMAN I, GINSBURG B. Large batch training of convolutional networks[J]. arXiv:1708.03888, 2017.
[49] KESKAR N S, SOCHER R. Improving generalization performance by switching from Adam to SGD[J]. arXiv:1712.07628, 2017.
[50] IZMAILOV P, PODOPRIKHIN D, GARIPOV T, et al. Averaging weights leads to wider optima and better generalization[J]. arXiv:1803.05407, 2018.