Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (8): 33-44. DOI: 10.3778/j.issn.1002-8331.2112-0082
• Research Hotspots and Reviews •

Review of Research on Approximate Reinforcement Learning Algorithms
SI Yanna, PU Jiexin, SUN Lifan
Online: 2022-04-15
Published: 2022-04-15
SI Yanna, PU Jiexin, SUN Lifan. Review of Research on Approximate Reinforcement Learning Algorithms[J]. Computer Engineering and Applications, 2022, 58(8): 33-44.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2112-0082
[1] SILVER D,HUANG A,MADDISON C J,et al.Mastering the game of Go with deep neural networks and tree search[J].Nature,2016,529:484-489.
[2] KIEU S,BADE A,HIJAZI M,et al.A survey of deep learning for lung disease detection on medical images:state-of-the-art,taxonomy,issues and future directions[J].Journal of Imaging,2020,6(12):131.
[3] ZHAO Z Q,ZHENG P,XU S,et al.Object detection with deep learning:a review[J].IEEE Transactions on Neural Networks and Learning Systems,2019,30(11):3212-3232.
[4] DARGAN S,KUMAR M,AYYAGARI M R,et al.A survey of deep learning and its applications:a new paradigm to machine learning[J].Archives of Computational Methods in Engineering,2020,27(4):1071-1092.
[5] BUSONIU L,DE BRUIN T,TOLIC D,et al.Reinforcement learning for control:performance,stability,and deep approximators[J].Annual Reviews in Control,2018,46:8-28.
[6] ZHOU S K,LE H N,LUU K,et al.Deep reinforcement learning in medical imaging:a literature review[J].Medical Image Analysis,2021,73:102193.
[7] GUPTA S,SINGAL G,GARG D.Deep reinforcement learning techniques in diversified domains:a survey[J].Archives of Computational Methods in Engineering,2021,28(7):1-40.
[8] IBRAHIM A M,YAU K L A,CHONG Y W,et al.Applications of multi-agent deep reinforcement learning:models and algorithms[J].Applied Sciences,2021,11(22):10870.
[9] HUANG C,CHEN G,GONG Y,et al.Buffer-aided relay selection for cooperative hybrid NOMA/OMA networks with asynchronous deep reinforcement learning[J].IEEE Journal on Selected Areas in Communications,2021,39(8):2514-2525.
[10] TORTORA M,CORDELLI E,SICILIA R,et al.Deep reinforcement learning for fractionated radiotherapy in non-small cell lung carcinoma[J].Artificial Intelligence in Medicine,2021,119:102137.
[11] TORRENTS-BARRENA J,PIELLA G,GRATACOS E,et al.Deep Q-CapsNet reinforcement learning framework for intrauterine cavity segmentation in TTTS fetal surgery planning[J].IEEE Transactions on Medical Imaging,2020,39(10):3113-3124.
[12] SUN Y,CHENG J,ZHANG G,et al.Mapless motion planning system for an autonomous underwater vehicle using policy gradient-based deep reinforcement learning[J].Journal of Intelligent & Robotic Systems,2019,96(3/4):591-601.
[13] SHANTIA A,TIMMERS R,CHONG Y,et al.Two-stage visual navigation by deep neural networks and multi-goal reinforcement learning[J].Robotics and Autonomous Systems,2021,138:103731.
[14] LAI Y H,WU T C,LAI C F,et al.Cognitive optimal-setting control of AIoT industrial applications with deep reinforcement learning[J].IEEE Transactions on Industrial Informatics,2020,17(3):2116-2123.
[15] LEE J,KOH H,CHOE H J.Learning to trade in financial time series using high-frequency through wavelet transformation and deep reinforcement learning[J].Applied Intelligence,2021,51(8):6202-6223.
[16] BRADTKE S J,BARTO A G.Linear least-squares algorithms for temporal difference learning[J].Machine Learning,1996,22(1/2/3):33-57.
[17] XU X,HE H,HU D.Efficient reinforcement learning using recursive least-squares methods[J].Journal of Artificial Intelligence Research,2002,16:259-292.
[18] LAGOUDAKIS M G,PARR R.Least-squares policy iteration[J].Journal of Machine Learning Research,2003,4(6):1107-1149.
[19] BUŞONIU L,ERNST D,DE SCHUTTER B,et al.Online least-squares policy iteration for reinforcement learning control[C]//Proceedings of the 2010 American Control Conference,2010:486-491.
[20] 周鑫,刘全,傅启明,等.一种批量最小二乘策略迭代方法[J].计算机科学,2014,41(9):232-238. ZHOU X,LIU Q,FU Q M,et al.Batch least-squares policy iteration[J].Computer Science,2014,41(9):232-238.
[21] 程玉虎,冯涣婷,王雪松.基于状态-动作图测地高斯基的策略迭代强化学习[J].自动化学报,2011,37(1):44-51. CHENG Y H,FENG H T,WANG X S.Policy iteration reinforcement learning based on geodesic Gaussian basis defined on state-action graph[J].Acta Automatica Sinica,2011,37(1):44-51.
[22] SONG T,LI D,CAO L,et al.Kernel-based least squares temporal difference with gradient correction[J].IEEE Transactions on Neural Networks and Learning Systems,2015,27(4):771-782.
[23] 季挺,张华.基于状态聚类的非参数化近似广义策略迭代增强学习算法[J].控制与决策,2017,32(12):2153-2161. JI T,ZHANG H.Nonparametric approximation generalized policy iteration reinforcement learning algorithm based on states clustering[J].Control and Decision,2017,32(12):2153-2161.
[24] KOLTER J Z,NG A Y.Regularization and feature selection in least-squares temporal difference learning[C]//Proceedings of the 26th Annual International Conference on Machine Learning(ICML),2009:521-528.
[25] CHEN S,GENG C,GU R.An efficient L2-norm regularized least-squares temporal difference learning algorithm[J].Knowledge-Based Systems,2013,45(6):94-99.
[26] KIM M S,HONG G G,LEE J J.Online fuzzy Q-learning with extended rule and interpolation technique[C]//1999 IEEE/RSJ International Conference on Intelligent Robots and Systems,1999:757-762.
[27] SHI H,LI X,HWANG K S,et al.Decoupled visual servoing with fuzzy Q-learning[J].IEEE Transactions on Industrial Informatics,2016,14(1):241-252.
[28] DERHAMI V,MAJD V J,AHMADABADI M N.Fuzzy Sarsa learning and the proof of existence of its stationary points[J].Asian Journal of Control,2008,10(5):535-549.
[29] HUANG J,ANGELOV P P,YIN C.Interpretable policies for reinforcement learning by empirical fuzzy sets[J].Engineering Applications of Artificial Intelligence,2020,91:103559.
[30] 刘智斌,曾晓勤,徐彦,等.采用资格迹的神经网络学习控制算法[J].控制理论与应用,2015,32(7):887-894. LIU Z B,ZENG X Q,XU Y,et al.Learning to control by neural networks using eligibility traces[J].Control Theory and Applications,2015,32(7):887-894.
[31] ZHANG F,DUAN S,WANG L.Route searching based on neural networks and heuristic reinforcement learning[J].Cognitive Neurodynamics,2017,11(3):245-258.
[32] PAN J,WANG X,CHENG Y,et al.Multi-source transfer ELM-based Q learning[J].Neurocomputing,2014,137(11):57-64.
[33] 张耀中,胡小方,周跃,等.基于多层忆阻脉冲神经网络的强化学习及应用[J].自动化学报,2019,45(8):1536-1547. ZHANG Y Z,HU X F,ZHOU Y,et al.A novel reinforcement learning algorithm based on multilayer memristive spiking neural network with applications[J].Acta Automatica Sinica,2019,45(8):1536-1547.
[34] 闵华清,曾嘉安,罗荣华,等.一种状态自动划分的模糊小脑模型关节控制器值函数拟合方法[J].控制理论与应用,2011,28(2):256-260. MIN H Q,ZENG J A,LUO R H,et al.Fuzzy cerebellar model arithmetic controller with automatic state partition for value function approximation[J].Control Theory and Applications,2011,28(2):256-260.
[35] 季挺,张华.基于CMAC的非参数化近似策略迭代增强学习[J].计算机工程与应用,2019,55(2):128-136. JI T,ZHANG H.Nonparametric approximation policy iteration reinforcement learning based on CMAC[J].Computer Engineering and Applications,2019,55(2):128-136.
[36] MNIH V,KAVUKCUOGLU K,SILVER D,et al.Human-level control through deep reinforcement learning[J].Nature,2015,518:529-533.
[37] HASSELT H,GUEZ A,SILVER D.Deep reinforcement learning with double Q-learning[C]//Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence,2016:2094-2100.
[38] SCHAUL T,QUAN J,ANTONOGLOU I,et al.Prioritized experience replay[J].arXiv:1511.05952,2015.
[39] WANG Z,SCHAUL T,HESSEL M,et al.Dueling network architectures for deep reinforcement learning[C]//International Conference on Machine Learning(ICML),2016:1995-2003.
[40] HAUSKNECHT M,STONE P.Deep recurrent Q-learning for partially observable MDPs[J].arXiv:1507.06527,2015.
[41] WILLIAMS R J.Simple statistical gradient-following algorithms for connectionist reinforcement learning[J].Machine Learning,1992,8(3/4):229-256.
[42] BAXTER J,BARTLETT P L.Infinite-horizon policy-gradient estimation[J].Journal of Artificial Intelligence Research,2001,15(1):319-350.
[43] ZHAO T,NIU G,XIE N,et al.Regularized policy gradients:direct variance reduction in policy gradient estimation[C]//7th Asian Conference on Machine Learning(ACML),2015:333-348.
[44] VIEN N A,YU H,CHUNG T C.Hessian matrix distribution for Bayesian policy gradient reinforcement learning[J].Information Sciences,2011,181(9):1671-1685.
[45] XU T,LIU Q,PENG J.Stochastic variance reduction for policy gradient estimation[J].arXiv:1710.06034,2017.
[46] 程玉虎,冯焕婷,王雪松.基于参数探索的期望最大化策略搜索[J].自动化学报,2012,38(1):38-45. CHENG Y H,FENG H T,WANG X S.Expectation-maximization policy search with parameter-based exploration[J].Acta Automatica Sinica,2012,38(1):38-45.
[47] HACHIYA H,PETERS J,SUGIYAMA M.Reward-weighted regression with sample reuse for direct policy search in reinforcement learning[J].Neural Computation,2011,23(11):2798-2832.
[48] HWANG K S,LIN J L,SHI H,et al.Policy learning with human reinforcement[J].International Journal of Fuzzy Systems,2016,18(4):618-629.
[49] SCHULMAN J,LEVINE S,MORITZ P,et al.Trust region policy optimization[J].arXiv:1502.05477v5,2015.
[50] SILVER D,LEVER G,HEESS N,et al.Deterministic policy gradient algorithms[C]//International Conference on Machine Learning(ICML),2014:387-395.
[51] BARTO A G,SUTTON R S,ANDERSON C W.Neuronlike adaptive elements that can solve difficult learning control problems[J].IEEE Transactions on Systems,Man,and Cybernetics,1983(5):834-846.
[52] SUTTON R S.Temporal credit assignment in reinforcement learning[D].University of Massachusetts at Amherst,1984.
[53] ANDERSON C W.Learning and problem solving with multilayer connectionist systems[D].University of Massachusetts at Amherst,1986.
[54] GRONDMAN I,BUSONIU L,LOPES G A D,et al.A survey of actor-critic reinforcement learning:standard and natural policy gradients[J].IEEE Transactions on Systems,Man,and Cybernetics,Part C,2012,42(6):1291-1307.
[55] CHENG Y H,YI J Q,ZHAO D B.Application of actor-critic learning to adaptive state space construction[C]//Proceedings of 2004 International Conference on Machine Learning and Cybernetics,2004:2985-2990.
[56] WANG X S,CHENG Y H,YI J Q.A fuzzy Actor-Critic reinforcement learning network[J].Information Sciences,2007,177(18):3764-3781.
[57] BHATNAGAR S.An actor-critic algorithm with function approximation for discounted cost constrained Markov decision processes[J].Systems & Control Letters,2010,59(12):760-766.
[58] LEE D H,LEE J J.Incremental receptive field weighted actor-critic[J].IEEE Transactions on Industrial Informatics,2013,9(1):62-71.
[59] PETERS J,VIJAYAKUMAR S,SCHAAL S.Reinforcement learning for humanoid robotics[C]//Proceedings of the Third IEEE-RAS International Conference on Humanoid Robots,2003:1-20.
[60] 朱斐,朱海军,刘全,等.一种解决连续空间问题的真实在线自然梯度AC算法[J].软件学报,2018,29(2):267-282. ZHU F,ZHU H J,LIU Q,et al.True online natural Actor-Critic algorithm for the continuous space problem[J].Journal of Software,2018,29(2):267-282.
[61] 钟珊,刘全,傅启明,等.一种采用模型学习和经验回放加速的正则化自然行动器评判器算法[J].计算机学报,2019,42(3):82-103. ZHONG S,LIU Q,FU Q M,et al.A regularized natural AC algorithm with the acceleration of model learning and experience replay[J].Chinese Journal of Computers,2019,42(3):82-103.
[62] LILLICRAP T P,HUNT J J,PRITZEL A,et al.Continuous control with deep reinforcement learning[J].arXiv:1509.02971,2015.
[63] FUJIMOTO S,HOOF H,MEGER D.Addressing function approximation error in actor-critic methods[J].arXiv:1802.09477v3,2018.
[64] MNIH V,BADIA A P,MIRZA M,et al.Asynchronous methods for deep reinforcement learning[C]//International Conference on Machine Learning(ICML),2016:1928-1937.
[65] HAARNOJA T,ZHOU A,ABBEEL P,et al.Soft actor-critic:off-policy maximum entropy deep reinforcement learning with a stochastic actor[J].arXiv:1801.01290v2,2018.
[66] SCHULMAN J,WOLSKI F,DHARIWAL P,et al.Proximal policy optimization algorithms[J].arXiv:1707.06347v2,2017.
[67] WANG Y,LI X,ZHANG J,et al.Review of wheeled mobile robot collision avoidance under unknown environment[J].Science Progress,2021,104(3):00368504211037771.
[68] TAI L,LIU M.Mobile robots exploration through CNN-based reinforcement learning[J].Robotics and Biomimetics,2016,3(1):1-8.
[69] TAI L,LI S H,LIU M.A deep-network solution towards modeless obstacle avoidance[C]//IEEE/RSJ International Conference on Intelligent Robots and Systems.Piscataway,USA:IEEE,2016:2759-2764.
[70] ZHU Y,MOTTAGHI R,KOLVE E,et al.Target-driven visual navigation in indoor scenes using deep reinforcement learning[C]//2017 IEEE International Conference on Robotics and Automation(ICRA),2017:3357-3364.
[71] LEE H S,JEONG J.Mobile robot path optimization technique based on reinforcement learning algorithm in warehouse environment[J].Applied Sciences,2021,11(3):1209.
[72] SAMSANI S S,MUHAMMAD M S.Socially compliant robot navigation in crowded environment by human behavior resemblance using deep reinforcement learning[J].IEEE Robotics and Automation Letters,2021,6(3):5223-5230.
[73] DE JESUS J C,BOTTEGA J A,DE SOUZA LEITE M A,et al.Deep deterministic policy gradient for navigation of mobile robots[J].Journal of Intelligent & Fuzzy Systems,2021,40:349-361.
[74] CHU Z,SUN B,ZHU D,et al.Motion control of unmanned underwater vehicles via deep imitation reinforcement learning algorithm[J].IET Intelligent Transport Systems,2020,14(7):764-774.
[75] YOU S,DIAO M,GAO L,et al.Target tracking strategy using deep deterministic policy gradient[J].Applied Soft Computing,2020,95:106490.
[76] LIN X B,LIU J,YU Y,et al.Event-triggered reinforcement learning control for the quadrotor UAV with actuator saturation[J].Neurocomputing,2020,415:135-145.
[77] EVANGELOS P,FARHAD A,MA O,et al.Robotic manipulation and capture in space:a survey[J].Frontiers in Robotics and AI,2021,8:686723.
[78] HU Y Z,WANG W X,LIU H,et al.Reinforcement learning tracking control for robotic manipulator with kernel-based dynamic model[J].IEEE Transactions on Neural Networks and Learning Systems,2020,31(9):3570-3578.
[79] KIM K,HAN D K,PARK J H,et al.Motion planning of robot manipulators for a smoother path using a twin delayed deep deterministic policy gradient with hindsight experience replay[J].Applied Sciences,2020,10(2):575.
[80] LIN G C,ZHU L X,LI J H,et al.Collision-free path planning for a guava-harvesting robot based on recurrent deep reinforcement learning[J].Computers and Electronics in Agriculture,2021,188:106350.
[81] JIANG D,CAI Z Q,PENG H J,et al.Coordinated control based on reinforcement learning for dual-arm continuum manipulators in space capture missions[J].Journal of Aerospace Engineering,2021,34(6):04021087.
[82] WONG C C,CHIEN S Y,FENG H M,et al.Motion planning for dual-arm robot based on soft actor-critic[J].IEEE Access,2021,9:26871-26885.
[1] WEI Tingting, YUAN Weilin, LUO Junren, ZHANG Wanpeng. Survey of Opponent Modeling Methods and Applications in Intelligent Game Confrontation[J]. Computer Engineering and Applications, 2022, 58(9): 19-29.
[2] GAO Jingpeng, HU Xinyu, JIANG Zhiye. Unmanned Aerial Vehicle Track Planning Algorithm Based on Improved DDPG[J]. Computer Engineering and Applications, 2022, 58(8): 264-272.
[3] XU Jie, ZHU Yukun, XING Chunxiao. Research on Financial Trading Algorithm Based on Deep Reinforcement Learning[J]. Computer Engineering and Applications, 2022, 58(7): 276-285.
[4] ZHAO Shuxu, YUAN Lin, ZHANG Zhanping. Multi-agent Edge Computing Task Offloading[J]. Computer Engineering and Applications, 2022, 58(6): 177-182.
[5] DENG Xin, NA Jun, ZHANG Handuo, WANG Yulin, ZHANG Bin. Personalized Adjustment Method of Intelligent Lamp Based on Deep Reinforcement Learning[J]. Computer Engineering and Applications, 2022, 58(6): 264-270.
[6] CHEN Zhongyu, HAN Xie, XIE Jianbin, XIONG Fengguang, KUANG Liqun. Reinforcement Learning-Based Image Matching Method Under Double Loss Estimations[J]. Computer Engineering and Applications, 2022, 58(5): 240-246.
[7] XU Bo, ZHOU Jianguo, WU Jing, LUO Wei. Routing Optimization Method Based on DDPG and Programmable Data Plane[J]. Computer Engineering and Applications, 2022, 58(3): 143-150.
[8] SONG Haonan, ZHAO Gang, SUN Ruoying. Developments of Knowledge Reasoning Based on Deep Reinforcement Learning[J]. Computer Engineering and Applications, 2022, 58(1): 12-25.
[9] NIU Pengfei, WANG Xiaofeng, LU Lei, ZHANG Jiulong. Survey on Vehicle Reinforcement Learning in Routing Problem[J]. Computer Engineering and Applications, 2022, 58(1): 41-55.
[10] ZHOU Youhang, ZHAO Hanyun, LIU Hanjiang, LI Yuze, XIAO Yuqin. Self-Learning Gait Planning Method for Biped Robot Using DDPG[J]. Computer Engineering and Applications, 2021, 57(6): 254-259.
[11] WANG Xiao, TANG Lun, HE Xiaoyu, CHEN Qianbin. Multi-dimensional Resource Optimization of Service Function Chain Based on Deep Reinforcement Learning[J]. Computer Engineering and Applications, 2021, 57(4): 68-76.
[12] LAI Jun, WEI Jingyi, CHEN Xiliang. Overview of Hierarchical Reinforcement Learning[J]. Computer Engineering and Applications, 2021, 57(3): 72-79.
[13] MA Zhihao, ZHU Xiangbin. Research on Quasi-hyperbolic Momentum Gradient for Adversarial Deep Reinforcement Learning[J]. Computer Engineering and Applications, 2021, 57(24): 90-99.
[14] LI Baoshuai, YE Chunming. Job Shop Scheduling Problem Based on Deep Reinforcement Learning[J]. Computer Engineering and Applications, 2021, 57(23): 248-254.
[15] WANG Jun, CAO Lei, CHEN Xiliang, LAI Jun, ZHANG Legui. Overview on Reinforcement Learning of Multi-agent Game[J]. Computer Engineering and Applications, 2021, 57(21): 1-13.