Progress on Deep Reinforcement Learning in Path Planning

doi:10.3778/j.issn.1002-8331.2104-0369

Abstract

The goal of path planning is to enable a robot to avoid obstacles while quickly planning the shortest path as it moves. After analyzing the advantages and disadvantages of path planning algorithms based on reinforcement learning, the paper introduces the Deep Q-learning Network (DQN), a representative deep reinforcement learning algorithm that can plan paths well in complex dynamic environments. The basic principles and limitations of the DQN algorithm are analyzed in depth, and the strengths and weaknesses of the DQN variants are compared and classified from four aspects: the training algorithm, the neural network structure, the learning mechanism, and variants of the Actor-Critic (AC) framework. The paper then sets out the current challenges and open problems for path planning methods based on deep reinforcement learning and proposes future development directions, which can serve as a reference for the development of intelligent robot path planning and autonomous driving.

Key words: deep reinforcement learning, path planning, neural network structure, Actor-Critic (AC) framework

ZHANG Rongxia, WU Changxu, SUN Tongchao, ZHAO Zengshun. Progress on Deep Reinforcement Learning in Path Planning[J]. Computer Engineering and Applications, 2021, 57(19): 44-56.
