ZHANG Feng, GU Qiran, YUAN Shuai. Path Planning Method for Mobile Robot Based on Curiosity Distillation Double Q-Network[J]. Computer Engineering and Applications, 2023, 59(19): 316-322.
[1] LI W B.Obstacle avoidance path planning method for industrial robots based on deep reinforcement learning[J].Manufacturing Automation,2022,44(1):127-130.
[2] MNIH V,KAVUKCUOGLU K,SILVER D,et al.Playing Atari with deep reinforcement learning[J].arXiv:1312.5602,2013.
[3] WANG J,YANG Y X,LI L.Mobile robot path planning based on improved deep reinforcement learning[J].Electronic Measurement Technology,2021,44(22):19-24.
[4] CHRISTIANO P F,LEIKE J,BROWN T,et al.Deep reinforcement learning from human preferences[C]//Advances in Neural Information Processing Systems,2017.
[5] BELLEMARE M,SRINIVASAN S,OSTROVSKI G,et al.Unifying count-based exploration and intrinsic motivation[C]//Advances in Neural Information Processing Systems,2016.
[6] WU Q,ZHANG Y,GUO K,et al.Reinforcement learning path planning algorithm combined with LSTM in dynamic environments[J].Journal of Chinese Computer Systems,2021,42(2):334-339.
[7] FENG S,SHU H,XIE B Q.3D environment path planning based on improved deep reinforcement learning[J].Computer Applications and Software,2021,38(1):250-255.
[8] ZHANG J J,ZHANG C,ZHAO H J.Dueling deep Q network algorithm with state value reuse[J].Computer Engineering and Applications,2021,57(4):134-140.
[9] ZHAO Y N,LIU P,ZHAO W,et al.Twice sampling method in deep Q-network[J].Acta Automatica Sinica,2019,45(10):1870-1882.
[10] ZHANG J,SPRINGENBERG J T,BOEDECKER J,et al.Deep reinforcement learning with successor features for navigation across similar environments[J].arXiv:1612.05533,2016.
[11] LYU L,ZHANG S,DING D,et al.Path planning via an improved DQN-based learning policy[J].IEEE Access,2019,7:67319-67330.
[12] CHEN X L,CAO L,LI C X,et al.Deep reinforcement learning via good choice resampling experience replay memory[J].Control and Decision,2018,33(4):600-606.
[13] DONG Y F,YANG C,DONG Y,et al.Robot path planning based on improved DQN[J].Computer Engineering and Design,2021,42(2):552-558.
[14] XU Z X,CAO L,ZHANG Y L,et al.Research on deep reinforcement learning algorithm based on dynamic fusion target[J].Computer Engineering and Applications,2019,55(7):157-161.
[15] LIU Q,YAN Y,ZHU F,et al.A deep recurrent Q network with exploratory noise[J].Chinese Journal of Computers,2019,42(7):1588-1604.
[16] KIM K,KIM D,LEE J.Deep learning based on smooth driving for autonomous navigation[C]//Proceedings of the IEEE Conference,2018.
[17] XIA Z T,QIN J.Improved algorithm for deep Q network[J].Application Research of Computers,2019,36(12):3661-3665.
[18] LONG P,FAN T,LIAO X,et al.Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning[C]//Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA),2018:6252-6259.
[19] YUE P,XIN J,ZHAO H,et al.Experimental research on deep reinforcement learning in autonomous navigation of mobile robot[C]//Proceedings of the 2019 14th IEEE Conference on Industrial Electronics and Applications (ICIEA),2019:1612-1616.
[20] VAN HASSELT H,GUEZ A,SILVER D.Deep reinforcement learning with double Q-learning[C]//Proceedings of the AAAI Conference on Artificial Intelligence,2016.
[21] WANG Z Y,DE FREITAS N,LANCTOT M.Dueling network architectures for deep reinforcement learning[J].arXiv:1511.06581,2015.