Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (10): 341-352. DOI: 10.3778/j.issn.1002-8331.2302-0033

• Engineering and Applications •


Reinforcement Imitation Learning Method Based on Collision Prediction for Robot Navigation

WANG Haojie, TAO Ye, LU Chaofeng   

  1. School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, Shandong 266100, China
  • Online:2024-05-15 Published:2024-05-15


Key words: navigation, reinforcement learning, imitation learning, collision prediction, hybrid control

Abstract: Learning-based robot navigation methods depend heavily on training data and perform imperfectly in certain environments; for example, the robot cannot follow a straight line through open space and suffers a high collision rate among dense obstacles. To improve the robot's navigation performance, a reinforcement imitation learning navigation method based on collision prediction is proposed. First, in a model-free setting, the state space, action space, and reward function required by the Markov decision process (MDP) are constructed according to the robot's capabilities. The policy is then trained in a simulation environment with deep reinforcement learning (DRL), giving the robot the ability to navigate and avoid obstacles in multi-obstacle environments. Next, collected expert data are used to continue training the policy by imitation learning, remedying the imperfect performance of reinforcement learning in the two extreme cases of sparse and dense obstacles. Finally, a collision prediction model is designed that combines traditional control with deep learning; based on its predictions, the robot adaptively selects a suitable control policy in different environments, greatly improving navigation safety. Experiments in a large number of previously unseen scenarios verify the navigation performance and generalization capability of the proposed method.
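The hybrid control idea summarized in the abstract (a collision predictor deciding whether the learned policy or a conservative classical controller drives the robot) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the class name, the 0.5 risk threshold, and the toy lidar-based predictor are all assumptions introduced here.

```python
# Hypothetical sketch of prediction-based policy switching. All names,
# the threshold, and the toy controllers are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Action = Tuple[float, float]  # (linear velocity, angular velocity)


@dataclass
class HybridController:
    learned_policy: Callable[[Sequence[float]], Action]      # DRL + imitation-trained policy
    safe_controller: Callable[[Sequence[float]], Action]     # traditional reactive controller
    collision_predictor: Callable[[Sequence[float]], float]  # estimated P(collision)
    threshold: float = 0.5

    def act(self, observation: Sequence[float]) -> Action:
        # High predicted collision risk -> fall back to the safe controller;
        # otherwise let the learned policy drive toward the goal.
        if self.collision_predictor(observation) >= self.threshold:
            return self.safe_controller(observation)
        return self.learned_policy(observation)


# Toy usage: the predictor treats small range readings as high risk.
ctrl = HybridController(
    learned_policy=lambda obs: (0.8, 0.0),           # fast, straight toward goal
    safe_controller=lambda obs: (0.1, 0.3),          # slow, turning avoidance
    collision_predictor=lambda obs: 1.0 - min(obs),  # risk rises as ranges shrink
)
print(ctrl.act([0.9, 0.8, 1.0]))  # low risk: learned policy acts
print(ctrl.act([0.2, 0.1, 0.3]))  # high risk: safe controller acts
```

The switch keeps the learned policy's efficiency in open space while bounding worst-case behavior near obstacles, which matches the safety motivation stated in the abstract.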