Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (19): 302-308. DOI: 10.3778/j.issn.1002-8331.2102-0307

• Engineering and Applications •

Design of Reinforcement Learning Reward Function for Trajectory Planning of Robot Manipulator

JIN Dongyin, LI Yue, SHAO Zhenzhou, SHI Zhiping, GUAN Yong   

  1. College of Information Engineering, Capital Normal University, Beijing 100048, China
  2. Beijing Key Laboratory of Light Industrial Robot and Safety Verification, Capital Normal University, Beijing 100048, China
  3. Department of Computer Technology, Hebei College of Industry and Technology, Shijiazhuang 050000, China
  4. Beijing Advanced Innovation Center for Imaging Technology, Capital Normal University, Beijing 100048, China
  • Online: 2022-10-01 Published: 2022-10-01

Abstract: Trajectory planning methods for robot manipulators based on deep reinforcement learning suffer from low learning efficiency and poor robustness of the learned planning policy. To address these problems, this paper proposes a manipulator trajectory planning method based on a voice reward function. Voice instructions define the different states of the planning task, and a Markov chain models these states, which provides global guidance for trajectory planning and reduces the blindness of deep reinforcement learning optimization. The reward function combines voice-based global information with local information based on relative distance: in each state, the manipulator is rewarded or penalized according to how well the relative distance agrees with the voice guidance. Experimental results demonstrate that the designed reward function effectively improves the robustness and convergence rate of manipulator trajectory planning based on deep reinforcement learning.

Key words: deep reinforcement learning, robot manipulator, trajectory planning, voice reward function
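
The abstract does not give the reward formula, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes two hypothetical voice-defined states (`APPROACH` and `HOLD`), a transition rule that depends only on the current distance to the target (a simple Markov-style assumption), and a reward of +1 or -1 depending on whether the change in relative distance agrees with the guidance of the current state. All names, thresholds, and reward magnitudes are illustrative assumptions.

```python
from enum import Enum


class VoiceState(Enum):
    """Hypothetical voice-defined task states (names are illustrative)."""
    APPROACH = "approach"  # guidance: move the end effector closer to the target
    HOLD = "hold"          # guidance: stay where you are, near the target


def next_state(distance: float, reach_threshold: float) -> VoiceState:
    """Assumed transition rule: the next state depends only on the current distance."""
    if distance <= reach_threshold:
        return VoiceState.HOLD
    return VoiceState.APPROACH


def voice_reward(state: VoiceState, prev_distance: float, distance: float,
                 tolerance: float = 1e-3) -> float:
    """Reward (+1) or penalize (-1) by agreement between the distance change
    (local information) and the guidance of the voice state (global information)."""
    moved_closer_by = prev_distance - distance  # positive when the arm got closer
    if state is VoiceState.APPROACH:
        agrees = moved_closer_by > tolerance
    else:  # VoiceState.HOLD
        agrees = abs(moved_closer_by) <= tolerance
    return 1.0 if agrees else -1.0
```

In a training loop, a reward of this shape would be computed once per step from the previous and current distance between the end effector and the target, with `next_state` selecting the guidance that applies at the following step.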