Research on Motion Control Decision-Making of Manipulators Based on Multi-Agent Reinforcement Learning

doi:10.3778/j.issn.1002-8331.2207-0159

Abstract

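Abstract: Traditional motion control algorithms adapt poorly to changing environments and are inefficient. Reinforcement learning offers an alternative: the agent explores the environment by trial and error, and a reward function drives the adjustment of the neural network parameters that control the manipulator's motion. Because a physical manipulator cannot be given a real environment in which to learn by trial and error, this paper builds a digital twin simulation environment for the manipulator on the Unity engine platform and defines the observed state variables and the reward function mechanism. Within this environment it proposes M-PPO, an algorithm that combines PPO (proximal policy optimization) with multiple agents to speed up training, so that the manipulator is controlled intelligently by reinforcement learning and its end effector avoids obstacles and reaches the target object quickly. The experimental results of M-PPO are compared with those of M-SAC (multiple agents combined with Soft Actor-Critic) and PPO, verifying the effectiveness and advancement of M-PPO for tuning manipulator motion control decisions in different environments. The digital twin thereby plans and decides autonomously and, in turn, drives the physical manipulator to move in synchrony.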

Keywords: reinforcement learning, Unity engine, motion control, M-PPO algorithm, multi-agent

Abstract: Traditional motion control algorithms suffer from poor environmental adaptability and low efficiency. Reinforcement learning can address this by exploring the environment through trial and error and adjusting the neural network parameters through a reward function in order to control the motion of the manipulator. In reality, however, it is not possible to provide a trial-and-error environment for the manipulator. This paper therefore uses the Unity engine platform to build a digital twin simulation environment for the manipulator, defines the observation state variables and the reward function mechanism, and proposes the M-PPO algorithm, which combines PPO (proximal policy optimization) with multiple agents in this model environment to speed up training and achieve intelligent motion control of the manipulator through reinforcement learning. The end effector of the manipulator avoids obstacles effectively and reaches the position of the target object quickly. The experimental results of M-PPO are analyzed against those of M-SAC (multi-agent combined with Soft Actor-Critic) and PPO, verifying the effectiveness and advancement of M-PPO in tuning the manipulator's motion control decisions under different environments. This achieves autonomous planning and decision-making by the digital twin, which in turn controls the synchronous movement of the physical manipulator.

Key words: reinforcement learning, Unity engine, motion control, M-PPO algorithm, multi-agent
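
The abstract names PPO combined with multiple agents but does not give the update rule. As a rough illustration only, the sketch below computes the standard clipped surrogate objective from the PPO literature over samples pooled from several agents. The function names, the pooling scheme, and the value `clip_eps=0.2` are assumptions made for this example; they are not taken from the paper, and the actual M-PPO design may differ.

```python
import math


def ppo_clipped_objective(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective of standard PPO (a quantity to maximize).

    clip_eps=0.2 is a commonly used default, assumed here for illustration.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio pi_new(a|s) / pi_old(a|s), computed from log-probs.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - eps, 1 + eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def pool_agent_samples(per_agent_samples):
    """Hypothetical helper: merge per-agent (new_lp, old_lp, adv) lists.

    Assumes all agents share one policy, so their samples can be
    concatenated into a single batch for one update.
    """
    new_lp, old_lp, adv = [], [], []
    for agent_new, agent_old, agent_adv in per_agent_samples:
        new_lp.extend(agent_new)
        old_lp.extend(agent_old)
        adv.extend(agent_adv)
    return new_lp, old_lp, adv


# Two simulated agents, one sample each (made-up numbers).
samples = [
    ([0.0], [0.0], [1.0]),          # ratio 1.0, unaffected by clipping
    ([math.log(2.0)], [0.0], [1.0]),  # ratio 2.0, clipped to 1.2
]
objective = ppo_clipped_objective(*pool_agent_samples(samples))
```

Under the shared-policy assumption, adding agents enlarges the batch gathered per simulation step, which is one plausible way a multi-agent setup could shorten training time.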

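YANG Bo, WANG Kun, MA Xiangxiang, FAN Biao, XU Lei, YAN Hao. Research on motion control decision-making of manipulators based on multi-agent reinforcement learning[J]. Computer Engineering and Applications, 2023, 59(6): 318-325. (in Chinese)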

YANG Bo, WANG Kun, MA Xiangxiang, FAN Biao, XU Lei, YAN Hao. Research on Motion Control Method of Manipulator Based on Reinforcement Learning[J]. Computer Engineering and Applications, 2023, 59(6): 318-325.
