Application of Deep Reinforcement Learning Algorithm on Intelligent Military Decision System

doi:10.3778/j.issn.1002-8331.2104-0114

Abstract

Deep reinforcement learning algorithms handle discrete decision-making well, but they are difficult to apply to modern battlefield situations, which are highly complex and involve continuous actions, and they converge poorly in multi-agent environments. To address these problems, an improved Deep Deterministic Policy Gradient (DDPG) algorithm is proposed. It introduces prioritized experience replay and a single training mode to speed up convergence, and it adds an exploration strategy based on mixed double noise to realize complex, continuous military decision-making and control behavior. An intelligent military decision simulation platform based on the improved DDPG algorithm is developed in Unity3D, and a simulation environment in which Blue Army infantry attack a Red Army military base is built to simulate multi-agent combat training. The experimental results show that the algorithm can drive multiple combat agents to complete tactical maneuvers and carry out tactical behaviors such as bypassing obstacles to reach an advantageous area for shooting. The algorithm converges faster, is more stable, and obtains higher episode rewards, thereby improving the efficiency of intelligent military decision-making.

Key words: deep reinforcement learning, deep Q-network, deep deterministic policy gradient, intelligent military decision-making, multi-agent
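
The two ingredients named in the abstract, prioritized experience replay and a two-component exploration noise, can be illustrated with a minimal sketch. This is not the authors' implementation. The buffer follows the generic proportional prioritization scheme, and because the abstract does not say which two noise sources are mixed, the combination of Ornstein-Uhlenbeck noise and Gaussian noise below is an assumption. All class names, function names, and parameter values (`PrioritizedReplayBuffer`, `OUNoise`, `noisy_action`, `alpha`, `beta`, `theta`, `sigma`) are hypothetical.

```python
import random


class PrioritizedReplayBuffer:
    """Proportional prioritized experience replay (illustrative sketch)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-3):
        self.capacity = capacity
        self.alpha = alpha    # 0 gives uniform sampling, 1 gives fully proportional
        self.eps = eps        # keeps every priority strictly positive
        self.data = []
        self.priorities = []
        self.pos = 0          # next slot to overwrite once the buffer is full

    def add(self, transition):
        # New transitions receive the current maximum priority,
        # which makes them likely to be sampled before being re-scored.
        p = max(self.priorities, default=1.0)
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(p)
        else:
            self.data[self.pos] = transition
            self.priorities[self.pos] = p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4, rng=random):
        n = len(self.data)
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        idxs = rng.choices(range(n), weights=scaled, k=batch_size)
        # Importance-sampling weights correct the bias of non-uniform sampling.
        is_w = [(n * scaled[i] / total) ** (-beta) for i in idxs]
        max_w = max(is_w)
        is_w = [w / max_w for w in is_w]
        return idxs, [self.data[i] for i in idxs], is_w

    def update_priorities(self, idxs, td_errors):
        # Priority is the magnitude of the TD error from the critic update.
        for i, err in zip(idxs, td_errors):
            self.priorities[i] = abs(err) + self.eps


class OUNoise:
    """Temporally correlated Ornstein-Uhlenbeck noise (assumed component)."""

    def __init__(self, theta=0.15, sigma=0.2, mu=0.0):
        self.theta, self.sigma, self.mu = theta, sigma, mu
        self.x = mu

    def sample(self, rng=random):
        self.x += self.theta * (self.mu - self.x) + self.sigma * rng.gauss(0.0, 1.0)
        return self.x


def noisy_action(action, ou, gauss_sigma=0.1, low=-1.0, high=1.0, rng=random):
    # Assumed mixture: correlated OU noise plus independent Gaussian noise,
    # clipped to the valid range of a scalar continuous action.
    a = action + ou.sample(rng) + rng.gauss(0.0, gauss_sigma)
    return min(high, max(low, a))
```

In a DDPG training loop, the actor output would pass through something like `noisy_action` before being sent to the environment, and the absolute TD errors from each critic update would be fed back through `update_priorities` so that surprising transitions are replayed more often.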

KUANG Liqun, LI Siyuan, FENG Li, HAN Xie, XU Qingyu. Application of Deep Reinforcement Learning Algorithm on Intelligent Military Decision System[J]. Computer Engineering and Applications, 2021, 57(20): 271-278.
