Computer Engineering and Applications (计算机工程与应用) ›› 2021, Vol. 57 ›› Issue (20): 271-278. DOI: 10.3778/j.issn.1002-8331.2104-0114

• Engineering and Applications •

Application of Deep Reinforcement Learning Algorithm on Intelligent Military Decision System

KUANG Liqun (况立群), LI Siyuan (李思远), FENG Li (冯利), HAN Xie (韩燮), XU Qingyu (徐清宇)

  1. School of Data Science and Technology, North University of China, Taiyuan 030051, China
  2. Department of Simulation Equipment, North Automatic Control Technology Institute, Taiyuan 030006, China
  • Online: 2021-10-15 Published: 2021-10-21

Abstract:

Deep reinforcement learning algorithms handle discrete decision-making well, but they are difficult to apply to modern battlefield environments, which are highly complex and involve continuous actions, and they converge poorly in multi-agent settings. To address these problems, an improved Deep Deterministic Policy Gradient (DDPG) algorithm is proposed. It introduces priority-based experience replay and a single training mode to speed up convergence, and it adds a mixed double-noise exploration strategy to realize complex, continuous military decision-making and control behavior. An intelligent military decision simulation platform based on the improved DDPG algorithm was developed in Unity3D, with a simulation environment in which Blue Army infantry attack a Red Army military base, to simulate multi-agent combat training. The experimental results show that the algorithm can drive multiple combat agents to complete tactical maneuvers and perform tactical behaviors such as bypassing obstacles to reach an advantageous area for shooting. The algorithm converges faster, is more stable, and obtains higher episode rewards, thereby improving the efficiency of intelligent military decision-making.

Key words: deep reinforcement learning, deep Q-network, deep deterministic policy gradient, intelligent military decision-making, multi-agent
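
The abstract names priority-based experience replay as one ingredient of the improved DDPG algorithm but does not say which variant is used. As a rough illustration only, the sketch below assumes the common proportional scheme, in which a transition is sampled with probability proportional to its priority raised to a power `alpha`, and importance-sampling weights with exponent `beta` offset the resulting bias. The class name, the parameters `alpha`, `beta`, and `eps`, and their default values are assumptions made for this example and are not taken from the paper.

```python
import random


class PrioritizedReplayBuffer:
    """Hypothetical proportional prioritized replay: P(i) = p_i**alpha / sum_k p_k**alpha."""

    def __init__(self, capacity, alpha=0.6, eps=1e-5):
        self.capacity = capacity
        self.alpha = alpha      # assumed: 0 = uniform sampling, 1 = fully priority-driven
        self.eps = eps          # assumed: keeps every priority strictly positive
        self.data = []          # stored transitions
        self.priorities = []    # one priority per stored transition
        self.pos = 0            # next slot to overwrite once the buffer is full

    def add(self, transition):
        # A new transition gets the current maximum priority, so it is
        # likely to be replayed at least once before being re-scored.
        p = max(self.priorities, default=1.0)
        if len(self.data) < self.capacity:
            self.data.append(transition)
            self.priorities.append(p)
        else:
            self.data[self.pos] = transition
            self.priorities[self.pos] = p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4, rng=random):
        scaled = [p ** self.alpha for p in self.priorities]
        total = sum(scaled)
        probs = [s / total for s in scaled]
        n = len(self.data)
        idx = rng.choices(range(n), weights=probs, k=batch_size)
        # Importance-sampling weights, normalized by the largest weight in
        # the batch so that they only ever scale updates down.
        weights = [(n * probs[i]) ** (-beta) for i in idx]
        max_w = max(weights)
        weights = [w / max_w for w in weights]
        batch = [self.data[i] for i in idx]
        return batch, idx, weights

    def update_priorities(self, idx, td_errors):
        # After a learning step, re-score the sampled transitions by |TD error|.
        for i, err in zip(idx, td_errors):
            self.priorities[i] = abs(err) + self.eps
```

In a DDPG-style training loop, one would typically call `sample` to draw a batch, multiply each critic loss term by the returned weight, and then pass the new temporal-difference errors to `update_priorities`. How the paper combines this with its single training mode is not described in the abstract.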