Computer Engineering and Applications, 2023, Vol. 59, Issue (11): 63-70. DOI: 10.3778/j.issn.1002-8331.2205-0060

• Theory, Research and Development •

Improved Policy Optimization Algorithm Based on Curiosity Mechanism

ZHANG Qiyang, CHEN Xiliang, CAO Lei, LAI Jun   

  1. College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China
  • Online: 2023-06-01  Published: 2023-06-01

Abstract: During the generation of reinforcement learning decision models, the classical proximal policy optimization algorithm suffers from low exploration and exploitation efficiency and poor quality of the generated policies, owing to complex environments and incomplete observation of state information. To address these problems, this paper proposes MNAEK-PPO (proximal policy optimization based on maximum number of arrival & expert knowledge), an improved algorithm built on a curiosity mechanism. To tackle the difficulty of exploring the policy space, an exploration frequency matrix of the agent is constructed during training, and the processed exploration frequencies serve as an intrinsic reward that participates in the agent's reinforcement learning process; in addition, expert knowledge is incorporated to assist the agent in making decisions. Experiments in an intelligent battlefield simulation environment determine the best construction of the intrinsic reward in MNAEK-PPO, and a series of comparative experiments are carried out. The results show that MNAEK-PPO greatly improves the exploration efficiency of the decision space, with clear gains in both convergence speed and game score, providing a new approach for promoting the application and development of deep reinforcement learning in intelligent tactical strategy generation.
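To make the count-based curiosity idea above concrete, the following is a minimal Python sketch of a visit-count ("exploration frequency") intrinsic reward added to the environment reward during PPO training. It assumes a discrete, grid-like state space; the class name VisitCountBonus, the scale parameter beta, and the normalization by the running maximum visit count are illustrative assumptions, not the paper's exact construction, and the expert-knowledge component is omitted.

    import numpy as np

    class VisitCountBonus:
        # Count-based intrinsic reward: rarely visited states earn a larger bonus.
        # Illustrative sketch only; not the exact MNAEK-PPO reward construction.
        def __init__(self, grid_shape, beta=0.1):
            self.counts = np.zeros(grid_shape, dtype=np.int64)  # exploration frequency matrix
            self.beta = beta                                     # intrinsic reward scale

        def intrinsic_reward(self, state_idx):
            self.counts[state_idx] += 1                  # record this arrival
            n = self.counts[state_idx]                   # arrivals at the current state
            n_max = self.counts.max()                    # maximum number of arrivals so far
            return self.beta * (1.0 - n / n_max)         # bonus shrinks as the state becomes familiar

    # During training, the bonus is simply added to the extrinsic reward before the PPO update:
    # total_reward = extrinsic_reward + bonus.intrinsic_reward((row, col))

Normalizing by the running maximum keeps the bonus bounded in [0, beta); other count-based shapings are equally plausible, which is consistent with the abstract's note that the best intrinsic reward construction was selected experimentally.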

Key words: artificial intelligence, deep reinforcement learning, curiosity mechanism, knowledge transfer, policy optimization, intelligent tactics