Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (11): 63-70. DOI: 10.3778/j.issn.1002-8331.2205-0060

• Theory, Research and Development •

Improved Policy Optimization Algorithm Based on Curiosity Mechanism

ZHANG Qiyang, CHEN Xiliang, CAO Lei, LAI Jun   

  1. College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China
  • Online: 2023-06-01   Published: 2023-06-01

Abstract: In the generation of reinforcement learning decision models, complex environments and incompletely observed state information leave the classical proximal policy optimization algorithm with low exploration and exploitation efficiency and poor resulting policies. To address these problems, this paper proposes MNAEK-PPO (proximal policy optimization based on maximum number of arrival & expert knowledge), an improved algorithm built on a curiosity mechanism. To tackle the difficulty of exploring the policy space, an exploration frequency matrix of the agent is constructed during training, and the processed exploration frequencies serve as an intrinsic reward in the agent's reinforcement learning process; in addition, expert knowledge is incorporated to assist the agent's decision making. Experiments in an intelligent battlefield simulation environment determine the best construction of the intrinsic reward in MNAEK-PPO, and a series of comparative experiments are carried out. The results show that MNAEK-PPO greatly improves the exploration efficiency of the decision space, with clear gains in both convergence speed and game score, offering a new solution for applying deep reinforcement learning to the generation of intelligent tactical strategies.
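A minimal Python sketch of the count-based intrinsic reward described above. The paper determines the best construction of the intrinsic reward experimentally and gives no code in this abstract; the class name, the grid-shaped state discretization, the beta coefficient, and the 1/sqrt(N) reward form below are illustrative assumptions, not the authors' implementation.

    import numpy as np

    class CountBasedCuriosity:
        """Sketch of a count-based curiosity bonus, assuming the state
        space can be discretized into a grid (hypothetical setup)."""

        def __init__(self, grid_shape, beta=0.1):
            # Exploration frequency matrix: visit counts per discrete state.
            self.visit_counts = np.zeros(grid_shape, dtype=np.int64)
            self.beta = beta  # scale of the intrinsic reward (assumed)

        def intrinsic_reward(self, state_index):
            # Count the visit, then reward rarely visited states more
            # strongly; the 1/sqrt(N) form is a common choice, used here
            # only for illustration.
            self.visit_counts[state_index] += 1
            return self.beta / np.sqrt(self.visit_counts[state_index])

    # Usage: add the bonus to the environment reward before computing the
    # PPO advantage estimate, e.g. for an agent at grid cell (x, y):
    #   total_reward = extrinsic_reward + curiosity.intrinsic_reward((x, y))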

Key words: artificial intelligence, deep reinforcement learning, curiosity mechanism, knowledge transfer, policy optimization, intelligent tactics
