Evolutionary Reinforcement Learning Combining Meta-Learning and Safe Region Exploration

doi:10.3778/j.issn.1002-8331.2308-0342

Abstract

Abstract: The recently proposed evolutionary reinforcement learning (ERL) framework has shown that using evolutionary algorithms to improve the exploration ability of reinforcement learning benefits performance. However, existing ERL-based methods do not fully solve the scalability problem of mutation in evolutionary algorithms, and the inherent limitations of evolutionary algorithms make ERL slow to solve problems. To confine each exploration step to a safe region and to converge in less time, the proposed method first applies the idea of meta-learning to pre-train an initial population that needs only a few generations of evolution to perform well on a task. The pre-trained population is then applied to the task. During this process, sensitivity is used to adjust the range of population mutation, so that mutations stay within the safe region and do not cause unpredictable consequences. The method is evaluated on five robot locomotion tasks from OpenAI Gym. In all tested environments, it achieves competitive results against the baselines ERL, CEM-RL, and two state-of-the-art RL algorithms, PPO and TD3.

Key words: evolutionary reinforcement learning, meta-learning, pre-training, safe region, mutation operator
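
The abstract does not give the formula for sensitivity-limited mutation, so the following is only a minimal sketch of one way the idea could look, borrowing the output-gradient notion of sensitivity from the safe-mutation literature. The tiny tanh policy, the function names, and the `min_sens` floor are all hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np

def policy(weights, states):
    """Toy deterministic policy (assumed): action = tanh(states @ weights.T)."""
    return np.tanh(states @ weights.T)

def output_sensitivity(weights, states):
    """Root-mean-square gradient of the policy outputs w.r.t. each weight.

    For a = tanh(W s), d a_k / d W[k, j] = (1 - a_k^2) * s_j, and the
    gradient of every other output w.r.t. W[k, j] is zero.
    """
    actions = policy(weights, states)               # (batch, n_actions)
    slope = 1.0 - actions ** 2                      # derivative of tanh
    # grads[i, k, j] = slope[i, k] * states[i, j]
    grads = slope[:, :, None] * states[:, None, :]
    return np.sqrt(np.mean(grads ** 2, axis=0))     # same shape as weights

def safe_mutation(weights, states, sigma, rng, min_sens=1.0):
    """Gaussian mutation whose per-weight step shrinks as sensitivity grows.

    min_sens is a hypothetical floor: weights with low sensitivity receive
    a plain Gaussian step, and no step is ever amplified beyond sigma * noise.
    """
    sens = np.maximum(output_sensitivity(weights, states), min_sens)
    noise = rng.standard_normal(weights.shape)
    return weights + sigma * noise / sens

# Usage: mutate one individual using states sampled from its recent rollouts.
rng = np.random.default_rng(0)
parent = rng.standard_normal((2, 3)) * 0.1          # 2 actions, 3 state features
recent_states = rng.standard_normal((64, 3))
child = safe_mutation(parent, recent_states, sigma=0.05, rng=rng)
```

In this sketch a weight attached to a large-magnitude state feature is perturbed less than one attached to a small feature, which is one plausible reading of keeping mutation inside a safe region.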

LI Xiaoyi, HU Bin, QIN Jin, PENG Anlang. Evolutionary Reinforcement Learning Combining Meta-Learning and Safe Region Exploration[J]. Computer Engineering and Applications, 2025, 61(1): 361-367.
