计算机工程与应用 (Computer Engineering and Applications), 2021, Vol. 57, Issue (11): 148-155. DOI: 10.3778/j.issn.1002-8331.2003-0019

• Pattern Recognition and Artificial Intelligence •

基于平均序列累计奖赏的自适应ε-greedy策略

杨彤,秦进   

  1. 贵州大学 计算机科学与技术学院,贵阳 550025
  • 出版日期:2021-06-01 发布日期:2021-05-31

Adaptive ε-greedy Strategy Based on Average Episodic Cumulative Reward

YANG Tong, QIN Jin   

  1. College of Computer Science & Technology, Guizhou University, Guiyang 550025, China
  • Online: 2021-06-01 Published: 2021-05-31

摘要:

探索与利用的权衡是强化学习的挑战之一。探索使智能体为进一步改进策略而采取新的动作,而利用使智能体采用历史经验中的信息以最大化累计奖赏。深度强化学习中常用“ε-greedy”策略处理探索与利用的权衡问题,未考虑影响智能体做出决策的其他因素,具有一定的盲目性。针对此问题提出一种自适应调节探索因子的ε-greedy策略,该策略依据智能体每完成一次任务所获得的序列累计奖赏值指导智能体进行合理的探索或利用。序列累计奖赏值越大,说明当前智能体所采用的有效动作越多,减小探索因子以便更多地利用历史经验。反之,序列累计奖赏值越小,说明当前策略还有改进的空间,增大探索因子以便探索更多可能的动作。实验结果证明改进的策略在Playing Atari 2600视频游戏中取得了更高的平均奖赏值,说明改进的策略能更好地权衡探索与利用。

关键词: 深度强化学习, 探索与利用, 序列累计奖赏, ε-greedy策略

Abstract:

The trade-off between exploration and exploitation is one of the challenges of reinforcement learning. Exploration lets the agent take new actions to further improve its policy, while exploitation lets the agent use information from historical experience to maximize the cumulative reward. The ε-greedy strategy commonly used in deep reinforcement learning to handle this trade-off does not consider other factors that affect the agent's decisions, so it is somewhat blind. To address this problem, an ε-greedy strategy that adaptively adjusts the exploration factor is proposed. The strategy uses the episodic cumulative reward that the agent obtains each time it completes a task to guide the agent toward reasonable exploration or exploitation. A larger episodic cumulative reward indicates that the agent is currently taking more effective actions, so the exploration factor is decreased to make more use of historical experience. Conversely, a smaller episodic cumulative reward indicates that the current policy still has room for improvement, so the exploration factor is increased to explore more possible actions. Experimental results show that the improved strategy achieves higher average rewards in playing Atari 2600 video games, indicating that it balances exploration and exploitation better.

Key words: deep reinforcement learning, exploration and exploitation, episodic cumulative reward, ε-greedy strategy
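
The adaptive rule summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the update formula, so everything concrete here is an assumption made for illustration: the names `epsilon_greedy_action` and `AdaptiveEpsilon`, the comparison against a running average of recent episodic returns, the fixed step size, the bounds on ε, and the window length. Only the direction of the adjustment (larger return lowers ε, smaller return raises it) comes from the abstract.

```python
import random
from collections import deque


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Standard epsilon-greedy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


class AdaptiveEpsilon:
    """Hypothetical schedule that adjusts epsilon once per finished episode.

    Assumption: each episodic return is compared with the average of the
    most recent returns. The step size, bounds, and window length are
    illustrative values, not taken from the paper.
    """

    def __init__(self, epsilon=0.5, eps_min=0.01, eps_max=1.0, step=0.05, window=10):
        self.epsilon = epsilon
        self.eps_min = eps_min
        self.eps_max = eps_max
        self.step = step
        self.returns = deque(maxlen=window)  # recent episodic cumulative rewards

    def update(self, episode_return):
        """Record one episodic return and return the adjusted epsilon."""
        if self.returns:
            average = sum(self.returns) / len(self.returns)
            if episode_return > average:
                # Above-average return: exploit more, so shrink epsilon.
                self.epsilon -= self.step
            elif episode_return < average:
                # Below-average return: explore more, so grow epsilon.
                self.epsilon += self.step
            # Keep epsilon inside the assumed bounds.
            self.epsilon = min(self.eps_max, max(self.eps_min, self.epsilon))
        self.returns.append(episode_return)
        return self.epsilon
```

In a training loop under these assumptions, `update` would be called once at the end of each episode with that episode's cumulative reward, and the resulting `epsilon` would be passed to `epsilon_greedy_action` for every step of the next episode.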