Computer Engineering and Applications, 2021, Vol. 57, Issue 11: 148-155. DOI: 10.3778/j.issn.1002-8331.2003-0019


Adaptive ε-greedy Strategy Based on Average Episodic Cumulative Reward

YANG Tong, QIN Jin   

  1. College of Computer Science & Technology, Guizhou University, Guiyang 550025, China
  Online: 2021-06-01    Published: 2021-05-31


Abstract:

The trade-off between exploration and exploitation is one of the challenges of reinforcement learning. Exploration makes the agent take new actions to improve its policy, while exploitation makes the agent use information from historical experience to maximize the cumulative reward. The ε-greedy strategy commonly used in deep reinforcement learning handles this trade-off without considering other factors that affect the agent's decision-making, and is therefore somewhat blind. To address this problem, an adaptive ε-greedy strategy that adjusts the exploration factor is proposed. The strategy guides the agent to explore or exploit appropriately according to the episodic cumulative reward obtained each time the agent completes a task. A larger episodic cumulative reward indicates that the current agent is taking more effective actions, so the adaptive strategy decreases the exploration factor to make greater use of historical experience. Conversely, a smaller episodic cumulative reward indicates that the current policy still has room for improvement, so the adaptive strategy increases the exploration factor to explore more possible actions. Experimental results show that the improved strategy achieves higher average rewards on Atari 2600 video games, indicating that it balances exploration and exploitation more effectively.
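The abstract describes the adaptation rule only qualitatively; the exact update formula appears in the full paper. Below is a minimal sketch of the mechanism, assuming a fixed additive step size and a running average of past episode returns as the comparison baseline. Names such as step_size, eps_min, and eps_max are illustrative assumptions, not taken from the paper.

```python
import random

class AdaptiveEpsilonGreedy:
    """Illustrative adaptive epsilon-greedy schedule driven by episode returns.

    Assumption: epsilon is nudged by a fixed step after each episode,
    comparing the episode's cumulative reward against the running average
    of previous episodes. The paper's actual update rule may differ.
    """

    def __init__(self, eps=1.0, eps_min=0.05, eps_max=1.0, step_size=0.01):
        self.eps = eps
        self.eps_min = eps_min
        self.eps_max = eps_max
        self.step_size = step_size
        self.avg_return = 0.0   # running average of episodic cumulative reward
        self.episodes = 0

    def select_action(self, q_values):
        """Standard epsilon-greedy selection over a list of Q-values."""
        if random.random() < self.eps:
            return random.randrange(len(q_values))                      # explore
        return max(range(len(q_values)), key=q_values.__getitem__)      # exploit

    def end_episode(self, episode_return):
        """Adapt epsilon from the episodic cumulative reward.

        A return above the average of past episodes suggests the current
        policy is taking effective actions, so epsilon is decreased
        (more exploitation); otherwise epsilon is increased
        (more exploration).
        """
        if self.episodes > 0 and episode_return > self.avg_return:
            self.eps = max(self.eps_min, self.eps - self.step_size)
        else:
            self.eps = min(self.eps_max, self.eps + self.step_size)
        self.episodes += 1
        self.avg_return += (episode_return - self.avg_return) / self.episodes
```

In a training loop, select_action would be called at every step, the per-step rewards accumulated, and end_episode called with the episode's total once the task terminates.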

Key words: deep reinforcement learning, exploration and exploitation, episodic cumulative reward, ε-greedy strategy
