Computer Engineering and Applications ›› 2019, Vol. 55 ›› Issue (20): 101-106. DOI: 10.3778/j.issn.1002-8331.1806-0394


Deep Q Net Based on Advantage Learning

XIA Zongtao, QIN Jin   

  1. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China
  • Online: 2019-10-15   Published: 2019-10-14

Abstract: In reinforcement learning, the gap between the state-action values of different actions in the same state can be very small. Because the Q-Learning algorithm uses the max operator to select actions, it suffers from an overestimation problem, and the Deep Q Net (DQN), which combines deep networks with Q-Learning, inherits the same problem. To alleviate overestimation in the deep Q network, a deep Q network based on advantage learning is proposed. A correction term is constructed by the advantage learning method and modeled with the target value network; the sum of this correction term and the evaluation function Q of the deep Q network is used as the new evaluation function. When the selected action is the optimal action, the correction term is zero and the value of the evaluation function is left unchanged; when the selected action is not the optimal action, the correction term is negative and the estimated value of the non-optimal action is reduced. Compared with the traditional deep Q network, the deep Q network based on advantage learning achieves higher average rewards on Atari 2600 control problems such as Breakout, Seaquest, Phoenix, and Amidar, and learns more stable policies on Krull and Seaquest.
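The abstract does not give the exact form of the correction term; the following is a minimal sketch of one formulation consistent with the description above (the coefficient α and the parameter symbols θ and θ⁻ are assumptions introduced here, not taken from the paper):

\[
\tilde{Q}(s,a) \;=\; Q(s,a;\theta) \;+\; \alpha \left[\, Q(s,a;\theta^{-}) - \max_{a'} Q(s,a';\theta^{-}) \,\right], \qquad \alpha \in (0,1]
\]

Here Q(s,a;θ) is the evaluation (online) network of the DQN and Q(s,a;θ⁻) is the target value network used to model the correction term. The bracketed term is zero when a is the greedy action under the target network and negative otherwise, so the estimated values of non-optimal actions are pushed down, which widens the action gap and mitigates the overestimation caused by the max operator.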

Key words: reinforcement learning, advantage learning, Deep Q Net(DQN), overestimation
