Deep Q Net Based on Advantage Learning

doi:10.3778/j.issn.1002-8331.1806-0394

Computer Engineering and Applications, 2019, Vol. 55, Issue (20): 101-106. DOI: 10.3778/j.issn.1002-8331.1806-0394

XIA Zongtao, QIN Jin

College of Computer Science and Technology, Guizhou University, Guiyang 550025, China

Online:2019-10-15 Published:2019-10-14

Abstract

In reinforcement learning, the state-action values of different actions in the same state can differ by very little. Under these conditions, Q-Learning suffers from overestimation when it selects actions with the max operator, and the deep Q net (DQN), which is built on Q-Learning, inherits the same problem. To alleviate overestimation in DQN, a deep Q net based on advantage learning is proposed. A correction term is constructed by the advantage learning method and modeled with the target value network; the sum of this correction term and the DQN evaluation function serves as the new evaluation function. When the selected action is the optimal action, the correction term is zero and the value of the evaluation function is unchanged. When the selected action is not the optimal action, the correction term is negative, which lowers the evaluated value of the non-optimal action. Compared with the traditional deep Q net, the deep Q net based on advantage learning achieves a higher average reward on the Atari 2600 control problems breakout, seaquest, phoenix, and amidar, and a more stable policy on krull and seaquest.

Key words: reinforcement learning, advantage learning, Deep Q Net (DQN), overestimation
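
The abstract describes the correction term only in words and does not state its formula. As a minimal sketch, the Python below assumes the standard advantage learning form, in which the correction is alpha * (Q(s, a) - max_b Q(s, b)) computed from target-network outputs and added to the usual DQN target. The function names, the weight `alpha`, and the sample Q-values are hypothetical and are not taken from the paper.

```python
def al_correction(q_target_s, action, alpha=0.9):
    """Assumed correction term: alpha * (Q(s, a) - max_b Q(s, b)).

    q_target_s holds target-network Q-values for every action in state s.
    The result is 0 for the greedy action and negative for any other action.
    """
    return alpha * (q_target_s[action] - max(q_target_s))


def al_dqn_target(reward, q_target_s, q_target_next, action,
                  gamma=0.99, alpha=0.9, done=False):
    """Usual DQN target plus the assumed advantage-learning correction."""
    # Standard DQN bootstrap from the target network at the next state.
    bootstrap = 0.0 if done else gamma * max(q_target_next)
    dqn_target = reward + bootstrap
    # The correction lowers the target only for non-greedy actions.
    return dqn_target + al_correction(q_target_s, action, alpha)


# Hypothetical target-network outputs for state s and next state s'.
q_s = [1.0, 1.5, 1.25]
q_next = [2.0, 0.5, 1.0]

greedy = al_dqn_target(1.0, q_s, q_next, action=1, gamma=0.5, alpha=0.5)
other = al_dqn_target(1.0, q_s, q_next, action=0, gamma=0.5, alpha=0.5)
print(greedy, other)  # 2.0 1.75
```

In this sketch the greedy action (index 1) keeps the plain DQN target, while the non-greedy action has its target reduced in proportion to its gap from the best action, which matches the behavior of the correction term described in the abstract.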

XIA Zongtao, QIN Jin. Deep Q Net Based on Advantage Learning[J]. Computer Engineering and Applications, 2019, 55(20): 101-106.
