Deep Deterministic Policy Gradient Algorithm Based on Stochastic Variance Reduction Method

Computer Engineering and Applications, 2021, Vol. 57, Issue 19: 104-111. DOI: 10.3778/j.issn.1002-8331.2009-0097

YANG Xueyu, CHEN Jianping, FU Qiming, LU You, WU Hongjie

1. School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
2. Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
3. Suzhou Key Laboratory of Mobile Networking and Applied Technologies, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
4. Zhuhai Mizao Intelligent Technology Co., Ltd., Zhuhai, Guangdong 519000, China
5. Virtual Reality Key Laboratory of Intelligent Interaction and Application Technology of Suzhou, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China

Online: 2021-10-01    Published: 2021-09-29

Abstract

To address the slow convergence, unstable training, large variance, and poor sample efficiency of the Deep Deterministic Policy Gradient (DDPG) algorithm, this paper proposes SVR-DDPG, a deep deterministic policy gradient algorithm based on the Stochastic Variance Reduced Gradient (SVRG) method. The algorithm uses stochastic variance reduction techniques to build a new optimization strategy and applies it to the parameter update process of DDPG. Under this update rule, the variance of the estimated gradient has a decreasing upper bound, so the variance shrinks continuously and a more accurate gradient direction can be found from a small random training subset. This strategy addresses the problems caused by approximate gradient estimation error and speeds up convergence. SVR-DDPG and DDPG are both applied to the Pendulum and Mountain Car problems; the experimental results show that SVR-DDPG converges faster and is more stable than the original algorithm, which demonstrates its effectiveness.

Key words: deep reinforcement learning, Deep Q-Network (DQN), Deep Deterministic Policy Gradient (DDPG), stochastic variance reduction techniques
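
The variance-reduced update that the abstract refers to can be illustrated with the generic SVRG scheme on a toy problem. The sketch below is not the authors' SVR-DDPG implementation: it applies plain SVRG to a synthetic least-squares objective, and the function name `svrg_least_squares`, the step size, the loop lengths, and the data are all assumptions chosen for illustration. Each outer loop stores a snapshot of the parameters and its full gradient; each inner step then corrects a single-sample gradient with the difference between the snapshot's full gradient and the snapshot's single-sample gradient, so the noise of the estimate shrinks as the iterate and the snapshot approach the optimum.

```python
import numpy as np


def svrg_least_squares(X, y, step=0.01, epochs=30, inner_steps=None, seed=0):
    """Minimise (1/2n) * ||X w - y||^2 with generic SVRG (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    if inner_steps is None:
        inner_steps = 5 * n  # assumed inner-loop length, not from the paper
    w_snapshot = np.zeros(d)
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per outer loop.
        full_grad = X.T @ (X @ w_snapshot - y) / n
        w = w_snapshot.copy()
        for _ in range(inner_steps):
            i = rng.integers(n)  # draw one training sample uniformly
            xi, yi = X[i], y[i]
            grad_w = xi * (xi @ w - yi)              # sample gradient at w
            grad_snap = xi * (xi @ w_snapshot - yi)  # same sample at snapshot
            # Variance-reduced estimate: unbiased for the full gradient at w.
            w -= step * (grad_w - grad_snap + full_grad)
        w_snapshot = w
    return w_snapshot


# Toy usage on synthetic, noiseless data (all values are illustrative).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true
w_hat = svrg_least_squares(X, y)
print("parameter error:", np.linalg.norm(w_hat - w_true))
```

In a DDPG setting the same correction would presumably be applied to the minibatch gradients of the network losses rather than to a least-squares loss; how the paper does this exactly is not stated in the abstract.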

YANG Xueyu, CHEN Jianping, FU Qiming, LU You, WU Hongjie. Deep Deterministic Policy Gradient Algorithm Based on Stochastic Variance Reduction Method[J]. Computer Engineering and Applications, 2021, 57(19): 104-111.
