Computer Engineering and Applications ›› 2022, Vol. 58 ›› Issue (8): 33-44. DOI: 10.3778/j.issn.1002-8331.2112-0082
• Research Hotspots and Reviews •

Review of Research on Approximate Reinforcement Learning Algorithms
SI Yanna, PU Jiexin, SUN Lifan
Online: 2022-04-15
Published: 2022-04-15
SI Yanna, PU Jiexin, SUN Lifan. Review of Research on Approximate Reinforcement Learning Algorithms[J]. Computer Engineering and Applications, 2022, 58(8): 33-44.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2112-0082
[1] SILVER D,HUANG A,MADDISON C J,et al.Mastering the game of Go with deep neural networks and tree search[J].Nature,2016,529:484-489.
[2] KIEU S,BADE A,HIJAZI M,et al.A survey of deep learning for lung disease detection on medical images:state-of-the-art,taxonomy,issues and future directions[J].Journal of Imaging,2020,6(12):131.
[3] ZHAO Z Q,ZHENG P,XU S,et al.Object detection with deep learning:a review[J].IEEE Transactions on Neural Networks and Learning Systems,2019,30(11):3212-3232.
[4] DARGAN S,KUMAR M,AYYAGARI M R,et al.A survey of deep learning and its applications:a new paradigm to machine learning[J].Archives of Computational Methods in Engineering,2020,27(4):1071-1092.
[5] BUSONIU L,DE BRUIN T,TOLIC D,et al.Reinforcement learning for control:performance,stability,and deep approximators[J].Annual Reviews in Control,2018,46:8-28.
[6] ZHOU S K,LE H N,LUU K,et al.Deep reinforcement learning in medical imaging:a literature review[J].Medical Image Analysis,2021,73:102193.
[7] GUPTA S,SINGAL G,GARG D.Deep reinforcement learning techniques in diversified domains:a survey[J].Archives of Computational Methods in Engineering,2021,28(7):1-40.
[8] IBRAHIM A M,YAU K L A,CHONG Y W,et al.Applications of multi-agent deep reinforcement learning:models and algorithms[J].Applied Sciences,2021,11(22):10870.
[9] HUANG C,CHEN G,GONG Y,et al.Buffer-aided relay selection for cooperative hybrid NOMA/OMA networks with asynchronous deep reinforcement learning[J].IEEE Journal on Selected Areas in Communications,2021,39(8):2514-2525.
[10] TORTORA M,CORDELLI E,SICILIA R,et al.Deep reinforcement learning for fractionated radiotherapy in non-small cell lung carcinoma[J].Artificial Intelligence in Medicine,2021,119:102137.
[11] TORRENTS-BARRENA J,PIELLA G,GRATACOS E,et al.Deep Q-CapsNet reinforcement learning framework for intrauterine cavity segmentation in TTTS fetal surgery planning[J].IEEE Transactions on Medical Imaging,2020,39(10):3113-3124.
[12] SUN Y,CHENG J,ZHANG G,et al.Mapless motion planning system for an autonomous underwater vehicle using policy gradient-based deep reinforcement learning[J].Journal of Intelligent & Robotic Systems,2019,96(3/4):591-601.
[13] SHANTIA A,TIMMERS R,CHONG Y,et al.Two-stage visual navigation by deep neural networks and multi-goal reinforcement learning[J].Robotics and Autonomous Systems,2021,138:103731.
[14] LAI Y H,WU T C,LAI C F,et al.Cognitive optimal-setting control of AIoT industrial applications with deep reinforcement learning[J].IEEE Transactions on Industrial Informatics,2020,17(3):2116-2123.
[15] LEE J,KOH H,CHOE H J.Learning to trade in financial time series using high-frequency through wavelet transformation and deep reinforcement learning[J].Applied Intelligence,2021,51(8):6202-6223.
[16] BRADTKE S J,BARTO A G.Linear least-squares algorithms for temporal difference learning[J].Machine Learning,1996,22(1/2/3):33-57.
[17] XU X,HE H,HU D.Efficient reinforcement learning using recursive least-squares methods[J].Journal of Artificial Intelligence Research,2002,16:259-292.
[18] LAGOUDAKIS M G,PARR R.Least-squares policy iteration[J].Journal of Machine Learning Research,2003,4(6):1107-1149.
[19] BUŞONIU L,ERNST D,DE SCHUTTER B,et al.Online least-squares policy iteration for reinforcement learning control[C]//Proceedings of the 2010 American Control Conference,2010:486-491.
[20] 周鑫,刘全,傅启明,等.一种批量最小二乘策略迭代方法[J].计算机科学,2014,41(9):232-238. ZHOU X,LIU Q,FU Q M,et al.Batch least-squares policy iteration[J].Computer Science,2014,41(9):232-238.
[21] 程玉虎,冯涣婷,王雪松.基于状态-动作图测地高斯基的策略迭代强化学习[J].自动化学报,2011,37(1):44-51. CHENG Y H,FENG H T,WANG X S.Policy iteration reinforcement learning based on geodesic Gaussian basis defined on state-action graph[J].Acta Automatica Sinica,2011,37(1):44-51.
[22] SONG T,LI D,CAO L,et al.Kernel-based least squares temporal difference with gradient correction[J].IEEE Transactions on Neural Networks and Learning Systems,2015,27(4):771-782.
[23] 季挺,张华.基于状态聚类的非参数化近似广义策略迭代增强学习算法[J].控制与决策,2017,32(12):2153-2161. JI T,ZHANG H.Nonparametric approximation generalized policy iteration reinforcement learning algorithm based on states clustering[J].Control and Decision,2017,32(12):2153-2161.
[24] KOLTER J Z,NG A Y.Regularization and feature selection in least-squares temporal difference learning[C]//Proceedings of the 26th Annual International Conference on Machine Learning(ICML),2009:521-528.
[25] CHEN S,GENG C,GU R.An efficient L2-norm regularized least-squares temporal difference learning algorithm[J].Knowledge-Based Systems,2013,45(6):94-99.
[26] KIM M S,HONG G G,LEE J J.Online fuzzy Q-learning with extended rule and interpolation technique[C]//1999 IEEE/RSJ International Conference on Intelligent Robots and Systems,1999:757-762.
[27] SHI H,LI X,HWANG K S,et al.Decoupled visual servoing with fuzzy Q-learning[J].IEEE Transactions on Industrial Informatics,2016,14(1):241-252.
[28] DERHAMI V,MAJD V J,AHMADABADI M N.Fuzzy Sarsa learning and the proof of existence of its stationary points[J].Asian Journal of Control,2008,10(5):535-549.
[29] HUANG J,ANGELOV P P,YIN C.Interpretable policies for reinforcement learning by empirical fuzzy sets[J].Engineering Applications of Artificial Intelligence,2020,91:103559.
[30] 刘智斌,曾晓勤,徐彦,等.采用资格迹的神经网络学习控制算法[J].控制理论与应用,2015,32(7):887-894. LIU Z B,ZENG X Q,XU Y,et al.Learning to control by neural networks using eligibility traces[J].Control Theory and Applications,2015,32(7):887-894.
[31] ZHANG F,DUAN S,WANG L.Route searching based on neural networks and heuristic reinforcement learning[J].Cognitive Neurodynamics,2017,11(3):245-258.
[32] PAN J,WANG X,CHENG Y,et al.Multi-source transfer ELM-based Q learning[J].Neurocomputing,2014,137(11):57-64.
[33] 张耀中,胡小方,周跃,等.基于多层忆阻脉冲神经网络的强化学习及应用[J].自动化学报,2019,45(8):1536-1547. ZHANG Y Z,HU X F,ZHOU Y,et al.A novel reinforcement learning algorithm based on multilayer memristive spiking neural network with applications[J].Acta Automatica Sinica,2019,45(8):1536-1547.
[34] 闵华清,曾嘉安,罗荣华,等.一种状态自动划分的模糊小脑模型关节控制器值函数拟合方法[J].控制理论与应用,2011,28(2):256-260. MIN H Q,ZENG J A,LUO R H,et al.Fuzzy cerebellar model arithmetic controller with automatic state partition for value function approximation[J].Control Theory and Applications,2011,28(2):256-260.
[35] 季挺,张华.基于CMAC的非参数化近似策略迭代增强学习[J].计算机工程与应用,2019,55(2):128-136. JI T,ZHANG H.Nonparametric approximation policy iteration reinforcement learning based on CMAC[J].Computer Engineering and Applications,2019,55(2):128-136.
[36] MNIH V,KAVUKCUOGLU K,SILVER D,et al.Human-level control through deep reinforcement learning[J].Nature,2015,518:529-533.
[37] HASSELT H,GUEZ A,SILVER D.Deep reinforcement learning with double Q-learning[C]//Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence,2016:2094-2100.
[38] SCHAUL T,QUAN J,ANTONOGLOU I,et al.Prioritized experience replay[J].arXiv:1511.05952,2015.
[39] WANG Z,SCHAUL T,HESSEL M,et al.Dueling network architectures for deep reinforcement learning[C]//International Conference on Machine Learning(ICML),2016:1995-2003.
[40] HAUSKNECHT M,STONE P.Deep recurrent Q-learning for partially observable MDPs[J].arXiv:1507.06527,2015.
[41] WILLIAMS R J.Simple statistical gradient-following algorithms for connectionist reinforcement learning[J].Machine Learning,1992,8(3/4):229-256.
[42] BAXTER J,BARTLETT P L.Infinite-horizon policy-gradient estimation[J].Journal of Artificial Intelligence Research,2001,15(1):319-350.
[43] ZHAO T,NIU G,XIE N,et al.Regularized policy gradients:direct variance reduction in policy gradient estimation[C]//7th Asian Conference on Machine Learning(ACML),2015:333-348.
[44] VIEN N A,YU H,CHUNG T C.Hessian matrix distribution for Bayesian policy gradient reinforcement learning[J].Information Sciences,2011,181(9):1671-1685.
[45] XU T,LIU Q,PENG J.Stochastic variance reduction for policy gradient estimation[J].arXiv:1710.06034,2017.
[46] 程玉虎,冯焕婷,王雪松.基于参数探索的期望最大化策略搜索[J].自动化学报,2012,38(1):38-45. CHENG Y H,FENG H T,WANG X S.Expectation-maximization policy search with parameter-based exploration[J].Acta Automatica Sinica,2012,38(1):38-45.
[47] HACHIYA H,PETERS J,SUGIYAMA M.Reward-weighted regression with sample reuse for direct policy search in reinforcement learning[J].Neural Computation,2011,23(11):2798-2832.
[48] HWANG K S,LIN J L,SHI H,et al.Policy learning with human reinforcement[J].International Journal of Fuzzy Systems,2016,18(4):618-629.
[49] SCHULMAN J,LEVINE S,MORITZ P,et al.Trust region policy optimization[J].arXiv:1502.05477v5,2015.
[50] SILVER D,LEVER G,HEESS N,et al.Deterministic policy gradient algorithms[C]//International Conference on Machine Learning(ICML),2014:387-395.
[51] BARTO A G,SUTTON R S,ANDERSON C W.Neuronlike adaptive elements that can solve difficult learning control problems[J].IEEE Transactions on Systems,Man,and Cybernetics,1983(5):834-846.
[52] SUTTON R S.Temporal credit assignment in reinforcement learning[D].University of Massachusetts at Amherst,1984.
[53] ANDERSON C W.Learning and problem solving with multilayer connectionist systems[D].University of Massachusetts at Amherst,1986.
[54] GRONDMAN I,BUSONIU L,LOPES G A D,et al.A survey of actor-critic reinforcement learning:standard and natural policy gradients[J].IEEE Transactions on Systems,Man,and Cybernetics,Part C,2012,42(6):1291-1307.
[55] CHENG Y H,YI J Q,ZHAO D B.Application of actor-critic learning to adaptive state space construction[C]//Proceedings of 2004 International Conference on Machine Learning and Cybernetics,2004:2985-2990.
[56] WANG X S,CHENG Y H,YI J Q.A fuzzy Actor-Critic reinforcement learning network[J].Information Sciences,2007,177(18):3764-3781.
[57] BHATNAGAR S.An actor-critic algorithm with function approximation for discounted cost constrained Markov decision processes[J].Systems & Control Letters,2010,59(12):760-766.
[58] LEE D H,LEE J J.Incremental receptive field weighted actor-critic[J].IEEE Transactions on Industrial Informatics,2013,9(1):62-71.
[59] PETERS J,VIJAYAKUMAR S,SCHAAL S.Reinforcement learning for humanoid robotics[C]//Proceedings of the Third IEEE-RAS International Conference on Humanoid Robots,2003:1-20.
[60] 朱斐,朱海军,刘全,等.一种解决连续空间问题的真实在线自然梯度AC算法[J].软件学报,2018,29(2):267-282. ZHU F,ZHU H J,LIU Q,et al.True online natural Actor-Critic algorithm for the continuous space problem[J].Journal of Software,2018,29(2):267-282.
[61] 钟珊,刘全,傅启明,等.一种采用模型学习和经验回放加速的正则化自然行动器评判器算法[J].计算机学报,2019,42(3):82-103. ZHONG S,LIU Q,FU Q M,et al.A regularized natural AC algorithm with the acceleration of model learning and experience replay[J].Chinese Journal of Computers,2019,42(3):82-103.
[62] LILLICRAP T P,HUNT J J,PRITZEL A,et al.Continuous control with deep reinforcement learning[J].arXiv:1509.02971,2015.
[63] FUJIMOTO S,HOOF H,MEGER D.Addressing function approximation error in actor-critic methods[J].arXiv:1802.09477v3,2018.
[64] MNIH V,BADIA A P,MIRZA M,et al.Asynchronous methods for deep reinforcement learning[C]//International Conference on Machine Learning(ICML),2016:1928-1937.
[65] HAARNOJA T,ZHOU A,ABBEEL P,et al.Soft actor-critic:off-policy maximum entropy deep reinforcement learning with a stochastic actor[J].arXiv:1801.01290v2,2018.
[66] SCHULMAN J,WOLSKI F,DHARIWAL P,et al.Proximal policy optimization algorithms[J].arXiv:1707.06347v2,2017.
[67] WANG Y,LI X,ZHANG J,et al.Review of wheeled mobile robot collision avoidance under unknown environment[J].Science Progress,2021,104(3):00368504211037771.
[68] TAI L,LIU M.Mobile robots exploration through CNN-based reinforcement learning[J].Robotics and Biomimetics,2016,3(1):1-8.
[69] TAI L,LI S H,LIU M.A deep-network solution towards modeless obstacle avoidance[C]//IEEE/RSJ International Conference on Intelligent Robots and Systems.Piscataway,USA:IEEE,2016:2759-2764.
[70] ZHU Y,MOTTAGHI R,KOLVE E,et al.Target-driven visual navigation in indoor scenes using deep reinforcement learning[C]//2017 IEEE International Conference on Robotics and Automation(ICRA),2017:3357-3364.
[71] LEE H S,JEONG J.Mobile robot path optimization technique based on reinforcement learning algorithm in warehouse environment[J].Applied Sciences,2021,11(3):1209.
[72] SAMSANI S S,MUHAMMAD M S.Socially compliant robot navigation in crowded environment by human behavior resemblance using deep reinforcement learning[J].IEEE Robotics and Automation Letters,2021,6(3):5223-5230.
[73] DE JESUS J C,BOTTEGA J A,DE SOUZA LEITE M A,et al.Deep deterministic policy gradient for navigation of mobile robots[J].Journal of Intelligent & Fuzzy Systems,2021,40:349-361.
[74] CHU Z,SUN B,ZHU D,et al.Motion control of unmanned underwater vehicles via deep imitation reinforcement learning algorithm[J].IET Intelligent Transport Systems,2020,14(7):764-774.
[75] YOU S,DIAO M,GAO L,et al.Target tracking strategy using deep deterministic policy gradient[J].Applied Soft Computing,2020,95:106490.
[76] LIN X B,LIU J,YU Y,et al.Event-triggered reinforcement learning control for the quadrotor UAV with actuator saturation[J].Neurocomputing,2020,415:135-145.
[77] EVANGELOS P,FARHAD A,MA O,et al.Robotic manipulation and capture in space:a survey[J].Frontiers in Robotics and AI,2021,8:686723.
[78] HU Y Z,WANG W X,LIU H,et al.Reinforcement learning tracking control for robotic manipulator with kernel-based dynamic model[J].IEEE Transactions on Neural Networks and Learning Systems,2020,31(9):3570-3578.
[79] KIM K,HAN D K,PARK J H,et al.Motion planning of robot manipulators for a smoother path using a twin delayed deep deterministic policy gradient with hindsight experience replay[J].Applied Sciences,2020,10(2):575.
[80] LIN G C,ZHU L X,LI J H,et al.Collision-free path planning for a guava-harvesting robot based on recurrent deep reinforcement learning[J].Computers and Electronics in Agriculture,2021,188:106350.
[81] JIANG D,CAI Z Q,PENG H J,et al.Coordinated control based on reinforcement learning for dual-arm continuum manipulators in space capture missions[J].Journal of Aerospace Engineering,2021,34(6):04021087.
[82] WONG C C,CHIEN S Y,FENG H M,et al.Motion planning for dual-arm robot based on soft actor-critic[J].IEEE Access,2021,9:26871-26885.
[1] WEI Tingting, YUAN Weilin, LUO Junren, ZHANG Wanpeng. Survey of Opponent Modeling Methods and Applications in Intelligent Game Confrontation[J]. Computer Engineering and Applications, 2022, 58(9): 19-29.
[2] GAO Jingpeng, HU Xinyu, JIANG Zhiye. Unmanned Aerial Vehicle Track Planning Algorithm Based on Improved DDPG[J]. Computer Engineering and Applications, 2022, 58(8): 264-272.
[3] XU Jie, ZHU Yukun, XING Chunxiao. Research on Financial Trading Algorithm Based on Deep Reinforcement Learning[J]. Computer Engineering and Applications, 2022, 58(7): 276-285.
[4] ZHAO Shuxu, YUAN Lin, ZHANG Zhanping. Multi-agent Edge Computing Task Offloading[J]. Computer Engineering and Applications, 2022, 58(6): 177-182.
[5] DENG Xin, NA Jun, ZHANG Handuo, WANG Yulin, ZHANG Bin. Personalized Adjustment Method of Intelligent Lamp Based on Deep Reinforcement Learning[J]. Computer Engineering and Applications, 2022, 58(6): 264-270.
[6] CHEN Zhongyu, HAN Xie, XIE Jianbin, XIONG Fengguang, KUANG Liqun. Reinforcement Learning-Based Image Matching Method Under Double Loss Estimations[J]. Computer Engineering and Applications, 2022, 58(5): 240-246.
[7] XU Bo, ZHOU Jianguo, WU Jing, LUO Wei. Routing Optimization Method Based on DDPG and Programmable Data Plane[J]. Computer Engineering and Applications, 2022, 58(3): 143-150.
[8] SONG Haonan, ZHAO Gang, SUN Ruoying. Developments of Knowledge Reasoning Based on Deep Reinforcement Learning[J]. Computer Engineering and Applications, 2022, 58(1): 12-25.
[9] NIU Pengfei, WANG Xiaofeng, LU Lei, ZHANG Jiulong. Survey on Vehicle Reinforcement Learning in Routing Problem[J]. Computer Engineering and Applications, 2022, 58(1): 41-55.
[10] ZHOU Youhang, ZHAO Hanyun, LIU Hanjiang, LI Yuze, XIAO Yuqin. Self-Learning Gait Planning Method for Biped Robot Using DDPG[J]. Computer Engineering and Applications, 2021, 57(6): 254-259.
[11] WANG Xiao, TANG Lun, HE Xiaoyu, CHEN Qianbin. Multi-dimensional Resource Optimization of Service Function Chain Based on Deep Reinforcement Learning[J]. Computer Engineering and Applications, 2021, 57(4): 68-76.
[12] LAI Jun, WEI Jingyi, CHEN Xiliang. Overview of Hierarchical Reinforcement Learning[J]. Computer Engineering and Applications, 2021, 57(3): 72-79.
[13] MA Zhihao, ZHU Xiangbin. Research on Quasi-hyperbolic Momentum Gradient for Adversarial Deep Reinforcement Learning[J]. Computer Engineering and Applications, 2021, 57(24): 90-99.
[14] LI Baoshuai, YE Chunming. Job Shop Scheduling Problem Based on Deep Reinforcement Learning[J]. Computer Engineering and Applications, 2021, 57(23): 248-254.
[15] WANG Jun, CAO Lei, CHEN Xiliang, LAI Jun, ZHANG Legui. Overview on Reinforcement Learning of Multi-agent Game[J]. Computer Engineering and Applications, 2021, 57(21): 1-13.