Computer Engineering and Applications ›› 2019, Vol. 55 ›› Issue (22): 119-126. DOI: 10.3778/j.issn.1002-8331.1904-0238


Maximum Entropy Inverse Reinforcement Learning Based on Generative Adversarial Networks

CHEN Jianping, CHEN Qiqiang, FU Qiming, GAO Zhen, WU Hongjie, LU You   

1. College of Electronics and Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
  2. Jiangsu Province Key Laboratory of Intelligent Building Energy Efficiency, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009, China
Online: 2019-11-15    Published: 2019-11-13


Abstract: To address the slow learning rate of inverse reinforcement learning in the early stage of training, caused by the sparseness of expert samples, a maximum entropy inverse reinforcement learning algorithm based on Generative Adversarial Networks (GAN) is proposed. During learning, a GAN is trained and optimized on the expert samples to generate virtual expert samples. On this basis, non-expert samples are generated with a stochastic policy and a mixed sample set is constructed. The reward function is modeled with the maximum entropy probability model, and gradient descent is used to solve for the optimal reward function. Under this reward function, forward reinforcement learning is used to solve for the optimal policy; further non-expert samples are then generated, the mixed sample set is rebuilt, and the optimal reward function is solved iteratively. The proposed algorithm and the MaxEnt IRL algorithm are applied to the classic Object World and Mountain Car problems. Experiments show that the proposed algorithm recovers the reward function better when expert samples are sparse, and has better convergence performance.
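
For context on the "maximum entropy probability model" used to model the reward function, the standard MaxEnt IRL formulation assumes a linear reward R(τ) = θᵀf_τ over trajectory features (the linear parameterization is an assumption for illustration, not necessarily the paper's exact model):

```latex
P(\tau \mid \theta) = \frac{\exp\!\left(\theta^{\top} f_{\tau}\right)}{Z(\theta)},
\qquad
\nabla_{\theta} \mathcal{L}(\theta) = \tilde{f} - \sum_{\tau} P(\tau \mid \theta)\, f_{\tau}
```

where f_τ is the feature vector of trajectory τ, f̃ is the empirical expert feature expectation, and Z(θ) is the partition function. The gradient-based reward update in the abstract follows this gradient, with the mixed sample set supplying the samples used to estimate the second (model-expectation) term.

Below is a minimal, self-contained Python sketch of the loop the abstract describes. It is not the authors' implementation: the GAN generator is stood in for by a Gaussian fit to the expert features, the sampling policy is fixed rather than updated by forward reinforcement learning, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                        # trajectory feature dimension

# Sparse expert demonstrations, each reduced to a feature vector f_tau.
expert = rng.normal(loc=1.0, scale=0.2, size=(5, d))

# Step 1: densify the expert data with a generative model.
# Stand-in for the trained GAN generator: a Gaussian fit to expert features.
mu, sigma = expert.mean(axis=0), expert.std(axis=0) + 1e-3
virtual_expert = rng.normal(mu, sigma, size=(50, d))
expert_set = np.vstack([expert, virtual_expert])

theta = np.zeros(d)                          # linear reward R(tau) = theta . f_tau
lr = 0.1

for it in range(200):
    # Step 2: non-expert samples from a stochastic policy (fixed here),
    # mixed with the real + virtual expert samples.
    non_expert = rng.normal(loc=0.0, scale=1.0, size=(50, d))
    mixed = np.vstack([expert_set, non_expert])

    # Step 3: MaxEnt gradient = expert feature expectation minus model
    # feature expectation, the latter estimated by exp(R)-weighting the
    # mixed sample set (a self-normalized importance estimate).
    r = mixed @ theta
    w = np.exp(r - r.max())
    w /= w.sum()
    grad = expert_set.mean(axis=0) - w @ mixed
    theta += lr * grad                       # ascent on the log-likelihood

# Step 4 (omitted here): solve forward RL under the current reward to get a
# new policy, regenerate non-expert samples, rebuild the mixed set, repeat.
print("recovered reward weights:", np.round(theta, 2))
```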

Key words: Generative Adversarial Networks (GAN), inverse reinforcement learning, maximum entropy
